CN113870949A - Deep learning-based nanopore sequencing data base identification method - Google Patents

Deep learning-based nanopore sequencing data base identification method

Info

Publication number
CN113870949A
Authority
CN
China
Prior art keywords
base
sequence
data
base sequence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111172443.5A
Other languages
Chinese (zh)
Other versions
CN113870949B (en)
Inventor
汪国华
高文韬
邹权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Forestry University
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Northeast Forestry University
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Forestry University and Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111172443.5A
Publication of CN113870949A
Application granted
Publication of CN113870949B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K GRAPHICAL DATA READING; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for recognising patterns
    • G06K9/62 Methods or arrangements for pattern recognition using electronic means
    • G06K9/6267 Classification techniques
    • G06K9/6268 Classification techniques relating to the classification paradigm, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K GRAPHICAL DATA READING; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for recognising patterns
    • G06K9/62 Methods or arrangements for pattern recognition using electronic means
    • G06K9/6267 Classification techniques
    • G06K9/6268 Classification techniques relating to the classification paradigm, e.g. parametric or non-parametric approaches
    • G06K9/6277 Classification techniques relating to the classification paradigm, e.g. parametric or non-parametric approaches based on a parametric (probabilistic) model, e.g. based on Neyman-Pearson lemma, likelihood ratio, receiver operating characteristic [ROC] curve plotting a false acceptance rate [FAR] versus a false reject rate [FRR]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Computing arrangements based on biological models using neural network models
    • G06N3/04 Architectures, e.g. interconnection topology
    • G06N3/0454 Architectures, e.g. interconnection topology using a combination of multiple neural nets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Computing arrangements based on biological models using neural network models
    • G06N3/04 Architectures, e.g. interconnection topology
    • G06N3/0472 Architectures, e.g. interconnection topology using probabilistic elements, e.g. p-rams, stochastic processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Computing arrangements based on biological models using neural network models
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Abstract

A deep learning-based base identification method for nanopore sequencing data relates to the field of bioinformatics and addresses the low accuracy of nanopore sequencing in the prior art. The method comprises the following steps. Step one: download 50 groups of raw nanopore data, covering Klebsiella pneumoniae, Enterobacteriaceae and Proteobacteria, as a training set. Step two: perform base recognition on the 50 groups of raw data to obtain base sequences. Step three: obtain Illumina sequencing sequences with an accuracy above 99%, use them as the reference genome, treat the reference genome as the ground truth, and correct the base sequences using the Tombo algorithm. Step four: convert the corrected base sequences into the corresponding electrical signal data using the Re-squiggle method, and then label the electrical signal data. Step five: train a neural network using the labeled electrical signal data and the raw data, and perform base recognition using the trained neural network. The method achieves high-accuracy recognition of the base sequence of nanopore sequencing data.

Description

Deep learning-based nanopore sequencing data base identification method
Technical Field
The invention relates to the field of bioinformatics, in particular to a deep learning-based nanopore sequencing data base identification method.
Background
Compared with second-generation sequencers and the third-generation sequencers from PacBio, the third-generation nanopore sequencer from Oxford Nanopore Technologies offers portability, low cost and long sequencing reads. However, the accuracy of nanopore sequencing is much lower than that of second-generation sequencing and of PacBio's HiFi sequencing. The official base recognition tool achieves an accuracy of only about 90% and is not open source. The nanopore of the sequencer is essentially a nanoscale protein pore with voltage detection devices on both sides. In operation, primers are used to pull single-stranded DNA/RNA through the nanopore, and different types of nucleotides cause different current changes as they pass through. The sequencer records all of these current changes, and the electrical signal is then translated into the corresponding base sequence. Because nanopore sequencing is single-molecule sequencing, the accuracy of base recognition is strongly affected by noise and random errors. The output data of the nanopore sequencer are of two kinds, fasta and fast5. The fasta file holds the gene sequence produced by the official base recognition tool (Guppy), with an accuracy of about 90%. The fast5 file holds the raw electrical signal acquired by the sequencer. Taking the official tool Guppy R9.4 as an example, 5 bases pass through the nanopore at a time, so there are 4^5 = 1024 possible sequences. Base modifications complicate this further. A currently known base modification is 5mC; if 5mC is treated as a fifth class of base signal alongside A, C, G and T, the 5 bases passing through the nanopore at one time have 5^5 = 3125 possible sequences.
Moreover, nucleotides and nanopores are nanoscale molecular structures, and the official base recognition tool cannot predict the true base sequence well from the electrical signal. This is a major factor limiting the accuracy of nanopore sequencing. It is therefore highly necessary to construct a model using deep learning methods that reliably predicts base sequences from raw nanopore sequencing data.
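The sequence counts in the Background follow from simple counting: with an alphabet of a symbols and k bases in the pore at once, there are a^k possible sequences, giving 4^5 = 1024 for the canonical bases and 5^5 = 3125 when 5mC is treated as a fifth symbol. A minimal sketch of that arithmetic follows; the function name and the letter `M` standing in for 5mC are illustrative assumptions, not part of the patent.

```python
from itertools import product


def count_kmers(alphabet: str, k: int) -> int:
    """Number of distinct length-k sequences over the given alphabet."""
    return len(alphabet) ** k


canonical = "ACGT"   # the four canonical bases
with_5mc = "ACGTM"   # 'M' is a hypothetical symbol standing in for 5mC

print(count_kmers(canonical, 5))  # 4**5
print(count_kmers(with_5mc, 5))   # 5**5

# Enumerating the sequences explicitly gives the same count as the formula.
enumerated = sum(1 for _ in product(canonical, repeat=5))
print(enumerated == count_kmers(canonical, 5))
```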
Disclosure of Invention
The purpose of the invention is: aiming at the problem of low accuracy of the nanopore sequencing in the prior art, a deep learning-based nanopore sequencing data base identification method is provided.
The technical scheme adopted by the invention to solve the technical problems is as follows:
The deep learning-based base identification method for nanopore sequencing data comprises the following steps:
step one: downloading 50 groups of raw nanopore data, including Klebsiella pneumoniae, Enterobacteriaceae and Proteobacteria, as a training set;
step two: performing base recognition on the 50 groups of raw data to obtain a base sequence;
step three: acquiring an Illumina sequencing sequence with an accuracy of more than 99%, using it as the reference genome, taking the reference genome as the ground truth, and correcting the base sequence using the Tombo algorithm;
step four: converting the corrected base sequence into the corresponding electrical signal data using the Re-squiggle method, and then labeling the electrical signal data;
step five: training a neural network using the labeled electrical signal data and the raw data, and performing base recognition using the trained neural network.
Further, the neural network comprises a first convolutional layer, a second convolutional layer, a BERT module, a fully connected layer and a CTC decoding module;
the first convolutional layer is used for down-sampling the labeled electrical signal data,
the second convolutional layer is used for extracting features from the down-sampled electrical signal data,
a BN layer is arranged after each of the first and second convolutional layers and is used for preventing the mean and variance from saturating,
the BERT module is trained on the extracted features and outputs the base sequence corresponding to the electrical signal data,
the fully connected layer processes the base sequences corresponding to the electrical signal data using a softmax function to obtain the probability of each base sequence corresponding to the raw electrical signal,
the CTC decoding module processes the probability of each base sequence corresponding to the raw electrical signal to obtain the final base sequence,
the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the BERT module comprises 12 Transformer layers, a 768-dimensional embedding hidden layer and 12-head attention layers.
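The hyperparameters listed above determine how much the two convolutional layers shorten the signal before it reaches the BERT module. The sketch below records them and computes the resulting sequence length. It assumes no padding, which the text does not specify, and all class and function names are hypothetical. The text also does not say how the 128 convolution channels are mapped to the 768-dimensional hidden size; a linear projection would be one option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvSpec:
    kernel: int = 3          # 1 x 3 kernel (from the text)
    stride: int = 2          # 1 x 2 stride (from the text)
    out_channels: int = 128  # output channels (from the text)
    padding: int = 0         # assumption: padding is not stated in the text

    def out_len(self, n: int) -> int:
        """Output length of a 1-D convolution over n input samples."""
        return (n + 2 * self.padding - self.kernel) // self.stride + 1


@dataclass(frozen=True)
class BertSpec:
    layers: int = 12   # Transformer layers (from the text)
    hidden: int = 768  # hidden size (from the text)
    heads: int = 12    # attention heads (from the text)

    @property
    def head_dim(self) -> int:
        """Per-head dimension when the hidden size is split evenly."""
        return self.hidden // self.heads


def length_after_convs(n: int, convs=(ConvSpec(), ConvSpec())) -> int:
    """Sequence length after applying the convolutional layers in order."""
    for conv in convs:
        n = conv.out_len(n)
    return n


# A hypothetical chunk of 4000 raw signal samples shrinks roughly fourfold.
print(length_after_convs(4000))
print(BertSpec().head_dim)
```

Under these assumptions the two stride-2 layers reduce the number of positions the attention layers must process by about a factor of four.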
Further, the features of the labeled electrical signal data are obtained by a convolution in which c denotes the sequencing data, x_c denotes the corresponding feature of the sequencing data, ω is the weight of the convolution kernel, the kernel parameter k is set to 3, i and j are position indices in the sequence, T is the length of the sequence, and Σ denotes summation.
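The convolution formula itself is not reproduced in the text. One plausible reading of the symbols, assuming a standard strided 1-D convolution without padding, is $y_i = \sum_{j=0}^{k-1} \omega_j \, x_{s \cdot i + j}$ with kernel size k = 3 and stride s = 2. The sketch below implements that assumed form in plain Python; the function name and the kernel values are illustrative only.

```python
def conv1d(x, w, stride=2):
    """Valid (no padding) 1-D convolution in cross-correlation form.

    Each output is the weighted sum of len(w) consecutive inputs,
    and the window advances by `stride` samples per output.
    """
    k = len(w)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        out.append(sum(w[j] * x[start + j] for j in range(k)))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # toy signal, T = 7
weights = [0.5, 0.0, -0.5]                    # hypothetical kernel, k = 3

features = conv1d(signal, weights)
print(features)  # three outputs: windows start at positions 0, 2 and 4
```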
Further, the BN layer is expressed as:
$$\mathrm{BN}(x_{bn}) = \alpha \cdot \frac{x_{bn} - E[x_{bn}]}{\sqrt{\mathrm{Var}[x_{bn}] + \epsilon}} + \beta$$
where $\alpha$, $\beta$ and $\epsilon$ are model parameters, $x_{bn}$ is the sequence feature output by the convolutional layer, E is the expectation function, and Var is the variance function.
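Assuming the standard batch-normalization form, $\alpha \cdot (x - E[x]) / \sqrt{\mathrm{Var}[x] + \epsilon} + \beta$, the computation can be sketched in plain Python as follows. The default parameter values are illustrative assumptions; in training, α and β would be learned.

```python
import math


def batch_norm(xs, alpha=1.0, beta=0.0, eps=1e-5):
    """Normalize xs to zero mean and unit variance, then scale and shift."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased variance
    return [alpha * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```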
Further, the softmax function is expressed as:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$
where $z_i$ is the output value of the i-th node, C is the number of classification categories, e is the base of the natural logarithm (a mathematical constant), and $z_c$ is the output value of the c-th node.
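The softmax function described above can be sketched in a few lines. The version below subtracts the maximum before exponentiating, a common numerical-stability step that leaves the result unchanged. The choice of five classes in the example (A, C, G, T and a blank) is an assumption for illustration.

```python
import math


def softmax(z):
    """Map raw node outputs z to probabilities that sum to 1."""
    m = max(z)  # subtracting the max avoids overflow and cancels in the ratio
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs for one position over 5 classes: blank, A, C, G, T.
probs = softmax([0.1, 2.0, 0.3, 0.2, 0.1])
print(probs)
print(sum(probs))
```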
Further, the CTC decoding module specifically performs the following steps:
for the predicted sequence output by the BERT layer, candidate base sequences are first generated iteratively using a beam search algorithm with a beam width of 3; the candidates are then scored, blank characters and redundant characters in the base sequence are removed, and the base sequence with the highest score is selected as the final prediction result,
the probability of a base sequence, taking blank characters into account, is:
$$P(l \mid x) = \sum_{\pi \in \beta^{-1}(l)} P(\pi \mid x)$$
where x is the output sequence of the BERT layer, $\pi$ is a path corresponding to an intermediate result, $\beta^{-1}(l)$ is the set of all paths satisfying the condition during the search, l is the output result, and $P(l \mid x)$ is the probability of the output result with blank characters taken into account,
the CTC loss function is expressed using this probability and is minimized in the log domain:
$$L = -\ln P(l \mid x) = -\ln \sum_{\pi \in \beta^{-1}(l)} P(\pi \mid x)$$
where ln(·) denotes the natural logarithm.
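The CTC steps above can be illustrated on a toy example. The sketch below shows the collapse mapping (merge repeats, then drop blanks), a brute-force computation of the sequence probability as a sum over all paths that collapse to it, the negative log-likelihood loss, and a simplified beam search with width 3. It is a teaching sketch under stated assumptions, not the patented decoder: the blank symbol `-`, the function names and the probabilities are hypothetical, and practical decoders use prefix beam search rather than path-level beams.

```python
import math
from itertools import product

BLANK = "-"  # symbol chosen here for the CTC blank (an assumption)


def collapse(path):
    """CTC mapping: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)


def label_prob(label, probs, alphabet):
    """Brute-force P(l|x): sum over every path that collapses to `label`."""
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(probs)):
        if collapse(alphabet[i] for i in path) == label:
            total += math.prod(probs[t][i] for t, i in enumerate(path))
    return total


def ctc_loss(label, probs, alphabet):
    """Negative log-likelihood of `label` under the path-sum probability."""
    return -math.log(label_prob(label, probs, alphabet))


def beam_search(probs, alphabet, width=3):
    """Keep the `width` most probable paths per step, then collapse and merge."""
    beams = [((), 1.0)]
    for step in probs:
        cand = [(path + (s,), p * step[i])
                for path, p in beams for i, s in enumerate(alphabet)]
        cand.sort(key=lambda c: c[1], reverse=True)
        beams = cand[:width]
    scores = {}
    for path, p in beams:
        seq = collapse(path)
        scores[seq] = scores.get(seq, 0.0) + p
    return max(scores.items(), key=lambda kv: kv[1])


alphabet = [BLANK, "A", "C"]
probs = [[0.1, 0.6, 0.3],   # per-symbol probabilities at time step 1
         [0.5, 0.4, 0.1]]   # per-symbol probabilities at time step 2

best_seq, best_score = beam_search(probs, alphabet, width=3)
print(best_seq, best_score)
print(label_prob("A", probs, alphabet), ctc_loss("A", probs, alphabet))
```

In this example the beam keeps the paths `A-`, `AA` and `C-`; the first two collapse to the same sequence, so their probabilities are merged before the best candidate is chosen.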
Further, the base recognition of the 50 groups of raw data in step two is performed by the base recognition tool Guppy.
The invention has the beneficial effects that:
(1) The invention uses a better-performing deep neural network model, brings ideas from natural language processing to the base recognition of nanopore sequencing data, and performs better than the official base recognition tool.
(2) The invention provides a sound basis for genomics research; high-accuracy base recognition benefits the analysis of downstream genome data.
(3) The model of the invention generalizes well and is suitable for base recognition of nanopore sequencing data from many species, including microorganisms, plants and animals.
The method uses convolutional layers to down-sample the nanopore electrical signal data and extract features, uses a BERT module to predict the base sequence corresponding to the electrical signal, and uses the CTC algorithm to remove redundant data, thereby achieving high-accuracy recognition of the base sequence of nanopore sequencing data.
Drawings
FIG. 1 is a flow chart of a method for base recognition of nanopore sequencing data based on a deep neural network model according to an embodiment of the present application;
FIG. 2 is a diagram illustrating the effect of the deep neural network model of the present application;
FIG. 3 is a graphical representation of the comparison of the accuracy of the present application with official base recognition tools and Guppy-KP on the test set;
FIG. 4 is schematic diagram 1 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 5 is schematic diagram 2 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 6 is schematic diagram 3 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 7 is schematic diagram 4 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 8 is schematic diagram 5 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 9 is schematic diagram 6 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 10 is schematic diagram 7 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 11 is schematic diagram 8 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 12 is schematic diagram 9 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;
FIG. 13 is a graph showing the comparison of error rates on a test set of 9 species for the present application and the official base recognition tool.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
The first embodiment: the deep learning-based base recognition method for nanopore sequencing data according to this embodiment is described in detail with reference to FIG. 1.
As shown in FIG. 1, the method comprises the following steps S1-S8:
S1, downloading 50 groups of raw nanopore data, including Klebsiella pneumoniae, Enterobacteriaceae and Proteobacteria, together with sequencing data of 9 other species, to form a data set.
The 50 groups of raw nanopore sequencing data are used as the training set of the model, and the gene sequences of the other 9 species are used as the test set.
S2, performing base recognition on the 50 groups of raw data using the official nanopore base recognition tool Guppy.
Guppy converts the nanopore signals of unknown sequence into base sequences, which are used to find the corresponding reference genomes obtained by second-generation sequencing.
S3, using the Illumina sequencing sequence as the reference genome, correcting the Guppy-processed base sequence with the Tombo algorithm, and annotating the corrected sequence with a dynamic time warping algorithm.
S4, converting the true DNA sequence into the true electrical signal using the Re-squiggle method, and generating labeled data in (base sequence, electrical signal) format.
S5, constructing a neural network model based on a convolutional neural network and a BERT network, the model comprising two convolutional layers, a BERT module, a fully connected layer and a CTC decoding module. The convolution module extracts features from the input sequence; the two convolutional layers preprocess the input sequence data and extract features as follows:
S51, the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels; it down-samples the data and reduces computational complexity.
S52, the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels; it is used for feature extraction. In the convolution applied to the input signal vector x, i and j are position indices in the sequence, and $x_{i+j}$ denotes the sequence feature extracted by the convolution at each position, which serves as the input of the subsequent BERT layer.
S53, a Batch Normalization (BN) layer follows each convolution module; it prevents the mean and variance from saturating and improves the generalization performance of the model. The calculation formula is as follows:
$$\mathrm{BN}(x_{bn}) = \alpha \cdot \frac{x_{bn} - E[x_{bn}]}{\sqrt{\mathrm{Var}[x_{bn}] + \epsilon}} + \beta$$
where $\alpha$, $\beta$ and $\epsilon$ are model parameters, $x_{bn}$ is the sequence feature output by the convolutional layer, E is the expectation function, and Var is the variance function.
S6, inputting the features extracted after down-sampling into the BERT module for training; after processing by the fully connected layer, the probability of each base sequence corresponding to the raw nanopore electrical signal is output. The BERT module contains 12 Transformer layers, a 768-dimensional embedding hidden layer and 12-head attention layers. It is followed by a fully connected layer, and the probability of the base at each position is calculated using the softmax function:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$
where $z_i$ is the output value of the i-th node, C is the number of classification categories, e is the base of the natural logarithm (a mathematical constant), and $z_c$ is the output value of the c-th node.
S7, removing repeated bases and blanks using the CTC decoding module, and finally outputting the high-accuracy nanopore base sequence. The CTC loss function measures the high-order feature distribution distance between the raw nanopore electrical signal and the base sequence. The CTC decoder iteratively generates candidate base sequences using a beam search algorithm with a beam width of 3 and then scores the candidates. Blank characters are removed in the process, and the base sequence with the highest score is selected as the final prediction result, yielding a high-accuracy base sequence from the raw nanopore signal data. The probability of a base sequence, taking blank characters into account, and the CTC loss function are as follows:
$$p(z \mid x) = \sum_{\pi \in \beta^{-1}(z)} p(\pi \mid x)$$
$$L(S) = -\ln \prod_{(x,z) \in S} p(z \mid x) = -\sum_{(x,z) \in S} \ln p(z \mid x)$$
where x is the input sequence, $\pi$ is a path of bases explored by the beam search, z is the output sequence, and S is the set of (input, output) training pairs.
S8, using the trained prediction model to convert the raw electrical signal of nanopore sequencing into a base sequence with higher accuracy than that of the official tool.
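Step S3 mentions a dynamic time warping (DTW) algorithm for relating the corrected sequence to the signal. As a generic illustration of DTW, and not of Tombo's actual implementation, the sketch below aligns two numeric series by dynamic programming and backtracks the alignment path. The function name, the absolute-difference cost and the toy values are assumptions.

```python
def dtw(a, b):
    """Align series a and b; return (total cost, list of (i, j) index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] is the minimal cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(D[i - 1][j - 1], i - 1, j - 1),
                 (D[i - 1][j], i - 1, j),
                 (D[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return D[n][m], path


# Toy example: the two series trace the same levels with different dwell times.
observed = [1.0, 1.0, 2.0, 3.0]   # hypothetical measured signal levels
expected = [1.0, 2.0, 2.0, 3.0]   # hypothetical expected levels
cost, path = dtw(observed, expected)
print(cost)
print(path)
```

Each pair in the returned path says which observed sample is matched to which expected level, which is the kind of signal-to-sequence assignment that the labeled (base sequence, electrical signal) data in S4 relies on.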
The recognition effect of the present invention is further described below with a set of specific experimental examples.
First, to evaluate the performance of base recognition tools, we compared 4 base recognition tools, including our model, on the same data set. Here, "Our method" denotes the deep neural network model of the invention, Guppy and Albacore are the official base recognition tools of Oxford Nanopore Technologies, and Guppy-KP is a model retrained from the official one.
Table one shows the error rates on the test set for 4 tools including our method.
Here deletion, insertion and mismatch denote the deletion errors, insertion errors and mismatch errors of the sequencing data. The base recognition accuracy is defined in terms of M, the number of matched bases; S, the number of mismatched bases; I, the number of inserted bases; and D, the number of deleted bases. On the Klebsiella pneumoniae NUH29 data set, the error rate of the method of the invention was 11.06%, lower than that of the other base recognition tools. On the Klebsiella pneumoniae KSB2 data set, the error rates of the method of the invention, Albacore and Guppy were 11.26%, 15.80% and 15.73%, respectively, so the method of the invention had a lower error rate than the official base recognition tools.
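The accuracy formula itself is not reproduced in the text. One common definition consistent with the four quantities M, S, I and D divides the number of matched bases by the total alignment length, $\text{accuracy} = M / (M + S + I + D)$, with the error rate as its complement. The sketch below assumes that definition; the function names and the counts are hypothetical and are not the patent's experimental data.

```python
def accuracy(m: int, s: int, i: int, d: int) -> float:
    """Fraction of alignment columns that are matches (assumed definition)."""
    return m / (m + s + i + d)


def error_rates(m: int, s: int, i: int, d: int) -> dict:
    """Per-type error rates over the same alignment length."""
    total = m + s + i + d
    return {"mismatch": s / total, "insertion": i / total, "deletion": d / total}


# Hypothetical alignment counts for one read.
acc = accuracy(m=900, s=40, i=30, d=30)
rates = error_rates(m=900, s=40, i=30, d=30)
print(acc)
print(rates)
```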
Second, we used genome assembly consistency as an index of model performance. FIG. 4 shows the consensus sequence identity of the 4 base recognition tools, including that of the present invention. We used 6 indicators to evaluate model performance: homopolymer insertion errors, other insertion errors, homopolymer deletion errors, other deletion errors, substitution errors and Dcm errors.
Performance evaluation on the test set shows that the invention outperforms the official base recognition tools in both base recognition error rate and genome assembly consistency.
It should be noted that the detailed description only explains and illustrates the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. All modifications and variations made within the scope of the claims and the description are intended to fall within the scope of protection of the invention.

Claims (7)

1. A deep learning-based base identification method for nanopore sequencing data, characterized by comprising the following steps:
step one: downloading 50 groups of raw nanopore data, including Klebsiella pneumoniae, Enterobacteriaceae and Proteobacteria, as a training set;
step two: performing base recognition on the 50 groups of raw data to obtain a base sequence;
step three: acquiring an Illumina sequencing sequence with an accuracy of more than 99%, using it as the reference genome, taking the reference genome as the ground truth, and correcting the base sequence using the Tombo algorithm;
step four: converting the corrected base sequence into the corresponding electrical signal data using the Re-squiggle method, and then labeling the electrical signal data;
step five: training a neural network using the labeled electrical signal data and the raw data, and performing base recognition using the trained neural network.
2. The deep learning-based nanopore sequencing data base identification method of claim 1, wherein the neural network comprises a first convolutional layer, a second convolutional layer, a BERT module, a fully connected layer and a CTC decoding module;
the first convolutional layer is used for down-sampling the labeled electrical signal data,
the second convolutional layer is used for extracting features from the down-sampled electrical signal data,
a BN layer is arranged after each of the first and second convolutional layers and is used for preventing the mean and variance from saturating,
the BERT module is trained on the extracted features and outputs the base sequence corresponding to the electrical signal data,
the fully connected layer processes the base sequences corresponding to the electrical signal data using a softmax function to obtain the probability of each base sequence corresponding to the raw electrical signal,
the CTC decoding module processes the probability of each base sequence corresponding to the raw electrical signal to obtain the final base sequence,
the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,
the BERT module comprises 12 Transformer layers, a 768-dimensional embedding hidden layer and 12-head attention layers.
3. The deep learning-based nanopore sequencing data base identification method of claim 2, wherein the features of the labeled electrical signal data are obtained by a convolution in which c denotes the sequencing data, x_c denotes the corresponding feature of the sequencing data, ω is the weight of the convolution kernel, the kernel parameter k is set to 3, i and j are position indices in the sequence, T is the length of the sequence, and Σ denotes summation.
4. The deep learning-based nanopore sequencing data base identification method of claim 3, wherein the BN layer is expressed as:
$$\mathrm{BN}(x_{bn}) = \alpha \cdot \frac{x_{bn} - E[x_{bn}]}{\sqrt{\mathrm{Var}[x_{bn}] + \epsilon}} + \beta$$
where $\alpha$, $\beta$ and $\epsilon$ are model parameters, $x_{bn}$ is the sequence feature output by the convolutional layer, E is the expectation function, and Var is the variance function.
5. The deep learning-based nanopore sequencing data base identification method of claim 4, wherein the softmax function is expressed as:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$
where $z_i$ is the output value of the i-th node, C is the number of classification categories, e is the base of the natural logarithm (a mathematical constant), and $z_c$ is the output value of the c-th node.
6. The deep learning-based nanopore sequencing data base identification method of claim 5, wherein the CTC decoding module specifically performs the following steps:
for the predicted sequence output by the BERT layer, candidate base sequences are first generated iteratively using a beam search algorithm with a beam width of 3; the candidates are then scored, blank characters and redundant characters in the base sequence are removed, and the base sequence with the highest score is selected as the final prediction result,
the probability of a base sequence, taking blank characters into account, is:
$$P(l \mid x) = \sum_{\pi \in \beta^{-1}(l)} P(\pi \mid x)$$
where x is the output sequence of the BERT layer, $\pi$ is a path corresponding to an intermediate result, $\beta^{-1}(l)$ is the set of all paths satisfying the condition during the search, l is the output result, and $P(l \mid x)$ is the probability of the output result with blank characters taken into account,
the CTC loss function is expressed using this probability and is minimized in the log domain:
$$L = -\ln P(l \mid x) = -\ln \sum_{\pi \in \beta^{-1}(l)} P(\pi \mid x)$$
where ln(·) denotes the natural logarithm.
7. The deep learning-based nanopore sequencing data base identification method of claim 1, wherein the base recognition of the 50 groups of raw data in step two is performed by the base recognition tool Guppy.
CN202111172443.5A 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method Active CN113870949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111172443.5A CN113870949B (en) 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111172443.5A CN113870949B (en) 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method

Publications (2)

Publication Number Publication Date
CN113870949A 2021-12-31
CN113870949B 2022-05-17

Family

ID=79002054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111172443.5A Active CN113870949B (en) 2021-10-08 2021-10-08 Deep learning-based nanopore sequencing data base identification method

Country Status (1)

Country Link
CN (1) CN113870949B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243674A (en) * 2020-01-08 2020-06-05 华南理工大学 Method, device and storage medium for identifying base sequence
US20200303038A1 (en) * 2019-03-19 2020-09-24 The University Of Hong Kong Variant calling in single molecule sequencing using a convolutional neural network
CN112183486A (en) * 2020-11-02 2021-01-05 中山大学 Method for rapidly identifying single-molecule nanopore sequencing base based on deep network
CN113361522A (en) * 2021-06-23 2021-09-07 北京百度网讯科技有限公司 Method and device for determining character sequence and electronic equipment
US20210312906A1 (en) * 2020-04-07 2021-10-07 International Business Machines Corporation Leveraging unpaired text data for training end-to-end spoken language understanding systems

Also Published As

Publication number Publication date
CN113870949B (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant