CN114187963A - Prediction method of protein binding nucleotide sites on full-length circular RNA - Google Patents
- Publication number
- CN114187963A (application CN202111501583.2A)
- Authority
- CN
- China
- Prior art keywords
- circrna
- full
- length
- binding
- nucleotide
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A method for predicting protein-binding nucleotide sites on full-length circRNA. The full-length circRNA is cut into fragments, which are input into a one-dimensional CNN. The resulting local high-level abstract features are fed into a two-branch network consisting of a BiGRU and a Transformer encoder, which yield, respectively, a long-range dependency representation of the input and a global-attention-based representation of the circRNA sequence. The two representations are concatenated and input into an MLP classifier. Median filtering then uses the binding information of adjacent nucleotides to remove falsely predicted binding nucleotides and reduce the false positive rate, a score binarization strategy yields the predicted binding nucleotides, and integrated gradients identify the key sequence content, giving the predicted binding motifs between the full-length circRNA and the RBP. The invention can explore RBP binding on full-length circRNA at nucleotide resolution, accurately predict RBP-binding nucleotides, and detect their binding motifs.
Description
Technical Field
The invention relates to a technology in the field of bioengineering, in particular to a method for predicting protein-binding nucleotide sites on full-length circular RNA based on interpretable Transformer deep learning.
Background
Circular RNA (circRNA) interacts with RNA-binding proteins (RBPs) to regulate gene expression and perform biological functions. For example, the circRNA ciRS-7 functions as a miRNA-7 sponge by binding to AGO proteins. RBPs also play a crucial role in many other biological processes. In CRISPR/Cas9 genome editing, the guide RNA (gRNA) binds the Cas9 protein for regulation, activating endonuclease activity on the DNA target. The protein PEG10 binds and packages RNA, acting as an RNA delivery vehicle. Furthermore, the interaction between circCDYL and RBPs affects cancer pathways associated with bladder cancer [5]. Identifying the interactions between circRNAs and RBPs therefore gives a deeper understanding of the functions of both, and helps reveal the mechanisms behind disease.
With the development of high-throughput sequencing technology, many circRNAs together with their binding targets, including RBPs, have been collected. Based on the accumulated binding targets, RBP-specific machine learning methods have been proposed to predict RBP binding sites on linear RNAs and circRNAs; they typically train one model per RBP. For linear RNA, GraphProt first encodes the sequence and its predicted structure into a graph, which is input into a support vector machine for RBP binding site classification. iONMF uses orthogonal matrix factorization to integrate multiple data sources to predict RBP binding sites. DeepBind trains convolutional neural networks (CNNs) to infer the binding sequence preferences of RBPs. IDEEP [26] further employs a hybrid of CNN and long short-term memory (LSTM) networks to learn RBP binding sequence and structure preferences simultaneously, where the LSTM captures long-term dependencies between sequences and structures. Similarly, DeepCLIP trains a hybrid CNN and LSTM model to predict the effect of mutations on RNA-protein interactions. For circRNA, CRIP designed a stacked codon-based encoding scheme with a hybrid network of CNN and LSTM to predict RBP binding sites on circRNA. PASSION designed an ensemble neural network to identify RBP binding sites on circRNA. iCircRBP-DHN constructs a deep network that predicts RBP binding sites on circRNA using both embeddings and k-tuple nucleotide frequencies. iDeepC uses a twin (Siamese) network to handle poorly characterized RBPs that have only a few bound circRNA targets. Furthermore, CNN-based models can extract binding motifs from the learned convolution kernels, but these motifs do not provide nucleotide-level interpretation. Since training deep learning models is very time-consuming, some easy-to-use online web servers have been developed, such as RBPsuite and DeepCLIP, which make methods such as CRIP and IDEEP available as online services.
Existing methods provide only relatively low-resolution binding regions spanning a predefined number of nucleotides. They are primarily concerned with predicting whether an RNA fragment contains an RBP binding site, where the fragments are subsequences centered on a binding site or binding nucleotide. For example, GraphProt constructs training fragments by extending the binding site by 150 nucleotides both upstream and downstream. iONMF collects training fragments with a fixed length of 101 bp by extending the binding nucleotide by 50 nucleotides in both directions. However, models trained on such data predict enriched binding regions at a resolution of approximately 100 bp and cannot accurately locate the binding nucleotides on a full-length circRNA. A trained model can still be applied to a full-length circRNA by first scanning it into fragments with a fixed-size sliding window and then predicting a binding score for each fragment. However, this strategy suffers from a high false positive rate and treats each nucleotide independently when making a decision, ignoring the binding information of neighboring nucleotides. Computational tools that model RBP binding accurately at the nucleotide level are therefore needed.
Disclosure of Invention
The prior art has two shortcomings: computational prediction of RBP binding sites on circRNA focuses on circRNA fragments rather than full-length circRNA, and existing methods can determine neither the specific positions of the binding sites nor how much of a full-length circRNA is covered by protein-binding regions. To address them, the invention provides a method for predicting protein-binding nucleotide sites on full-length circRNA. An interpretable deep learning model (CircSite) predicts the RBP binding landscape on full-length circRNA at nucleotide resolution, from which RBP binding labels are further inferred for circRNA regions (several consecutive nucleotides) and for the full-length circRNA (a circRNA transcript). The method outperforms the current best method in predicting binding nucleotides and in capturing binding sequence motifs on full-length circRNA.
The invention is realized by the following technical scheme:
The invention relates to a method for predicting protein-binding nucleotide sites on full-length circRNA based on interpretable Transformer deep learning. The full-length circRNA is cut into fragments, which are input into a one-dimensional CNN. The resulting local high-level abstract features are fed into a two-branch network consisting of a BiGRU and a Transformer encoder, which yield, respectively, a long-range dependency representation of the input and a global-attention-based representation of the circRNA sequence. The two representations are concatenated and input into an MLP classifier. Median filtering then uses the binding information of adjacent nucleotides to remove falsely predicted binding nucleotides and reduce the false positive rate, a score binarization strategy yields the predicted binding nucleotides, and integrated gradients identify the key sequence content, giving the predicted binding motifs between the full-length circRNA and the RBP and thereby identifying the nucleotide sites.
The invention also relates to a prediction system implementing the method, comprising a data processing unit, a feature extraction unit, a two-branch network unit, a classification layer unit and a prediction post-processing unit, wherein: the data processing unit applies a sliding window to the input full-length circRNA to obtain circRNA fragments of equal length and encodes them as one-hot matrices; the feature extraction unit applies one-dimensional convolution to the encoded fragments to obtain word vectors representing the sequence; the two-branch network unit processes the extracted features with the BiGRU and Transformer branches to obtain the sequence representations learned by the two branches; the classification layer unit applies an MLP classifier to the output of the two-branch network to obtain the probability that a nucleotide interacts with the RBP; and the prediction post-processing unit generates the binding motifs between the full-length circRNA and the RBP from the obtained probabilities.
Technical effects
Whereas existing techniques make low-resolution predictions of whether an RNA fragment interacts with an RBP, the invention predicts interactions with the RBP at nucleotide resolution. A window slides along the full-length circRNA and each window takes the label of the nucleotide at its center, which forces the network to learn whether an individual nucleotide interacts with the RBP. The deep learning model (CircSite) takes seven nucleotides as a block, maps each block to a vector by one-dimensional convolution, and inputs the vectors into the BiGRU and Transformer branches. The local representation learned by the BiGRU and the global representation learned by the Transformer are concatenated as the input for predicting RBP-binding nucleotides. The predicted binding scores of the individual nucleotides on the full-length circRNA are median filtered to obtain a smoothed prediction signal, and a binarization strategy then determines whether each nucleotide interacts with the RBP. The binding motifs of the RBP are mined and visualized using integrated gradients, and the detected motifs agree well with known motifs. CircSite can predict not only the exact RBP-binding nucleotides on a full-length circRNA, but also whether a full-length circRNA interacts with a given RBP.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 compares the nucleotide-level performance of the present invention with CNN-BiGRU, Transformer and random classifiers on the test set;
in the figure: A) auROC based on the raw prediction scores of individual nucleotides from the MLP classification layer of the deep learning model (CircSite); B) auPRC based on the raw prediction scores of individual nucleotides from the MLP classification layer of CircSite; C) auPRC of CircSite with and without median filtering; D) MCC, F1 score, recall, precision and accuracy based on the final binarized scores determined by CircSite; E) MCC of CircSite using score binarization versus a default threshold of 0.5; F) F1 score using score binarization versus a default threshold of 0.5; G, H and I) raw prediction scores of the individual nucleotides of full-length hsa_circ_0072336, hsa_circ_0072414 and hsa_circ_0072416, respectively, for interaction with AGO1;
FIG. 3 shows the output of the deep learning model (CircSite) of the present invention on circRNAs, including the predicted raw scores, median filtering and score binarization;
in the figure: A, B and C are the raw scores, median-filtered scores and binarized scores, respectively, predicted for ZC3H7B on hsa_circ_0083799; D, E and F are the corresponding results predicted for ZC3H7B on hsa_circ_0083834;
FIG. 4 shows the importance of individual nucleotides of a circRNA, obtained by analyzing the deep learning model of the present invention;
in the figure: the left-hand side shows the motifs obtained from the CISBP-RNA database and the right-hand side shows the motifs detected by CircSite on full-length circRNA; a larger value indicates a larger contribution of the nucleotide to the prediction of bound nucleotides.
Detailed Description
In this example, a full-length circRNA dataset covering 37 RBPs was constructed, together with an independent full-length circRNA test set for testing whether the algorithm can predict if a whole circRNA interacts with the protein. As shown in FIG. 1, in the method of this embodiment the circRNA, cut into fragments, is first input into a one-dimensional CNN; the output is fed into the BiGRU and the Transformer separately; the two sequence representations are concatenated and input into an MLP classifier; median filtering and score binarization are then applied to the nucleotide interaction scores to obtain the predicted binding nucleotides and binding motifs between the full-length circRNA and the RBP, thereby identifying the nucleotide sites. In FIG. 1, L is the number of repetitions of the corresponding block; LN is layer normalization; $h_t$ and $Z_0$ are, respectively, the output of the last time step of the BiGRU and the flag placed at position 0 of the input matrix; A is the predicted binding score of each nucleotide on the full-length circRNA, B is the result after median-filter smoothing, and C is the final interaction score of the binding nucleotides on the circRNA after score binarization. The method specifically comprises the following steps:
S1, collecting full-length circRNA data as a benchmark dataset and dividing it into a training set and a test set. The dataset consists of 120000 full-length circRNA sequences for 37 RBPs extracted from the CircInteractome database (https://circinteractome.nia.nih.gov/). For each RBP, this example first divides the bound full-length sequences into a training set and a test set at a ratio of 8:2. Because high sequence similarity between the training set and the test set may lead to overestimated performance, redundant sequences in the test set are removed using CD-HIT with a similarity threshold of 0.8;
S2, defining positive and negative samples by using a sliding window on the full-length circRNA, specifically comprising the following steps:
S21, setting the sliding window size to 65;
S22, when the nucleotide at the middle position of the window interacts with the specified protein, the fragment covered by the sliding window is defined as a positive sample; otherwise it is a negative sample;
S3, using one-dimensional convolution to map each segment of convolution-kernel length to a vector representation, wherein the number of one-dimensional convolution kernels is 384 and their length is 7.
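The windowing and labelling of step S2, together with the one-hot encoding that feeds the convolution of step S3, can be sketched as follows. This is only an illustrative sketch: the function names, the column order `ACGU`, and the choice not to wrap windows around the back-splice junction are assumptions, not details given in the text.

```python
from typing import List, Set, Tuple

WINDOW = 65        # sliding-window length from step S21
ALPHABET = "ACGU"  # assumed column order of the one-hot matrix


def one_hot(fragment: str) -> List[List[int]]:
    """Encode a fragment as a len(fragment) x 4 matrix of 0/1 values."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in fragment]


def window_samples(seq: str, bound: Set[int],
                   window: int = WINDOW) -> List[Tuple[str, int]]:
    """Slide a window over seq and label each fragment by its middle nucleotide.

    `bound` holds 0-based positions of nucleotides known to bind the RBP.
    Assumption: windows stay inside the linearised sequence (no wrap-around).
    """
    half = window // 2
    samples = []
    for start in range(len(seq) - window + 1):
        centre = start + half
        label = 1 if centre in bound else 0  # step S22
        samples.append((seq[start:start + window], label))
    return samples
```

Each fragment would then be passed through `one_hot` before the convolution; nucleotides closer than half a window to either end never become a window centre under this assumption.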
S4, inputting the vector representation obtained in step S3 into a Transformer encoder and a BiGRU, respectively;
The Transformer encoder comprises a position encoding (PE) unit, a multi-head attention unit, a layer normalization (LN) unit and a feed-forward block unit, wherein: the PE unit preserves the relative or absolute position of each word in the sequence; the multi-head attention unit multiplies the word vectors by three matrices $W^Q$, $W^K$ and $W^V$ to obtain $Q$, $K$ and $V$, respectively, i.e. $\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_h)W^O$, with self-attention $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of matrix $K$; the LN unit applies layer normalization to the output of the multi-head attention unit, so that all inputs to the neurons of a layer are transformed into the same range for training; and the feed-forward block unit applies a fully connected nonlinear transformation to the output of the LN unit, giving the model greater expressive power.
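To make the self-attention step concrete, the following minimal sketch computes scaled dot-product attention for a single head in plain Python. It assumes that Q, K and V have already been obtained by multiplying the word vectors with the projection matrices; the function names are illustrative and a multi-head version would concatenate several such outputs.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```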
The position encoding has the same dimension as the word embedding and can be either learned during training or predefined, specifically: $PE_{(pos,2i)}=\sin\left(pos/10000^{2i/d}\right)$, $PE_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d}\right)$, where pos denotes the position of the word in the sequence, d denotes the dimension of the PE, 2i denotes an even index and 2i+1 an odd index (i.e., 2i ≤ d, 2i+1 ≤ d).
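A predefined (sinusoidal) position encoding of this kind can be sketched as below. The constant 10000 is the value used in the standard Transformer formulation and is assumed here; the function name is illustrative.

```python
import math
from typing import List


def positional_encoding(length: int, d: int) -> List[List[float]]:
    """Return a length x d matrix: sine on even indices, cosine on odd indices."""
    pe = [[0.0] * d for _ in range(length)]
    for pos in range(length):
        for even in range(0, d, 2):  # `even` plays the role of 2i
            angle = pos / (10000 ** (even / d))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d:
                pe[pos][even + 1] = math.cos(angle)
    return pe
```

The resulting matrix is added element-wise to the word vectors before they enter the encoder.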
The layer normalization (LN) operation acts on an input of size [C, H, W], where C, H and W are the number of channels, the height and the width of the feature map, respectively: $y=\frac{x-E[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}$, where $E[x]$ and $\mathrm{Var}[x]$ are the mean and variance of the input data, respectively, and $\epsilon$ is a very small number that prevents division by zero.
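As a small illustration of layer normalization, the sketch below normalizes the flattened features of one sample. It omits the learnable scale and shift that deep learning libraries usually add, and the value of `eps` is an assumed default.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize the features of one sample to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    # eps keeps the denominator away from zero when all features are equal.
    return [(v - mean) / math.sqrt(var + eps) for v in x]
```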
The feed-forward unit comprises two fully connected layers with a GELU activation function, specifically: $\mathrm{Output}=\mathrm{GELU}(\mathrm{MLP}(x))$.
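The feed-forward unit can be illustrated with the sketch below. It uses the exact GELU definition $x\,\Phi(x)$ and assumes that the activation sits between the two fully connected layers, which is the usual Transformer arrangement; the text does not spell out the placement, and the helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """Fully connected layer; each row of `weight` produces one output."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def feed_forward(x: List[float], w1: Matrix, b1: List[float],
                 w2: Matrix, b2: List[float]) -> List[float]:
    """Two fully connected layers with GELU in between (assumed placement)."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)
```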
The BiGRU is built from GRU cells, each comprising an update gate and a reset gate, wherein: the update gate determines how much information from the old state is copied into the new state and can capture long-term dependencies in the sequence; the reset gate determines how much information from the old state is retained when forming the new candidate state and can capture short-term dependencies in the sequence.
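The roles of the two gates can be seen in a deliberately tiny GRU sketch with scalar input and scalar hidden state. The parameter names in the dictionary are hypothetical, and a real BiGRU uses weight matrices and vector states; only the gating logic is meant to carry over.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h_prev: float, p: Dict[str, float]) -> float:
    """One GRU step for a scalar input and a scalar hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # The reset gate scales how much of the old state enters the candidate.
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate decides how much of the old state is copied forward.
    return z * h_prev + (1.0 - z) * cand


def bigru(xs: List[float], fwd: Dict[str, float],
          bwd: Dict[str, float]) -> List[float]:
    """Run the sequence in both directions and return both final states."""
    h_f = 0.0
    for x in xs:
        h_f = gru_step(x, h_f, fwd)
    h_b = 0.0
    for x in reversed(xs):
        h_b = gru_step(x, h_b, bwd)
    return [h_f, h_b]
```

With the update gate saturated near 1 the old state passes through unchanged, which is how long-range information survives many time steps.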
S5, the flag from position 0 of the Transformer encoder output and the vector representation of the last time step of the BiGRU are concatenated as the representation of the input RNA sequence, wherein: the position-0 flag of the Transformer encoder is a random vector placed at position 0 of the input matrix before it is input into the encoder.
The concatenation is specifically: $V=\mathrm{Concat}(X_1,X_2)$, where $V$ is the fused representation of the circRNA, and $X_1$ and $X_2$ are the representations learned by the BiGRU and the Transformer, respectively.
S6, inputting the concatenated vector from step S5 into a fully connected layer to obtain the probability that the RNA interacts with the protein.
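Steps S5 and S6 amount to joining the two branch outputs and mapping the result to a probability. The sketch below shows this with a single linear layer followed by a sigmoid; the embodiment describes a deeper MLP, so the one-layer classifier and the function names are simplifying assumptions.

```python
import math
from typing import List, Sequence


def concat(x1: Sequence[float], x2: Sequence[float]) -> List[float]:
    """V = Concat(X1, X2): join the BiGRU and Transformer representations."""
    return list(x1) + list(x2)


def binding_probability(v: Sequence[float], weights: Sequence[float],
                        bias: float) -> float:
    """Single fully connected layer with a sigmoid output (simplified MLP)."""
    logit = sum(w * f for w, f in zip(weights, v)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```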
S7, determining the binding regions between the full-length circRNA and the protein by median filtering and score binarization of the per-nucleotide interaction probabilities obtained in step S6, wherein: the median filter uses a window size of 46; and the threshold for score binarization is obtained from the validation set of each dataset.
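The post-processing of step S7 can be sketched as a median filter followed by thresholding. The edge handling (shrinking the window at the sequence ends) and the treatment of the window as `window // 2` positions on each side are assumptions, and the threshold is passed in as a parameter because the text derives it from a validation set.

```python
from statistics import median
from typing import List


def median_filter(scores: List[float], window: int) -> List[float]:
    """Replace each score by the median of its neighbourhood.

    The neighbourhood spans window // 2 positions on each side and is
    truncated at the ends of the sequence (an assumed edge rule).
    """
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        out.append(median(scores[lo:hi]))
    return out


def binarize(scores: List[float], threshold: float) -> List[int]:
    """Mark a nucleotide as binding when its smoothed score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]
```

Isolated high scores surrounded by low ones are suppressed by the median, which is how neighbouring nucleotides help remove false positives.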
In this embodiment, the one-dimensional convolutional network uses a kernel size of 7, 384 kernels and a stride of 1; the input and output dimensions of the GRU network are 384 and 768, respectively; the depth and the number of attention heads of the Transformer network are both 12, the weight matrices Q, K and V have dimension [384, 384], and the MLP has dimension [384, 768]; the input and output dimensions of Linear$_1$, Linear$_2$ and Linear$_3$ are [768, 384], [768, 384] and [768, 1], respectively.
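To show how these convolution settings shape the data, the sketch below applies a plain one-dimensional convolution to a one-hot matrix. It assumes no padding, so a 65-nucleotide window with kernel length 7 and stride 1 gives 59 positions; whether the embodiment pads the input is not stated. The test uses 3 kernels instead of 384 to stay small.

```python
from typing import List

Matrix = List[List[float]]


def conv1d(x: Matrix, kernels: List[Matrix], stride: int = 1) -> Matrix:
    """Valid (unpadded) 1-D convolution over the sequence axis.

    x is length x channels; each kernel is width x channels.
    The result is positions x number_of_kernels.
    """
    width = len(kernels[0])
    channels = len(x[0])
    out = []
    for start in range(0, len(x) - width + 1, stride):
        patch = x[start:start + width]
        out.append([
            sum(k[i][c] * patch[i][c]
                for i in range(width) for c in range(channels))
            for k in kernels
        ])
    return out
```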
In this embodiment, the last layer uses a sigmoid activation function, training uses the Adam optimizer with an initial learning rate of $10^{-4}$ and a binary cross-entropy loss function, and the training batch size is 128. An early-stopping mechanism is used during training: training stops when the loss on the validation set has not decreased for 5 epochs, to prevent overfitting.
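The loss and the early-stopping rule can be sketched independently of any deep learning framework. In the sketch below, `stop_epoch` is a hypothetical helper that replays a list of validation losses and reports where training would stop with a patience of 5; the clipping constant `eps` is an assumption added for numerical safety.

```python
import math
from typing import List


def bce(probs: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted probabilities and labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def stop_epoch(val_losses: List[float], patience: int = 5) -> int:
    """Return the 0-based epoch at which training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1
```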
The window size for median filtering employed in this embodiment is 23.
The binarization strategy adopted in this embodiment specifically includes:
The integrated gradients method of this embodiment comprises the following steps:
First, the input to CircSite is a one-hot encoded matrix containing only two values, 0 and 1. The value 1 is divided equally into 100 parts, resulting in 100 interpolation matrices. In theory, the more interpolation steps, the more accurately the gradient changes are computed, but the amount of computation also increases. Experiments showed that 100 interpolation steps are sufficient to draw an accurate gradient curve.
Second, the interpolation matrices are input into the trained deep model, and the gradient with respect to each feature is computed at each interpolation step.
Third, the importance of each feature is obtained from the integral of the gradient curve over the 100 interpolation matrices, namely: $\mathrm{IG}_i(x)=(x_i-x'_i)\times\frac{1}{m}\sum_{k=1}^{m}\frac{\partial F\left(x'+\frac{k}{m}(x-x')\right)}{\partial x_i}$, where $F$ is the trained model, $x$ is the one-hot input, $x'$ is the all-zero baseline and $m=100$ is the number of interpolation steps.
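The three steps above can be illustrated on a toy model whose gradient is known in closed form. In this sketch the "trained model" is a hypothetical logistic function of a weighted sum, the baseline is assumed to be all zeros, and the integral is approximated by a 100-step Riemann sum; a real implementation would obtain the gradients from the network by automatic differentiation.

```python
import math
from typing import List


def model(x: List[float], w: List[float]) -> float:
    """Toy stand-in for the trained network: sigmoid of a weighted sum."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def gradient(x: List[float], w: List[float]) -> List[float]:
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x: List[float], w: List[float],
                         steps: int = 100) -> List[float]:
    """Average the gradients along the path from the zero baseline to x."""
    baseline = [0.0] * len(x)
    total = [0.0] * len(x)
    for k in range(1, steps + 1):
        # k-th interpolation matrix between the baseline and the input.
        point = [b + (k / steps) * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]
```

Features that are 0 in the one-hot input receive zero attribution, and the attributions approximately sum to the difference between the model output at the input and at the baseline.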
the evaluation indexes adopted in the present embodiment include:
(i) Metrics at the nucleotide level: $\mathrm{auROC}=\frac{\sum_{i\in\text{positive}}\mathrm{rank}_i-\frac{M(M+1)}{2}}{M\times N}$, where auROC is the area under the ROC curve; $\mathrm{rank}_i$ is the rank of the i-th sample when the probability scores are sorted in ascending order; M and N are the numbers of positive and negative samples, respectively; and $\sum_{i\in\text{positive}}\mathrm{rank}_i$ is the sum of the ranks of the positive samples. auPRC is the area under the precision-recall curve.
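The rank-based auROC described here can be computed directly, as in the sketch below. Tied scores are given their average rank, which is a common convention assumed here rather than taken from the text; the function expects at least one positive and one negative sample.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """Rank-sum auROC: (sum of positive ranks - M(M+1)/2) / (M * N)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    m = sum(labels)           # number of positive samples
    n = len(labels) - m       # number of negative samples
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - m * (m + 1) / 2.0) / (m * n)
```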
ACC (accuracy), precision, recall, F1 score and MCC (Matthews correlation coefficient) are also calculated.
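These threshold-based metrics all follow from the four confusion-matrix counts. The sketch below uses their standard definitions; returning 0.0 when a denominator is zero is an assumed convention.

```python
import math
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and MCC from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }
```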
(ii) Metrics at the interaction-region level: an accuracy based on the overlap between the true labelled regions and the predicted regions;
and an $F1_B$ score based on the true labelled binding regions and the predicted binding regions, where F1 is a measure that takes both the precision and the recall of the classification model into account.
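Region-level evaluation first needs contiguous regions extracted from per-nucleotide labels and then a rule for when a predicted region counts as a hit. The sketch below implements the two overlap criteria given with Table 3; reading "maximum area" as the longer of the two regions is an interpretation, and all function names are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open interval [start, end)


def regions(labels: List[int]) -> List[Region]:
    """Collect runs of consecutive 1s as half-open intervals."""
    out, start = [], None
    for i, v in enumerate(labels):
        if v == 1 and start is None:
            start = i
        elif v != 1 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(labels)))
    return out


def overlap(a: Region, b: Region) -> int:
    """Number of positions shared by two regions."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def hit_any(true: Region, pred: Region) -> bool:
    """Criterion a): any overlap counts as a correct prediction."""
    return overlap(true, pred) > 0


def hit_half(true: Region, pred: Region) -> bool:
    """Criterion b): overlap exceeds 50% of the longer region (assumed reading)."""
    longest = max(true[1] - true[0], pred[1] - pred[0])
    return overlap(true, pred) > 0.5 * longest
```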
The predicted results of the experiments include:
In the experimental stage, this example was compared with the current best circRNA algorithm (Yang, Y., et al., iCircRBP-DHN: identification of circRNA-RBP interaction sites using deep hierarchical network. Brief Bioinform, 2020.); the results are shown in Table 1, and this example obtained excellent results on both average evaluation indexes, auROC and auPRC. To evaluate this example more comprehensively, various nucleotide-level indexes were calculated, as shown in Table 2; the two region-level indexes defined for CircSite were also calculated, as shown in Table 3; and, to further demonstrate the effect of CircSite, the prediction of whether a full-length circRNA interacts with an RBP was evaluated on a separate full-length circRNA test set, as shown in Table 4.
Table 1. Performance comparison of CircSite and iCircRBP-DHN on the full-length circRNA dataset of 37 RBPs.
Table 2. Nucleotide-level metrics, where the threshold is obtained from the validation set.
Table 3. Interaction-region-level metrics. a) When the true binding region overlaps the predicted region, the prediction is considered correct; b) when the overlap between the true binding region and the predicted region accounts for more than 50% of the maximum region, the prediction is considered correct.
Table 4. Results of predicting whether a full-length circRNA interacts with an RBP on the independent full-length circRNA test set.
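The two correctness criteria described for Table 3 can be illustrated with a small sketch. It assumes that regions are half-open nucleotide intervals and that "the maximum region" means the longer of the true and predicted regions; both are interpretations, and all function names are hypothetical.

```python
from typing import Tuple

Region = Tuple[int, int]  # half-open interval [start, end) in nucleotide coordinates


def overlap_length(a: Region, b: Region) -> int:
    """Number of nucleotides shared by two regions (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def correct_any_overlap(true_region: Region, pred_region: Region) -> bool:
    """Criterion a): any overlap counts as a correct prediction."""
    return overlap_length(true_region, pred_region) > 0


def correct_majority_overlap(
    true_region: Region, pred_region: Region, min_fraction: float = 0.5
) -> bool:
    """Criterion b): overlap must exceed 50% of the longer region (assumed reading)."""
    longer = max(true_region[1] - true_region[0], pred_region[1] - pred_region[0])
    if longer <= 0:
        return False
    return overlap_length(true_region, pred_region) / longer > min_fraction


true_region = (100, 160)
loose_hit = (150, 200)   # overlaps by 10 nucleotides
tight_hit = (110, 165)   # overlaps by 50 nucleotides
```

Under these assumptions `loose_hit` is correct only by criterion a), while `tight_hit` is correct by both criteria.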
Independent test sets were constructed for the 37 RBPs to predict whether a full-length circRNA interacts with an RBP. Each full-length circRNA is used as one sample when calculating the performance metrics: if a full-length circRNA has at least one region of interaction with the RBP, its prediction result is 1; otherwise it is 0. As shown in Table 4, averaged over the 37 RBPs, CircSite achieved an accuracy of 0.716, an MCC of 0.461, a precision of 0.813, a recall of 0.595 and an F1 score of 0.668. The accuracy ranged from 0.615 for LIN28A to 0.841 for HNRNPC, and exceeded 0.7 for 20 of the 37 RBPs. The results show that CircSite performs well not only in nucleotide-level prediction but also in predicting whether a full-length circRNA interacts with an RBP.
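The full-length evaluation described above can be sketched as follows: each circRNA is collapsed to a single 0/1 label, and the usual binary metrics are computed over those labels. The helper names and the toy data are hypothetical and are not taken from the patent.

```python
import math
from typing import Dict, List, Sequence


def full_length_label(nucleotide_calls: Sequence[int]) -> int:
    """Return 1 if at least one nucleotide is called as binding, else 0."""
    return 1 if any(nucleotide_calls) else 0


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and MCC for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Toy per-nucleotide calls for six circRNAs (illustrative data only).
calls = [[0, 1, 1, 0], [1, 0], [0, 0, 0], [0, 0], [0], [0, 1]]
y_pred = [full_length_label(c) for c in calls]
y_true = [1, 1, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Reducing each circRNA to one label means a single false-positive nucleotide call flips the whole sample, which is one reason the post-processing of nucleotide scores matters for this metric.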
In conclusion, the invention can explore RBP binding on full-length circRNA at nucleotide resolution, accurately predict RBP-binding nucleotides, and detect binding motifs.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (7)
1. A method for predicting protein-binding nucleotide sites on full-length circRNA based on interpretable Transformer deep learning, characterized in that the full-length circRNA is cut into fragments that are input into a one-dimensional CNN; the resulting local high-level abstract features are fed into a dual-branch network consisting of a BiGRU network and a Transformer encoder, which yield, respectively, long-range dependency features of the input data and a global-attention-based representation of the circRNA sequence; the representation of each circRNA fragment is input into an MLP classifier; finally, falsely predicted binding nucleotides are removed by median filtering according to the binding information of adjacent nucleotides so as to reduce the false positive rate, the predicted binding nucleotides are obtained by a score binarization strategy, and key sequence content is identified by integrated gradients to obtain the predicted binding motifs between the full-length circRNA and the RBP, thereby identifying the nucleotide sites.
2. The method of predicting a protein-bound nucleotide site according to claim 1, wherein the Transformer encoder comprises: a Position Encoding (PE) unit, a multi-head attention mechanism unit, a Layer Normalization (LN) unit, and a feed-forward block unit, wherein: the PE unit preserves the relative or absolute position of each word in the sequence; the multi-head attention mechanism unit multiplies the word vectors by three matrices $W^Q$, $W^K$ and $W^V$ to obtain three values $Q$, $K$ and $V$, respectively, i.e. $\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_h)W^{O}$, with self-attention $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$, wherein: $d_k$ is the dimension of matrix $K$; the LN unit applies layer normalization to the information obtained by the multi-head attention mechanism unit, so that, for training a whole layer of neurons, all inputs undergo the same transformation into the same range; and the feed-forward block unit applies a fully connected nonlinear transformation to the information processed by the LN unit, giving the model stronger expressive capacity.
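A minimal sketch of single-head self-attention follows, assuming the standard scaled dot-product form in which the scores are divided by the square root of d_k before the softmax. The helper names are hypothetical, and the multi-head concatenation and learned projection matrices are omitted for brevity.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


# A zero query attends uniformly, so the output is the mean of the value rows.
q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Because every row of attention weights sums to 1, each output row is a weighted average of the value rows; this is what allows a fragment position to draw on information from every other position.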
3. The method of claim 2, wherein the position encoding has the same dimension as the word embedding and can be obtained by training or by predefinition, specifically: $PE_{(pos,\,2i)}=\sin\!\left(\dfrac{pos}{10000^{2i/d}}\right)$, $PE_{(pos,\,2i+1)}=\cos\!\left(\dfrac{pos}{10000^{2i/d}}\right)$, wherein: $pos$ represents the position of the word in the sequence, $d$ represents the dimension of PE, $2i$ represents an even index and $2i+1$ represents an odd index (i.e., $2i \le d$, $2i+1 \le d$);
the Layer Normalization (LN) operation acts on an input of size [C, H, W], where C, H and W are the number of channels, the height and the width of the feature map, respectively: $y=\dfrac{x-\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}$, wherein: $\mathrm{E}$ and $\mathrm{Var}$ are the mean and variance of the input data, respectively, and $\epsilon$ is a very small number that prevents division by zero;
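The position encoding and layer normalization of claim 3 can be sketched in a few lines. This assumes the standard sinusoidal encoding with base 10000 and a layer normalization without learned scale and shift parameters; the function names and the default `eps` value are illustrative choices.

```python
import math
from typing import List


def positional_encoding(seq_len: int, d: int) -> List[List[float]]:
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    pe = [[0.0] * d for _ in range(seq_len)]
    for pos in range(seq_len):
        for j in range(d):
            two_i = j - (j % 2)  # the even index 2i shared by the pair (2i, 2i+1)
            angle = pos / (10000 ** (two_i / d))
            pe[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return pe


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """(x - E[x]) / sqrt(Var[x] + eps), computed over one feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


pe = positional_encoding(seq_len=4, d=8)
normalized = layer_norm([1.0, 2.0, 3.0])
```

Position 0 encodes to alternating 0 and 1, and the normalized vector has mean close to zero, which gives a quick sanity check on both functions.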
4. The method of predicting a protein-bound nucleotide site as claimed in claim 1, wherein the BiGRU network comprises: an update gate and a reset gate, wherein: the update gate determines how much information from the old state is copied into the new state and can capture long-term dependencies in the sequence; the reset gate determines how much information from the old state should be remembered and can capture short-term correlations in the sequence.
5. The method for predicting the protein-bound nucleotide site according to any one of claims 1 to 4, which specifically comprises:
S1, collecting full-length circRNA data as a reference data set and dividing it into a training set and a test set, wherein the data set consists of 120000 full-length circRNA sequences of 37 RBPs extracted from the CircInteractome database; for each RBP, the bound full-length sequences are first split into a training set and a test set at a ratio of 8:2, and redundant sequences in the test set are removed using CD-HIT with a similarity threshold of 0.8;
S2, defining positive and negative samples using a sliding window on the full-length circRNA, specifically comprising:
S21, setting the sliding window size to 65;
S22, when the nucleotide at the middle position of the window can interact with the designated protein, the fragment covered by the sliding window is defined as a positive sample; otherwise it is a negative sample;
S3, using one-dimensional convolution to map kernel-sized fragments to vector representations, wherein the number of one-dimensional convolution kernels is 384 and the kernel length is 7;
S4, feeding the vector representations obtained in step S3 into the Transformer encoder and the BiGRU, respectively;
S5, concatenating the token at position 0 of the Transformer encoder with the vector representation of the last time step of the BiGRU as the representation of the input RNA sequence, wherein: the position-0 token of the Transformer encoder is a random vector placed as a marker at position 0 of the input matrix before it is fed to the encoder;
the concatenation is specifically: $V = \mathrm{Concat}(X_1, X_2)$, wherein: $V$ is the fused representation of the circRNA, and $X_1$ and $X_2$ are the representations learned by the BiGRU and the Transformer, respectively;
S6, feeding the concatenated vector from step S5 into a fully connected layer to obtain the probability that the RNA interacts with the protein;
S7, determining the binding regions between the full-length circRNA and the protein by median filtering and score binarization according to the interaction probability of each nucleotide of the full-length circRNA obtained in step S6, wherein: the median filter uses a window size of 46;
for score binarization, the threshold is obtained from the validation set of each data set;
the input and output dimensions of the BiGRU network are 384 and 768, respectively; the depth and the number of attention heads of the Transformer network are both 12; the dimensions of the weight matrices Q, K and V are [384, 384], and the MLP has dimensions [384, 768]; the input and output dimensions of the Linear_1, Linear_2 and Linear_3 networks are [768, 384], [768, 384] and [768, 1], respectively; the final layer uses a sigmoid activation function; training uses the Adam optimizer with an initial learning rate of $10^{-4}$, a binary cross-entropy loss function, and a batch size of 128; an early-stopping mechanism is used during training, i.e., training stops when the loss on the validation set has not decreased for 5 epochs, to prevent overfitting.
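The post-processing of step S7 can be sketched as follows. The window size of 46 comes from the claim; truncating the window at the sequence ends, the placement of an even-sized window around each position, the `>=` comparison against the threshold, and all function names are assumptions made for this illustration.

```python
from statistics import median
from typing import List, Tuple


def median_filter(scores: List[float], window: int = 46) -> List[float]:
    """Replace each score by the median of its neighbourhood (truncated at the ends)."""
    left = window // 2
    right = window - left - 1  # an even window reaches one position further left
    n = len(scores)
    return [median(scores[max(0, i - left): min(n, i + right + 1)]) for i in range(n)]


def binarize(scores: List[float], threshold: float) -> List[int]:
    """Score binarization: 1 for predicted binding nucleotides, 0 otherwise."""
    return [1 if s >= threshold else 0 for s in scores]


def binding_regions(calls: List[int]) -> List[Tuple[int, int]]:
    """Collapse runs of 1s into half-open (start, end) regions."""
    regions, start = [], None
    for i, call in enumerate(calls):
        if call and start is None:
            start = i
        elif not call and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(calls)))
    return regions


# Toy scores: one isolated spike (index 10) and one sustained run (indices 21-28).
scores = [0.1] * 10 + [0.9] + [0.1] * 10 + [0.9] * 8 + [0.1] * 5
smoothed = median_filter(scores, window=5)  # small window only for this toy input
regions = binding_regions(binarize(smoothed, threshold=0.5))
```

With the toy input, the isolated spike is suppressed while the sustained run survives, which is how the filter uses the binding information of adjacent nucleotides to lower the false positive rate.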
7. A nucleotide-level-resolution prediction system for implementing the method of any one of claims 1 to 6, comprising: a data processing unit, a feature extraction unit, a dual-branch network unit, a classification layer unit and a prediction result post-processing unit, wherein: the data processing unit applies a sliding window to the input full-length circRNA to obtain circRNA fragments of equal length and encodes them into a matrix using one-hot encoding; the feature extraction unit performs one-dimensional convolutional feature extraction on the data processing result to obtain word vectors representing the sequence; the dual-branch network unit processes the extracted features with the BiGRU and Transformer dual-branch network to obtain the sequence representations from the two branches; the classification layer unit performs MLP classification on the information obtained from the dual-branch network to obtain the probability that a nucleotide interacts with the RBP; and the prediction result post-processing unit generates the binding motifs between the full-length circRNA and the RBP according to the obtained probabilities, thereby identifying the nucleotide sites.
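The data processing unit can be illustrated with a short sketch of the sliding window and one-hot encoding. The window size of 65 and the centre-nucleotide labelling rule come from the claims; padding the sequence ends with `N` (encoded as an all-zero row), the `ACGU` column order, and the function names are assumptions, since the text does not state how positions near the ends are handled.

```python
from typing import List, Tuple

ALPHABET = "ACGU"  # assumed column order of the one-hot matrix


def sliding_windows(
    seq: str, site_labels: List[int], window: int = 65
) -> List[Tuple[str, int]]:
    """One fragment per nucleotide, labelled by whether the centre nucleotide binds."""
    half = window // 2
    padded = "N" * half + seq + "N" * half  # assumed end handling
    return [(padded[i:i + window], site_labels[i]) for i in range(len(seq))]


def one_hot(fragment: str) -> List[List[int]]:
    """Encode a fragment as a len(fragment) x 4 matrix; 'N' becomes an all-zero row."""
    return [[1 if base == b else 0 for b in ALPHABET] for base in fragment]


seq = "ACGUACGUAC"
site_labels = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
samples = sliding_windows(seq, site_labels, window=5)  # small window for illustration
matrix = one_hot(samples[2][0])
```

Every nucleotide yields exactly one equal-length fragment, so the per-fragment probabilities map back onto the full-length circRNA one position at a time.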
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111501583.2A CN114187963A (en) | 2021-12-09 | 2021-12-09 | Prediction method of protein binding nucleotide sites on full-length circular RNA |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114187963A true CN114187963A (en) | 2022-03-15 |
Family
ID=80604112
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111501583.2A Pending CN114187963A (en) | 2021-12-09 | 2021-12-09 | Prediction method of protein binding nucleotide sites on full-length circular RNA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114187963A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115273979A (en) * | 2022-07-04 | 2022-11-01 | 苏州大学 | Mononucleotide nonsense mutation pathogenicity prediction system based on self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||