CN114882945A - Ensemble learning-based RNA-protein binding site prediction method - Google Patents
Ensemble learning-based RNA-protein binding site prediction method
- Publication number
- CN114882945A (application CN202210807909.2A)
- Authority
- CN
- China
- Prior art keywords
- models
- rna
- model
- network
- protein binding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Biotechnology (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Genetics & Genomics (AREA)
- Chemical & Material Sciences (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention belongs to the field of bioinformatics and relates to an ensemble learning-based RNA-protein binding site prediction method that aims to improve the performance of deep learning models in predicting RNA-protein binding sites. First, a convolutional neural network, a residual neural network and a long short-term memory network are selected as the submodel networks of the ensemble learning model. Second, the three submodel networks are trained separately. Finally, the RNA sequence to be predicted is preprocessed and used as the input to the three submodel networks, and the average of their outputs is taken as the final prediction result. The proposed model was tested on the RBP-24 data set, where it obtained an average AUC of 0.951 across the 24 experiments, exceeding the other methods compared.
Description
Technical Field
The invention belongs to the field of bioinformatics and relates to an ensemble learning-based RNA-protein binding site prediction method, which involves techniques such as convolutional neural networks, long short-term memory networks, residual neural networks and RNA sequence data processing.
Background
RNA-binding proteins play an important role in regulating the vital activities of living cells. However, high-throughput assays for finding RNA-protein binding sites are time-consuming and laborious, and interference from experimental noise causes them to produce large numbers of false positive and false negative samples. Deep learning is a powerful tool for predicting the binding sites of RNA-binding proteins, but most existing deep learning methods use only one type of network.
Disclosure of Invention
The main innovation of the invention is a novel ensemble learning-based RNA-protein binding site prediction method that integrates several deep learning methods to improve the model's performance in predicting RNA-protein binding sites.
An ensemble learning-based RNA-protein binding site prediction method, comprising the following steps:
S1: constructing the ensemble learning model: three deep learning models, namely a convolutional neural network, a convolutional-long short-term memory network and a residual neural network, are used as the submodels of the ensemble learning model;
S2: training and saving the submodels: the 3 sub-network models in S1 are trained separately on the training set data, the loss is reduced using the back propagation algorithm, and the 3 trained sub-network models are saved;
S3: predicting binding sites: the RNA sequence to be predicted is preprocessed, and the 3 sub-network models saved in S2 are used to predict whether the RNA sequence contains RNA-protein binding sites.
An ensemble learning-based RNA-protein binding site prediction method, wherein the implementation process of step S1 is as follows:
Three deep learning models, namely a convolutional neural network, a convolutional-long short-term memory network and a residual neural network, are selected as the submodels of the ensemble learning model. The types and number of submodels are not limited to these three: the better the submodels perform and the more submodels there are, the better the ensemble learning model performs.
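Of the three submodels, the residual neural network is the one built from residual blocks, each consisting of two convolutional layers according to the detailed description. The following is a minimal sketch of one such block, not the patent's implementation. It assumes a single input channel, ReLU activations, odd-length kernels and zero padding, none of which the source specifies; the function names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output has the input's length.

    Assumes an odd-length kernel so the padding is symmetric.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(signal, kernel_a, kernel_b):
    """Two convolutions plus a skip connection: relu(x + conv(relu(conv(x))))."""
    hidden = relu(conv1d_same(signal, kernel_a))
    residual = conv1d_same(hidden, kernel_b)
    return relu([x + r for x, r in zip(signal, residual)])


# Hypothetical toy input and kernels, chosen so the result is easy to check.
x = [1.0, 2.0, 3.0, 4.0]
out = residual_block(x, [0.0, 1.0, 0.0], [0.0, 0.5, 0.0])
```

Because the skip connection adds the input back to the output of the two convolutions, a block whose kernels are all zero simply passes a non-negative input through unchanged, which is what makes deep stacks of such blocks easier to train.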
An ensemble learning-based RNA-protein binding site prediction method, wherein the implementation process of step S2 is as follows:
The 3 sub-network models in step S1 are trained separately. A cross-entropy loss function is used to reduce each model's loss, each model is trained for 50 epochs, and the 3 trained sub-network models are saved separately.
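The training procedure in step S2 can be illustrated with a small sketch. It is not the patent's code: a single logistic unit stands in for a deep submodel, and the learning rate, the toy data and all function names are assumptions made for illustration. Only the cross-entropy loss and the 50 training epochs come from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def train(features, labels, epochs=50, lr=0.5):
    """Gradient descent on one logistic unit (a stand-in for a deep submodel)."""
    n_samples = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    history = []  # loss measured at the start of each epoch
    for _ in range(epochs):
        probs = [
            sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            for row in features
        ]
        history.append(cross_entropy(labels, probs))
        # The gradient of the cross-entropy with respect to the logit is (p - y).
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y, p in zip(features, labels, probs):
            err = p - y
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias, history


# Hypothetical toy data: the first feature separates the two classes.
X = [[1.0, 0.0], [0.9, 0.2], [0.1, 0.8], [0.0, 1.0]]
y = [1, 1, 0, 0]
weights, bias, history = train(X, y)
```

In a real submodel the gradient would be propagated back through every layer by the back propagation algorithm; the loop structure, one loss evaluation and one parameter update per epoch, stays the same.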
An ensemble learning-based RNA-protein binding site prediction method, wherein the implementation process of step S3 is as follows:
The RNA sequence data to be predicted are preprocessed and used as the input to each of the 3 sub-network models saved in step S2. Each sub-network model produces a prediction result, and the final prediction result is the average of the prediction results of the 3 sub-network models.
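The averaging in step S3 can be sketched as follows. The three stub functions are hypothetical placeholders that return fixed scores; in practice each would be a trained submodel loaded from disk. The function name `ensemble_predict` is also an assumption.

```python
from typing import Callable, Sequence

# A submodel maps a (preprocessed) RNA sequence to a binding probability.
Submodel = Callable[[str], float]


def ensemble_predict(sequence: str, submodels: Sequence[Submodel]) -> float:
    """Return the unweighted mean of the submodels' predictions."""
    if not submodels:
        raise ValueError("at least one submodel is required")
    scores = [model(sequence) for model in submodels]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for the three trained submodels.
def cnn_stub(sequence: str) -> float:
    return 0.9


def cnn_lstm_stub(sequence: str) -> float:
    return 0.6


def resnet_stub(sequence: str) -> float:
    return 0.75


score = ensemble_predict("ACGUACGU", [cnn_stub, cnn_lstm_stub, resnet_stub])
```

Because the function takes a list of callables, adding further submodels only means extending that list, which matches the statement that the ensemble is not limited to three submodels.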
RNA-binding proteins are deeply involved in human life activities, and studies have shown that mutations in RNA-binding proteins can cause several serious human diseases. Decoding RNA-protein binding sites is therefore of great significance for the research and treatment of related diseases in the medical field. The ensemble learning-based RNA-protein binding site prediction method helps to identify RNA-protein binding sites in RNA sequences rapidly and accurately.
Drawings
FIG. 1 is a flow diagram of training and testing an ensemble learning model.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
The invention aims to provide an ensemble learning-based RNA-protein binding site prediction method, which comprises the following steps:
Step 1, preprocessing the RNA sequences using windows with nucleotide lengths of 101, 151, 201, 251, 301, 351, 401, 451 and 501, converting the processed sequences into one-hot encoding matrices, and recording the 9 resulting one-hot encoding matrices as [symbol omitted in source];
Step 2, building a convolutional neural network submodel comprising two convolutional layers, a fully connected layer and an output layer; training the model for 50 epochs on the 9 one-hot encodings [symbol omitted in source] preprocessed from the training set, saving it, and recording the saved convolutional neural network submodel as [symbol omitted in source];
Step 3, building a convolutional-long short-term memory network submodel comprising a convolutional layer, a long short-term memory layer, a fully connected layer and an output layer; training the model for 50 epochs on the 9 one-hot encodings [symbol omitted in source] preprocessed from the training set, saving it, and recording the saved convolutional-long short-term memory network submodel as [symbol omitted in source];
Step 4, building a residual neural network submodel comprising 9 residual blocks (each consisting of two convolutional layers), a fully connected layer and an output layer; training the model for 50 epochs on the 9 one-hot encodings [symbol omitted in source] preprocessed from the training set, saving it, and recording the saved residual neural network submodel as [symbol omitted in source];
Step 5, preprocessing the RNA sequence to be predicted into 9 one-hot encoding matrices [symbol omitted in source] in the same way as the training set; forward propagating once through the network submodel [symbol omitted in source] to obtain 9 prediction results [symbol omitted in source]; forward propagating once through the network submodel [symbol omitted in source] to obtain 9 prediction results [symbol omitted in source]; forward propagating once through the network submodel [symbol omitted in source] to obtain 9 prediction results [symbol omitted in source]; the final prediction result is the average of the 27 prediction results.
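The preprocessing used in Steps 1 and 5 can be sketched as below. The nine window lengths come from the text; everything else is an assumption made for illustration: the windows are taken as centred crops, shorter sequences are padded with `N` on both sides, the channel order is A, C, G, U, `T` is read as `U`, and unknown bases become all-zero rows. The function names are hypothetical.

```python
# Window lengths listed in Step 1.
WINDOW_LENGTHS = (101, 151, 201, 251, 301, 351, 401, 451, 501)
ALPHABET = "ACGU"  # assumed channel order


def center_window(sequence, length, pad="N"):
    """Crop or pad `sequence` symmetrically to exactly `length` characters."""
    seq = sequence.upper().replace("T", "U")
    if len(seq) >= length:
        start = (len(seq) - length) // 2
        return seq[start:start + length]
    missing = length - len(seq)
    left = missing // 2
    return pad * left + seq + pad * (missing - left)


def one_hot(sequence):
    """Return a len(sequence) x 4 matrix; unknown bases give all-zero rows."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in sequence]


def preprocess(sequence):
    """Return one one-hot matrix per window length (9 matrices in total)."""
    return [one_hot(center_window(sequence, n)) for n in WINDOW_LENGTHS]


# Hypothetical 240-nucleotide input sequence.
matrices = preprocess("ACGU" * 60)
```

Feeding these 9 matrices to each of the three submodels yields the 27 prediction results that Step 5 averages.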
The results of the experiments are shown in the following table:
TABLE 1 comparative experimental results
As can be seen from Table 1, the proposed ensemble learning method reached an average AUC of 0.951 over the 24 data sets, exceeding GraphProt, deepnet-rbp, iDeepE, DeepCLIP, iDeepC and MCNN. This demonstrates that the ensemble learning-based RNA-protein binding site prediction method is effective.
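For readers unfamiliar with the metric, the sketch below shows one common way to compute an AUC and to average it over several data sets. It uses the pairwise (Mann-Whitney) formulation with ties counted as one half; the source does not say how its AUC values were computed, and the labels and scores here are hypothetical toy values, not results from the patent.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the fraction of correctly ranked pairs.

    Each (positive, negative) pair counts 1 if the positive scores higher
    and 0.5 if the two scores are tied.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auc(datasets):
    """Average the AUC over a list of (labels, scores) pairs, one per data set."""
    values = [roc_auc(labels, scores) for labels, scores in datasets]
    return sum(values) / len(values)


# Hypothetical toy data: three of the four positive/negative pairs are ranked correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(toy_labels, toy_scores)
```

An average AUC such as the one reported over the 24 data sets would then be the mean of the per-data-set values, as in `mean_auc`.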
The foregoing describes the invention in further detail in connection with specific preferred embodiments, and the invention is not intended to be limited to these specific details. Those skilled in the art to which the invention pertains may make simple deductions or substitutions without departing from the spirit of the invention, all of which shall be considered to fall within the scope of protection of the invention.
Claims (4)
1. An ensemble learning-based RNA-protein binding site prediction method, comprising the following steps:
S1: constructing the ensemble learning model: three deep learning models, namely a convolutional neural network, a convolutional-long short-term memory network and a residual neural network, are used as the submodels of the ensemble learning model;
S2: training and saving the submodels: the 3 sub-network models in S1 are trained separately on the training set data, the loss is reduced using the back propagation algorithm, and the 3 trained sub-network models are saved;
S3: predicting binding sites: the RNA sequence to be predicted is preprocessed, and the 3 sub-network models saved in S2 are used to predict whether the RNA sequence contains RNA-protein binding sites.
2. The ensemble learning-based RNA-protein binding site prediction method according to claim 1, wherein the step S1 is implemented as follows:
selecting three deep learning models, namely a convolutional neural network, a convolutional-long short-term memory network and a residual neural network, as the submodels of the ensemble learning model;
the types and number of submodels of the ensemble learning model are not limited to these three: the better the submodels perform and the more submodels there are, the better the ensemble learning model performs.
3. The ensemble learning-based RNA-protein binding site prediction method according to claim 1, wherein the step S2 is implemented as follows:
training the 3 sub-network models in step S1 separately, using a cross-entropy loss function to reduce each model's loss, training each model for 50 epochs, and saving the 3 trained sub-network models separately.
4. The ensemble learning-based RNA-protein binding site prediction method according to claim 1, wherein the step S3 is implemented as follows:
preprocessing the RNA sequence data to be predicted and using the preprocessed data as the input to each of the 3 sub-network models saved in step S2, wherein each sub-network model produces a prediction result, and the final prediction result is the average of the prediction results of the 3 sub-network models.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210807909.2A CN114882945A (en) | 2022-07-11 | 2022-07-11 | Ensemble learning-based RNA-protein binding site prediction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210807909.2A CN114882945A (en) | 2022-07-11 | 2022-07-11 | Ensemble learning-based RNA-protein binding site prediction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114882945A true CN114882945A (en) | 2022-08-09 |
Family
ID=82683166
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210807909.2A Withdrawn CN114882945A (en) | 2022-07-11 | 2022-07-11 | Ensemble learning-based RNA-protein binding site prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114882945A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115966249A (en) * | 2023-02-15 | 2023-04-14 | 北京科技大学 | Fractional order neural network-based protein-ATP binding site prediction method and device |
CN116844646A (en) * | 2023-09-04 | 2023-10-03 | 鲁东大学 | Enzyme function prediction method based on deep contrast learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446602A (en) * | 2016-09-06 | 2017-02-22 | 中南大学 | Prediction method and system for RNA binding sites in protein molecules |
CN108763865A (en) * | 2018-05-21 | 2018-11-06 | 成都信息工程大学 | A kind of integrated learning approach of prediction DNA protein binding sites |
CN110689920A (en) * | 2019-09-18 | 2020-01-14 | 上海交通大学 | Protein-ligand binding site prediction algorithm based on deep learning |
CN113936738A (en) * | 2021-12-14 | 2022-01-14 | 鲁东大学 | RNA-protein binding site prediction method based on deep convolutional neural network |
CN114420211A (en) * | 2022-03-28 | 2022-04-29 | 鲁东大学 | Attention mechanism-based RNA-protein binding site prediction method |
-
2022
- 2022-07-11 CN CN202210807909.2A patent/CN114882945A/en not_active Withdrawn
Non-Patent Citations (2)
Title |
---|
ZHENGSEN PAN 等: "MCNN: multiple convolutional neural networks for RNA-protein binding sites prediction", 《IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS》 * |
董正心 等: "RBP结合位点预测的深度学习方法进展", 《桂林电子科技大学学报》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115966249A (en) * | 2023-02-15 | 2023-04-14 | 北京科技大学 | Fractional order neural network-based protein-ATP binding site prediction method and device |
CN115966249B (en) * | 2023-02-15 | 2023-05-26 | 北京科技大学 | protein-ATP binding site prediction method and device based on fractional order neural network |
CN116844646A (en) * | 2023-09-04 | 2023-10-03 | 鲁东大学 | Enzyme function prediction method based on deep contrast learning |
CN116844646B (en) * | 2023-09-04 | 2023-11-24 | 鲁东大学 | Enzyme function prediction method based on deep contrast learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114882945A (en) | Ensemble learning-based RNA-protein binding site prediction method | |
CN113593631B (en) | Method and system for predicting protein-polypeptide binding site | |
WO2019041333A1 (en) | Method, apparatus, device and storage medium for predicting protein binding sites | |
CN114420211A (en) | Attention mechanism-based RNA-protein binding site prediction method | |
CN111294058B (en) | Channel coding and error correction decoding method, equipment and storage medium | |
CN111490853A (en) | Channel coding parameter identification method based on deep convolutional neural network | |
EP3311318B1 (en) | Method for compressing genomic data | |
CN113936738B (en) | RNA-protein binding site prediction method based on convolutional neural network | |
Castelo et al. | Splice site identification by idl BNs | |
CN114023376B (en) | RNA-protein binding site prediction method and system based on self-attention mechanism | |
CN114582420B (en) | Transcription factor binding site prediction method and system based on fault-tolerant coding and multi-scale dense connection network | |
CN107577918A (en) | The recognition methods of CpG islands, device based on genetic algorithm and hidden Markov model | |
Song et al. | Importance weighted expectation-maximization for protein sequence design | |
CN115169518A (en) | Method and device for algorithm optimization | |
CN113539358B (en) | Hilbert coding-based enhancer-promoter interaction prediction method and device | |
CN111126560A (en) | Method for optimizing BP neural network based on cloud genetic algorithm | |
Kao et al. | naiveBayesCall: An efficient model-based base-calling algorithm for high-throughput sequencing | |
Brejová et al. | Optimal spaced seeds for Hidden Markov Models, with application to homologous coding regions | |
CN115088038A (en) | Improved quality value compression framework in aligned sequencing data based on new context | |
CN117334252A (en) | Cancer driving gene identification method based on heterophilic graph information maximization | |
Dawy et al. | Mutual information based distance measures for classification and content recognition with applications to genetics | |
CN112365924A (en) | Bidirectional trinucleotide position specificity preference and point combined mutual information DNA/RNA sequence coding method | |
WO2017158330A1 (en) | Compression/decompression method and apparatus for genomic variant call data | |
CN111859807A (en) | Initial pressure optimizing method, device, equipment and storage medium for steam turbine | |
US20080103701A1 (en) | Automatic signal processor design software system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20220809 |