CN114882945A - Ensemble learning-based RNA-protein binding site prediction method - Google Patents

Ensemble learning-based RNA-protein binding site prediction method

Info

Publication number
CN114882945A
Authority
CN
China
Prior art keywords
models
rna
model
network
protein binding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210807909.2A
Other languages
Chinese (zh)
Inventor
潘正森
周树森
邹海林
柳婵娟
王庆军
臧睦君
刘通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai New And Old Kinetic Energy Conversion Research Institute And Yantai Demonstration Base For Transfer And Transformation Of Scientific And Technological Achievements
Ludong University
Original Assignee
Yantai New And Old Kinetic Energy Conversion Research Institute And Yantai Demonstration Base For Transfer And Transformation Of Scientific And Technological Achievements
Ludong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai New And Old Kinetic Energy Conversion Research Institute And Yantai Demonstration Base For Transfer And Transformation Of Scientific And Technological Achievements, Ludong University
Priority to CN202210807909.2A
Publication of CN114882945A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Biotechnology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the field of bioinformatics and relates to an ensemble learning-based RNA-protein binding site prediction method that aims to improve the performance of deep learning models in predicting RNA-protein binding sites. First, a convolutional neural network, a residual neural network and a long short-term memory network are selected as the sub-model networks of the ensemble learning model. Second, the three sub-model networks are trained separately. Finally, the RNA sequence to be predicted is preprocessed and used as the input of the three sub-model networks, and the average of their outputs is taken as the final prediction result. The proposed model was tested on the RBP-24 data set, where it obtained an average AUC of 0.951 over the 24 experiments, exceeding the other methods compared.

Description

Ensemble learning-based RNA-protein binding site prediction method
Technical Field
The invention belongs to the field of bioinformatics and relates to an ensemble learning-based RNA-protein binding site prediction method, which involves techniques such as convolutional neural networks, long short-term memory networks, residual neural networks and RNA sequence data processing.
Background
RNA-binding proteins play an important role in regulating the vital activities of living cells, but high-throughput assays for finding RNA-protein binding sites are time-consuming and laborious, and interference from experimental noise causes them to produce large numbers of false positive and false negative samples. Deep learning is a powerful tool for predicting RNA-binding proteins, but most existing deep learning methods use only one type of network.
Disclosure of Invention
The main innovation of the invention is a novel ensemble learning-based RNA-protein binding site prediction method that integrates several deep learning methods to improve model performance in predicting RNA-protein binding sites.
An ensemble learning-based RNA-protein binding site prediction method, comprising the following steps:
S1: Construct the ensemble learning model: use 3 deep learning models, namely a convolutional neural network, a convolutional long short-term memory network and a residual neural network, as the sub-models of the ensemble learning model;
S2: Train and save the sub-models: train each of the 3 sub-network models from S1 on the training set data, reducing the loss with the back-propagation algorithm, and save the 3 trained sub-network models;
S3: Predict binding sites: preprocess the RNA sequence to be predicted, and use the 3 sub-network models saved in S2 to predict whether the RNA sequence contains RNA-protein binding sites.
An ensemble learning-based RNA-protein binding site prediction method, wherein the implementation process of step S1 is as follows:
A convolutional neural network, a convolutional long short-term memory network and a residual neural network are selected as the sub-models of the ensemble learning model. The types and number of sub-models are not limited to these 3: the better the individual sub-models perform and the more sub-models there are, the better the ensemble learning model performs.
An ensemble learning-based RNA-protein binding site prediction method, wherein the implementation process of step S2 is as follows:
Train each of the 3 sub-network models from step S1 separately, using a cross-entropy loss function to reduce the model loss; train each model for 50 epochs and save the 3 trained sub-network models separately.
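The training objective in step S2 can be illustrated with a minimal sketch. It assumes binary labels (1 = the sequence contains a binding site, 0 = it does not) and a single probability output per sequence; the patent names only "cross entropy" and "50 rounds", so the binary form, the clipping constant `eps`, the learning rate and the one-weight logistic model standing in for a real sub-network are all assumptions made for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if len(y_true) != len(y_prob):
        raise ValueError("labels and predictions must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def train_toy_model(xs, ys, epochs=50, lr=0.5):
    """Hypothetical stand-in for one sub-model: a single weight and bias
    trained by gradient descent on the cross-entropy loss for `epochs` rounds."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        probs = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        history.append(binary_cross_entropy(ys, probs))
        # For a sigmoid output, the gradient of the mean loss is
        # mean((p - y) * x) for the weight and mean(p - y) for the bias.
        grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
        grad_b = sum(p - y for p, y in zip(probs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


# Toy data, invented for illustration only.
w, b, history = train_toy_model([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
print(f"loss before training: {history[0]:.4f}, after 50 epochs: {history[-1]:.4f}")
```

In a real implementation the gradient step would be performed by a deep learning framework's back-propagation over the network weights; only the loss definition and the 50-epoch loop structure carry over from this sketch.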
An ensemble learning-based RNA-protein binding site prediction method, wherein the implementation process of step S3 is as follows:
Preprocess the RNA sequence data to be predicted and feed the preprocessed data to each of the 3 sub-network models saved in step S2. Each sub-network model produces a prediction result, and the final prediction result is the average of the prediction results of the 3 sub-network models.
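The averaging in step S3 can be sketched as follows. The sub-models are represented here by plain callables that return a binding probability; the function name `ensemble_predict`, the stand-in lambdas and the 0.5 decision threshold are assumptions for illustration, since the patent states only that the final result is the mean of the sub-model outputs.

```python
def ensemble_predict(sub_models, encoded_input, threshold=0.5):
    """Average the probabilities returned by each sub-model.

    `sub_models` is any sequence of callables mapping an encoded RNA
    sequence to a probability in [0, 1]. The threshold is an assumed
    decision rule and is not specified in the patent.
    """
    if not sub_models:
        raise ValueError("at least one sub-model is required")
    probs = [model(encoded_input) for model in sub_models]
    mean_prob = sum(probs) / len(probs)
    return mean_prob, mean_prob >= threshold


# Hypothetical stand-ins for the trained CNN, CNN-LSTM and residual sub-models.
stand_in_models = [lambda x: 0.9, lambda x: 0.8, lambda x: 0.4]
probability, is_binding_site = ensemble_predict(stand_in_models, encoded_input=None)
print(round(probability, 3), is_binding_site)
```

Because the function accepts any number of callables, extending the ensemble beyond 3 sub-models, as the description allows, requires no change to the averaging code.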
RNA-binding proteins are highly involved in human life activities, and studies have shown that mutations in RNA-binding proteins can cause several serious human diseases. Therefore, decoding the RNA protein binding site is of great significance for the research and treatment of related diseases in the medical field. The RNA-protein binding site prediction method based on ensemble learning is helpful for rapidly and accurately recognizing the RNA-protein binding site in the RNA sequence.
Drawings
FIG. 1 is a flow diagram of training and testing an ensemble learning model.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
The invention aims to provide an ensemble learning-based RNA-protein binding site prediction method, which comprises the following steps:
Step 1: Preprocess the RNA sequences using windows of nucleotide length 101, 151, 201, 251, 301, 351, 401, 451 and 501, and convert the processed sequences into one-hot encoded matrices, giving 9 one-hot encoded matrices.
Step 2: Build the convolutional neural network sub-model, which comprises two convolutional layers, a fully connected layer and an output layer. Train the model for 50 epochs on the 9 one-hot encoded matrices preprocessed from the training set, and save the trained convolutional neural network sub-model.
Step 3: Build the convolutional long short-term memory network sub-model, which comprises a convolutional layer, a long short-term memory layer, a fully connected layer and an output layer. Train the model for 50 epochs on the 9 one-hot encoded matrices preprocessed from the training set, and save the trained convolutional long short-term memory network sub-model.
Step 4: Build the residual neural network sub-model, which comprises 9 residual blocks, each consisting of two convolutional layers, followed by a fully connected layer and an output layer. Train the model for 50 epochs on the 9 one-hot encoded matrices preprocessed from the training set, and save the trained residual neural network sub-model.
Step 5: Preprocess the RNA sequence to be predicted into 9 one-hot encoded matrices in the same way as the training set. Forward-propagate them once through the saved convolutional neural network sub-model to obtain 9 prediction results; forward-propagate them once through the saved convolutional long short-term memory network sub-model to obtain 9 prediction results; forward-propagate them once through the saved residual neural network sub-model to obtain 9 prediction results. The final prediction result is the average of the 27 prediction results.
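The preprocessing and 27-way averaging in the steps above can be sketched as follows. The patent does not say how a sequence is fitted to each window, so this sketch assumes a center crop for longer sequences and symmetric padding with `N` for shorter ones, with `N` encoded as an all-zero row; those choices, the function names and the stand-in models are assumptions, not the patented procedure.

```python
WINDOW_LENGTHS = (101, 151, 201, 251, 301, 351, 401, 451, 501)
ALPHABET = "ACGU"


def fit_to_window(seq, length, pad="N"):
    """Center-crop or symmetrically pad `seq` to exactly `length` nucleotides
    (assumed scheme; the source only lists the window lengths)."""
    seq = seq.upper().replace("T", "U")
    if len(seq) >= length:
        start = (len(seq) - length) // 2
        return seq[start:start + length]
    missing = length - len(seq)
    left = missing // 2
    return pad * left + seq + pad * (missing - left)


def one_hot(seq):
    """Encode a sequence as rows of 4 values; symbols outside ACGU become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


def encode_all_windows(seq):
    """Return the 9 one-hot matrices, one per window length."""
    return [one_hot(fit_to_window(seq, n)) for n in WINDOW_LENGTHS]


def predict_all_windows(sub_models, seq):
    """Average every sub-model's output over every window (3 x 9 = 27 values)."""
    matrices = encode_all_windows(seq)
    predictions = [model(m) for model in sub_models for m in matrices]
    return sum(predictions) / len(predictions)


# Hypothetical stand-ins for the three trained sub-models.
stand_ins = [lambda m: 0.2, lambda m: 0.5, lambda m: 0.8]
example = "ACGU" * 30  # toy 120-nt sequence, invented for illustration
print([len(m) for m in encode_all_windows(example)])
print(round(predict_all_windows(stand_ins, example), 3))
```

In practice each sub-model would need either one set of weights per window length or an architecture that accepts variable-length input; the source does not state which, so the sketch leaves that to the callables.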
The results of the experiments are shown in the following table:
Table 1. Comparative experimental results [table provided as an image in the original]
As can be seen from Table 1, the proposed ensemble learning method reaches an average AUC of 0.951 over the 24 data sets, exceeding GraphProt, deepnet-rbp, iDeepE, DeepCLIP, iDeepC and MCNN. This demonstrates that the ensemble learning-based RNA-protein binding site prediction method is effective.
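For readers unfamiliar with the metric, the average AUC reported above can be computed as in the following sketch. It uses the pairwise definition of AUC (the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half) and then takes the unweighted mean over data sets. The function names and toy scores are invented for illustration and are not the patent's experimental data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auc(per_dataset):
    """Unweighted mean AUC over an iterable of (labels, scores) pairs."""
    aucs = [roc_auc(labels, scores) for labels, scores in per_dataset]
    return sum(aucs) / len(aucs)


# Two toy data sets standing in for the 24 RBP-24 experiments.
toy = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1], [0.2, 0.9]),
]
print(mean_auc(toy))
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a rank-based computation or a library routine would be the usual choice on full-size test sets.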
The foregoing describes the invention in further detail with reference to specific preferred embodiments, and the invention is not intended to be limited to these specific details. Those skilled in the art to which the invention pertains may make simple deductions or substitutions without departing from the spirit of the invention, and all of these shall be considered to fall within the protection scope of the invention.

Claims (4)

1. An ensemble learning-based RNA-protein binding site prediction method, comprising the following steps:
S1: Construct the ensemble learning model: use 3 deep learning models, namely a convolutional neural network, a convolutional long short-term memory network and a residual neural network, as the sub-models of the ensemble learning model;
S2: Train and save the sub-models: train each of the 3 sub-network models from S1 on the training set data, reducing the loss with the back-propagation algorithm, and save the 3 trained sub-network models;
S3: Predict binding sites: preprocess the RNA sequence to be predicted, and use the 3 sub-network models saved in S2 to predict whether the RNA sequence contains RNA-protein binding sites.
2. The ensemble learning-based RNA-protein binding site prediction method according to claim 1, wherein the step S1 is implemented as follows:
a convolutional neural network, a convolutional long short-term memory network and a residual neural network are selected as the sub-models of the ensemble learning model;
the types and number of sub-models of the ensemble learning model are not limited to these 3: the better the individual sub-models perform and the more sub-models there are, the better the ensemble learning model performs.
3. The ensemble learning-based RNA-protein binding site prediction method according to claim 1, wherein the step S2 is implemented as follows:
each of the 3 sub-network models from step S1 is trained separately, using a cross-entropy loss function to reduce the model loss; each model is trained for 50 epochs, and the 3 trained sub-network models are saved separately.
4. The ensemble learning-based RNA-protein binding site prediction method according to claim 1, wherein the step S3 is implemented as follows:
the RNA sequence data to be predicted is preprocessed and fed to each of the 3 sub-network models saved in step S2; each sub-network model produces a prediction result, and the final prediction result is the average of the prediction results of the 3 sub-network models.
CN202210807909.2A 2022-07-11 2022-07-11 Ensemble learning-based RNA-protein binding site prediction method Withdrawn CN114882945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210807909.2A CN114882945A (en) 2022-07-11 2022-07-11 Ensemble learning-based RNA-protein binding site prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210807909.2A CN114882945A (en) 2022-07-11 2022-07-11 Ensemble learning-based RNA-protein binding site prediction method

Publications (1)

Publication Number Publication Date
CN114882945A (en) 2022-08-09

Family

ID=82683166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210807909.2A Withdrawn CN114882945A (en) 2022-07-11 2022-07-11 Ensemble learning-based RNA-protein binding site prediction method

Country Status (1)

Country Link
CN (1) CN114882945A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115966249A (en) * 2023-02-15 2023-04-14 北京科技大学 Fractional order neural network-based protein-ATP binding site prediction method and device
CN116844646A (en) * 2023-09-04 2023-10-03 鲁东大学 Enzyme function prediction method based on deep contrast learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446602A (en) * 2016-09-06 2017-02-22 中南大学 Prediction method and system for RNA binding sites in protein molecules
CN108763865A (en) * 2018-05-21 2018-11-06 成都信息工程大学 A kind of integrated learning approach of prediction DNA protein binding sites
CN110689920A (en) * 2019-09-18 2020-01-14 上海交通大学 Protein-ligand binding site prediction algorithm based on deep learning
CN113936738A (en) * 2021-12-14 2022-01-14 鲁东大学 RNA-protein binding site prediction method based on deep convolutional neural network
CN114420211A (en) * 2022-03-28 2022-04-29 鲁东大学 Attention mechanism-based RNA-protein binding site prediction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHENGSEN PAN et al.: "MCNN: multiple convolutional neural networks for RNA-protein binding sites prediction", IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS *
DONG Zhengxin et al.: "Advances in deep learning methods for RBP binding site prediction", Journal of Guilin University of Electronic Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115966249A (en) * 2023-02-15 2023-04-14 北京科技大学 Fractional order neural network-based protein-ATP binding site prediction method and device
CN115966249B (en) * 2023-02-15 2023-05-26 北京科技大学 protein-ATP binding site prediction method and device based on fractional order neural network
CN116844646A (en) * 2023-09-04 2023-10-03 鲁东大学 Enzyme function prediction method based on deep contrast learning
CN116844646B (en) * 2023-09-04 2023-11-24 鲁东大学 Enzyme function prediction method based on deep contrast learning

Similar Documents

Publication Publication Date Title
CN114882945A (en) Ensemble learning-based RNA-protein binding site prediction method
CN113593631B (en) Method and system for predicting protein-polypeptide binding site
WO2019041333A1 (en) Method, apparatus, device and storage medium for predicting protein binding sites
CN114420211A (en) Attention mechanism-based RNA-protein binding site prediction method
CN111294058B (en) Channel coding and error correction decoding method, equipment and storage medium
CN111490853A (en) Channel coding parameter identification method based on deep convolutional neural network
EP3311318B1 (en) Method for compressing genomic data
CN113936738B (en) RNA-protein binding site prediction method based on convolutional neural network
Castelo et al. Splice site identification by idl BNs
CN114023376B (en) RNA-protein binding site prediction method and system based on self-attention mechanism
CN114582420B (en) Transcription factor binding site prediction method and system based on fault-tolerant coding and multi-scale dense connection network
CN107577918A (en) The recognition methods of CpG islands, device based on genetic algorithm and hidden Markov model
Song et al. Importance weighted expectation-maximization for protein sequence design
CN115169518A (en) Method and device for algorithm optimization
CN113539358B (en) Hilbert coding-based enhancer-promoter interaction prediction method and device
CN111126560A (en) Method for optimizing BP neural network based on cloud genetic algorithm
Kao et al. naiveBayesCall: An efficient model-based base-calling algorithm for high-throughput sequencing
Brejová et al. Optimal spaced seeds for Hidden Markov Models, with application to homologous coding regions
CN115088038A (en) Improved quality value compression framework in aligned sequencing data based on new context
CN117334252A (en) Cancer driving gene identification method based on heterophilic graph information maximization
Dawy et al. Mutual information based distance measures for classification and content recognition with applications to genetics
CN112365924A (en) Bidirectional trinucleotide position specificity preference and point combined mutual information DNA/RNA sequence coding method
WO2017158330A1 (en) Compression/decompression method and apparatus for genomic variant call data
CN111859807A (en) Initial pressure optimizing method, device, equipment and storage medium for steam turbine
US20080103701A1 (en) Automatic signal processor design software system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220809)