CN113764031B - Prediction method of N6 methyl adenosine locus in trans-tissue/species RNA - Google Patents


Info

Publication number
CN113764031B
Authority
CN
China
Prior art keywords
elmo
sequence
model
species
tissue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111089684.3A
Other languages
Chinese (zh)
Other versions
CN113764031A (en)
Inventor
樊永显
孙贵聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202111089684.3A
Publication of CN113764031A
Application granted
Publication of CN113764031B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Analytical Chemistry (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Physiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for predicting N6-methyladenosine (m6A) sites in RNA sequences across tissues and species, comprising the following steps: obtaining positive sample sequences and an equal number of negative sample sequences across species/tissues; performing motif mining with the STREME method; performing feature encoding with ELMo, a method borrowed from natural language processing (NLP); constructing a prediction model and feeding it the encoded data to obtain a preliminary prediction result; tuning the parameters of the trained prediction model to optimize it; performing an interpretability analysis of the model with the Integrated Gradients (IG) method; and evaluating the prediction model with 5-fold cross-validation and independent testing. Because the method uses ELMo for feature encoding, it requires no biological prior knowledge, and it uses a deep network to mine deep information from the sequence, thereby improving the accuracy of m6A modification site prediction.

Description

Prediction method of N6 methyl adenosine locus in trans-tissue/species RNA
Technical Field
The invention relates to the technical field of modification site prediction in bioinformatics sequence analysis, and in particular to a method for predicting N6-methyladenosine (m6A) sites in RNA across tissues and species.
Background
RNA modification is a key post-transcriptional regulator of gene expression and affects a wide range of biological processes in eukaryotes. N6-methyladenosine (m6A), the most abundant and important post-transcriptional RNA modification, plays a broad role in biological processes such as splicing regulation, mRNA stability, translation efficiency and epigenetic regulation. A growing number of studies report that m6A methylation participates in and affects the pathogenesis of a variety of diseases. Exploring new disease treatment strategies based on m6A RNA methylation is therefore of great importance.
High-throughput sequencing techniques have greatly facilitated the study of m6A modification sites, but they remain time-consuming and expensive. More efficient computational methods and tools are therefore needed to complement wet-lab experiments. In recent years, machine learning has been widely applied to the modification site prediction problem. In 2018, Huang et al. integrated multiple sequence encoding schemes and machine learning algorithms through different ensemble strategies. In 2019, Feng et al. incorporated the physicochemical properties of nucleotides and improved m6A site prediction performance. In 2020, Dao et al. predicted m6A sites accurately with a support vector machine (SVM), using features extracted from the RNA sequence on the basis of biological prior knowledge. As m6A site data continue to accumulate, deep learning methods are emerging, and several of them have shown that deep learning can predict m6A sites in different species with good performance.
Disclosure of Invention
The invention aims to improve on the accuracy of existing N6-methyladenosine (m6A) site prediction methods and provides a predictor based on contextual language embeddings for detecting m6A sites in RNA. The prediction method can predict m6A modification sites across species and tissues, requires no biological prior knowledge for feature encoding, reduces computation time, builds an optimal classification model, and improves the accuracy of m6A site prediction.
In order to solve the problems existing in the prior art, the invention adopts the following technical scheme:
A method for predicting N6-methyladenosine (m6A) sites in RNA across tissues and species comprises the following steps:
1) Acquiring a sample data set: positive sample sequences and an equal number of negative sample sequences are obtained across species/tissues (multiple tissues from 3 species);
2) Motif mining: motif mining is performed on the sample data with the STREME method, and the motif information of the positive and negative samples of the data set is mined, compared and analyzed;
3) Feature encoding: feature encoding is performed with ELMo, an encoding scheme borrowed from NLP. The encoding is formed by concatenating the ELMo bottom-layer word embedding with the hidden states of the two BiLSTM layers, so that each sequence is finally represented by a two-dimensional numerical matrix;
4) Model construction: once the ELMo feature encoding is obtained, CNNs with different kernel sizes extract multi-scale abstract features, which are fed into a BiLSTM to capture long-range dependencies and produce the decision result;
5) Model tuning: the prediction model is optimized with methods such as batch normalization, dropout and early stopping;
6) Model interpretability analysis: the Integrated Gradients (IG) method is used to compute integrated gradients that quantify the contribution score of each base in the sequence, for visualization and difference analysis;
7) Model evaluation: the model is evaluated by 5-fold cross-validation and independent testing, with performance measured by sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC).
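Step 5) names early stopping and, in the detailed description, a callback that adjusts the learning rate. The following is a minimal sketch of that logic in plain Python, replaying a list of validation losses instead of training a real network. The function name, the patience values, the reduction factor and the loss values are all hypothetical; the patent does not state them.

```python
def replay_callbacks(val_losses, lr=1e-3, lr_patience=2, lr_factor=0.5, stop_patience=4):
    """Replay per-epoch validation losses through two simple callbacks.

    - Learning-rate reduction: multiply lr by lr_factor every lr_patience
      consecutive epochs without improvement (hypothetical schedule).
    - Early stopping: stop after stop_patience consecutive epochs without
      improvement and report the best epoch seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: in real training its weights would be saved here.
            best_loss, best_epoch, epochs_since_best = loss, epoch, 0
            continue
        epochs_since_best += 1
        if epochs_since_best % lr_patience == 0:
            lr *= lr_factor
        if epochs_since_best >= stop_patience:
            break  # early stop: later epochs are never run
    return best_epoch, best_loss, lr


# Toy validation curve: improves until epoch 2, then degrades.
toy_losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.63, 0.50]
best_epoch, best_loss, final_lr = replay_callbacks(toy_losses)
```

In this toy run the loop stops at epoch 6, so the low loss at epoch 7 is never reached; that is the usual trade-off of a small patience value.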
In step 2), STREME, a recent motif discovery method, is used to analyze the consensus sequences representing m6A sites.
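STREME is a separate tool and is not reproduced here. As a rough illustration of the underlying idea of contrasting positive and negative samples, the sketch below ranks k-mers by a smoothed log2 enrichment ratio. This is a simplified stand-in, not the STREME algorithm; the function names, the pseudocount and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(positives, negatives, k=5, pseudocount=1.0):
    """Rank k-mers by log2(positive frequency / negative frequency)."""
    pos = kmer_counts(positives, k)
    neg = kmer_counts(negatives, k)
    vocab = set(pos) | set(neg)
    # Additive smoothing keeps the ratio finite for k-mers absent from one set.
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    scores = {}
    for kmer in vocab:
        pos_freq = (pos[kmer] + pseudocount) / pos_total
        neg_freq = (neg[kmer] + pseudocount) / neg_total
        scores[kmer] = math.log2(pos_freq / neg_freq)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: every positive sequence carries the 5-mer GGACU, no negative does.
toy_positives = ["AUGGACUAC", "CCGGACUGA", "UAGGACUCU"]
toy_negatives = ["AUGCAUGAC", "CCUUAGCGA", "UAGCUAUCU"]
ranked = enriched_kmers(toy_positives, toy_negatives, k=5)
top_kmer, top_score = ranked[0]
```

A real motif finder also handles degenerate positions and statistical significance, which this sketch leaves out.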
In step 3), ELMo encoding is used as a form of cross-domain transfer learning. We used the officially released ELMo model to learn RNA embeddings from our data. The representation of each base is computed as follows:

$$R_k = \{x_k, \overrightarrow{h}_{k,j}, \overleftarrow{h}_{k,j} \mid j = 1, \dots, L\} = \{h_{k,j} \mid j = 0, \dots, L\} \tag{1}$$

In formula (1), k is the index of the base and j is the index of the layer, so $h_{k,0} = x_k$ is the bottom-layer word embedding and $h_{k,j} = [\overrightarrow{h}_{k,j}; \overleftarrow{h}_{k,j}]$ for $j \geq 1$. Thanks to its autoregressive structure, ELMo can make full use of the context on both sides to represent the current word. For each word in the sentence, ELMo generates three 256-D embedding vectors, including the bottom-layer embedding that captures word-level information. We combine the three layers of word embeddings into the final feature encoding:

$$E_k = \mathrm{con}(h_{k,0}, h_{k,1}, h_{k,2}) \tag{2}$$

In formula (2), con(·) is a linear concatenation operation.
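The concatenation step can be sketched in a few lines. The example below does not call ELMo; it uses constant toy vectors in place of the three 256-D layer outputs and only shows how con(·) turns them into one row per base, so that a sequence becomes a two-dimensional matrix. The sequence length of 41 and the helper names are assumptions for illustration.

```python
def con(*layer_vectors):
    """Concatenate the per-layer vectors of one base into a single vector."""
    merged = []
    for vector in layer_vectors:
        merged.extend(vector)
    return merged


def encode_sequence(layer_outputs):
    """Build the 2-D feature matrix of one sequence.

    layer_outputs[j][k] is the vector produced by layer j for base k.
    The result has one row per base and the layer vectors side by side.
    """
    n_bases = len(layer_outputs[0])
    return [con(*(layer[k] for layer in layer_outputs)) for k in range(n_bases)]


# Toy stand-in for ELMo output: 3 layers, 41 bases (assumed length), 256-D each.
SEQ_LEN, DIM = 41, 256
toy_layers = [[[float(j)] * DIM for _ in range(SEQ_LEN)] for j in range(3)]
features = encode_sequence(toy_layers)
print(len(features), len(features[0]))  # 41 768
```

With three 256-D layers, each base ends up with a 3 × 256 = 768-dimensional row.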
In step 6), visualization and difference analysis are performed with the IG method, which satisfies two axioms: sensitivity and implementation invariance. The integrated gradient from the baseline $x'$ to the input $x$ is computed as follows:

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha \tag{3}$$

Equation (3) quantifies the influence of each input feature value on the network output, where F is the network and i indexes the input features. Specifically, we take the sequence encoding as input, compute the contribution score of each nucleotide with IG, and draw the sequence logo accordingly.
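To make the integrated-gradient computation concrete, the sketch below approximates the path integral with a midpoint Riemann sum. It uses a toy model (a sigmoid of a weighted sum) with an analytic gradient instead of the patent's network; the weights, input, all-zero baseline and step count are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy differentiable model F: sigmoid of a weighted sum of the inputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def gradient(x, w):
    """Analytic gradient of the toy model with respect to each input."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w)):
            avg_grad[i] += g / steps
    # Scale the path-averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


w = [1.5, -2.0, 0.5]          # hypothetical weights
x = [1.0, 0.5, 1.0]           # hypothetical input
baseline = [0.0, 0.0, 0.0]    # all-zero baseline (an assumption)
attributions = integrated_gradients(x, baseline, w)
completeness_gap = abs(sum(attributions) - (model(x, w) - model(baseline, w)))
```

A useful sanity check is that the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline, which is what `completeness_gap` measures here.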
Beneficial effects:
The invention uses ELMo, borrowed from NLP, for feature encoding, then applies CNNs with different kernel sizes to extract multi-scale abstract features, which are fed into a BiLSTM to capture long-range dependencies and produce the decision result. Experimental results show a marked performance improvement over prior methods: the AUC values on all data sets are higher than those of state-of-the-art methods. Furthermore, our analysis of the m6A sequences found that they are highly conserved across the three mammals, which supports the feasibility of the method as a cross-species predictor. The method also shows strong generalization ability in cross-species/tissue validation and performs well on independent datasets.
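The multi-scale idea, with the kernel sizes 2 and 3 given in the detailed description, can be illustrated with a hand-written 1-D convolution. This is only a sketch: it applies one filter per branch to a tiny toy matrix, and it omits the BiLSTM and the sigmoid output layer. The filter weights and input values are made up for illustration.

```python
def conv1d_valid(features, kernel, bias=0.0):
    """Slide one filter along the sequence axis (stride 1, no padding), then ReLU.

    features: list of rows, one feature vector per base.
    kernel:   list of rows, one weight vector per position in the window.
    """
    window = len(kernel)
    outputs = []
    for start in range(len(features) - window + 1):
        total = bias
        for offset in range(window):
            row = features[start + offset]
            total += sum(w * v for w, v in zip(kernel[offset], row))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


# Toy encoded sequence: 6 bases, 2 features per base.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
pair_kernel = [[1.0, 0.0], [0.0, 1.0]]                # window of 2 bases
codon_kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # window of 3 bases

pair_branch = conv1d_valid(toy_features, pair_kernel)
codon_branch = conv1d_valid(toy_features, codon_kernel)
print(len(pair_branch), len(codon_branch))  # 5 4
```

The two branches produce feature maps of different lengths (L - 1 and L - 2 for a sequence of length L), which is why each branch would feed its own recurrent layer before the outputs are merged.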
Drawings
FIG. 1 is a flow chart of the method for predicting N6-methyladenosine sites in RNA sequences across tissues and species.
FIG. 2 shows the statistically significant consensus motifs in the 11 data sets.
FIG. 3 shows attribution maps for the examples with the highest and lowest prediction probabilities.
FIG. 4 shows the results of the cross-validation comparison experiments.
FIG. 5 shows the results of the independent test comparison experiments.
Detailed Description
The present invention will now be further illustrated with reference to the drawings and examples, but is not limited thereto.
Example: as shown in FIG. 1, a method for predicting N6-methyladenosine sites in RNA sequences across tissues and species comprises the following steps:
1) Acquiring a sample data set: positive sample sequences and an equal number of negative sample sequences were obtained across species/tissues (multiple tissues from 3 species); detailed information about the datasets is provided in Table 1. For convenience, we labeled the 11 datasets by species and tissue as follows: human (brain: h_b, kidney: h_k and liver: h_l), mouse (brain: m_b, heart: m_h, kidney: m_k, liver: m_l and testis: m_t) and rat (brain: r_b, kidney: r_k and liver: r_l).
TABLE 1. Details of the reference datasets for predicting m6A sites
2) Motif mining: the STREME method is used to perform motif mining on the sample data. STREME is a recent motif discovery method that can mine, compare and analyze the motif information of the positive and negative samples of a data set. As can be seen from FIG. 2, the motifs are highly conserved across the three mammals, which supports the feasibility of using the method as a cross-species predictor;
3) Feature encoding: feature encoding is performed with ELMo, an encoding scheme borrowed from NLP. The encoding is formed by concatenating the ELMo bottom-layer word embedding with the hidden states of the two BiLSTM layers, so that each sequence is finally represented by a two-dimensional numerical matrix. Specifically, we used the officially released ELMo model to learn RNA embeddings from our data. The representation of each base is computed as follows:

$$R_k = \{x_k, \overrightarrow{h}_{k,j}, \overleftarrow{h}_{k,j} \mid j = 1, \dots, L\} = \{h_{k,j} \mid j = 0, \dots, L\} \tag{1}$$

In formula (1), k is the index of the base and j is the index of the layer, so $h_{k,0} = x_k$ is the bottom-layer word embedding and $h_{k,j} = [\overrightarrow{h}_{k,j}; \overleftarrow{h}_{k,j}]$ for $j \geq 1$. Thanks to its autoregressive structure, ELMo can make full use of the context on both sides to represent the current word. For each word in the sentence, ELMo generates three 256-D embedding vectors, including the bottom-layer embedding that captures word-level information. We combine the three layers of word embeddings into the final feature encoding:

$$E_k = \mathrm{con}(h_{k,0}, h_{k,1}, h_{k,2}) \tag{2}$$

In formula (2), con(·) is a linear concatenation operation;
4) Model construction: the ELMo feature encoding is taken as input, and CNNs with different kernel sizes extract multi-scale abstract features. Specifically, a convolution of size 2 extracts features at the level of base pairs and a convolution of size 3 extracts features at the level of codons, forming two deep network branches. The output of each branch is then fed into a BiLSTM to mine deep information, and finally a fully connected layer with a sigmoid activation function predicts the m6A site;
5) Model tuning: we add a batch normalization operation after the first convolution layer to control gradient explosion, and add several dropout layers with a rate of 0.3 to prevent overfitting. In addition, the learning rate is adjusted dynamically through a callback function, and the best model is obtained by early stopping, thereby optimizing the prediction model;
6) Model interpretability analysis: visualization and difference analysis are performed with the IG method, which satisfies two axioms: sensitivity and implementation invariance. The integrated gradient from the baseline $x'$ to the input $x$ is computed as follows:

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha \tag{3}$$

Equation (3) quantifies the influence of each input feature value on the network output, where F is the network and i indexes the input features. Specifically, we take the sequence encoding as input, compute the integrated gradient of each nucleotide with IG to quantify its contribution score, and draw the sequence logo accordingly, as shown in FIG. 3, for visualization and difference analysis;
7) Model evaluation: to evaluate the predictor, we used 5-fold cross-validation to test its validity and an independent dataset to verify its generalization ability, with five metrics. The first four are sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC), defined as:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of true positive samples, FP the number of false positive samples, TN the number of true negative samples and FN the number of false negative samples. The fifth metric is the area under the receiver operating characteristic (ROC) curve (AUC); the ROC curve itself is also used to assess the performance of the predictor;
Finally, the method is compared with existing state-of-the-art methods. The cross-validation comparison is shown in FIG. 4 and the independent test comparison in FIG. 5; the prediction accuracy of the method in this embodiment is clearly better than that of the other methods.
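The evaluation protocol in step 7) can be sketched with two small helpers: one that produces 5-fold train/validation index splits, and one that computes Sn, Sp, Acc and MCC from confusion-matrix counts using their standard definitions. The function names, the seed and the example counts are hypothetical, and AUC is left out for brevity.

```python
import math
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Return k (train_indices, validation_indices) pairs over shuffled indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def classification_metrics(tp, fp, tn, fn):
    """Sn, Sp, Acc and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative sample, so the Sn and Sp
    denominators are non-zero; MCC falls back to 0.0 if its denominator is 0.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


splits = k_fold_splits(n_samples=23, k=5)
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

In a full run, the model would be trained on each `train` list, scored on the matching `validation` list, and the five sets of metrics averaged; the independent test set is kept out of this loop entirely.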

Claims (2)

1. A method for predicting N6-methyladenosine sites in RNA sequences across tissues and species, comprising the following steps:
1) Acquiring a sample data set: acquiring cross-tissue, cross-species positive sample sequences and an equal number of negative sample sequences;
2) Motif mining: performing motif mining on the sample data with the STREME method, and mining, comparing and analyzing the motif information of the positive and negative samples of the data set;
3) Feature encoding: performing feature encoding with ELMo, an encoding scheme borrowed from NLP, wherein the encoding is formed by concatenating the ELMo bottom-layer word embedding with the hidden states of the two BiLSTM layers, so that each sequence is finally represented by a two-dimensional numerical matrix; the ELMo model is applied to learn RNA embeddings from the sequence data, where the representation of each base is computed as follows:

$$R_k = \{x_k, \overrightarrow{h}_{k,j}, \overleftarrow{h}_{k,j} \mid j = 1, \dots, L\} = \{h_{k,j} \mid j = 0, \dots, L\} \tag{1}$$

in formula (1), k is the index of the base and j is the index of the layer, so $h_{k,0} = x_k$ is the bottom-layer word embedding and $h_{k,j} = [\overrightarrow{h}_{k,j}; \overleftarrow{h}_{k,j}]$ for $j \geq 1$;
thanks to its autoregressive structure, ELMo can make full use of the context on both sides to represent the current word, generating three 256-D embedding vectors for each word in a sentence, including the bottom-layer embedding that captures word-level information; the three layers of word embeddings are combined into the final feature encoding:

$$E_k = \mathrm{con}(h_{k,0}, h_{k,1}, h_{k,2}) \tag{2}$$

in formula (2), con(·) is a linear concatenation operation;
4) Model construction: once the ELMo feature encoding is obtained, extracting multi-scale abstract features with CNNs of different kernel sizes and feeding them into a BiLSTM to capture long-range dependencies and produce the decision result;
5) Model tuning: optimizing the prediction model with batch normalization, dropout and early stopping;
6) Model interpretability analysis: using the Integrated Gradients (IG) method to compute integrated gradients that quantify the contribution score of each base in the sequence, for visualization and difference analysis;
7) Model evaluation: evaluating the model by 5-fold cross-validation and independent testing, with performance measured by sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC).
2. The method of claim 1, wherein the ELMo encoding used in step 3) is a form of cross-domain transfer learning.
CN202111089684.3A 2021-09-16 2021-09-16 Prediction method of N6 methyl adenosine locus in trans-tissue/species RNA Active CN113764031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089684.3A CN113764031B (en) 2021-09-16 2021-09-16 Prediction method of N6 methyl adenosine locus in trans-tissue/species RNA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089684.3A CN113764031B (en) 2021-09-16 2021-09-16 Prediction method of N6 methyl adenosine locus in trans-tissue/species RNA

Publications (2)

Publication Number Publication Date
CN113764031A CN113764031A (en) 2021-12-07
CN113764031B true CN113764031B (en) 2023-07-18

Family

ID=78796311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089684.3A Active CN113764031B (en) 2021-09-16 2021-09-16 Prediction method of N6 methyl adenosine locus in trans-tissue/species RNA

Country Status (1)

Country Link
CN (1) CN113764031B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424663B (en) * 2022-10-14 2024-04-12 徐州工业职业技术学院 RNA modification site prediction method based on attention bidirectional expression model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Method for predicting N6-methyladenosine modification sites in RNA based on stacking ensemble

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11538558B2 (en) * 2018-10-11 2022-12-27 The Regents Of The University Of California Optimization of gene sequences for protein expression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
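m6A site identification based on saliency reduction of nucleic acid physicochemical properties; 张明; 徐妍; 陈韬; 王长宝; 於东军; Journal of Nanjing University of Science and Technology (No. 02); full text *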

Also Published As

Publication number Publication date
CN113764031A (en) 2021-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant