CN113837293A - mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium - Google Patents

mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium

Info

Publication number
CN113837293A
CN113837293A (application CN202111138369.5A)
Authority
CN
China
Prior art keywords
mrna
subcellular
algorithm
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111138369.5A
Other languages
Chinese (zh)
Other versions
CN113837293B (en)
Inventor
Quan Zou (邹权)
Jing Li (李静)
Junping Du (杜军平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111138369.5A priority Critical patent/CN113837293B/en
Publication of CN113837293A publication Critical patent/CN113837293A/en
Application granted granted Critical
Publication of CN113837293B publication Critical patent/CN113837293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a training method for an mRNA subcellular localization model, comprising the following steps: acquiring a sample set of mRNA subcellular position sequences; performing feature extraction on the sample set according to a plurality of feature extraction algorithms; identifying the resulting features with respective base classifiers; integrating the base classifiers over one or more layers; and obtaining a target mRNA subcellular localization model from the feature extraction algorithms and the integrated classifier. Through ensemble learning and training over multiple classifiers, the invention improves training efficiency, allows the model to reach a global optimum more easily during training, and gives the trained target model better prediction and generalization capability.

Description

mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to an mRNA subcellular localization model training method, an mRNA subcellular localization model localization method and a readable storage medium.
Background
Currently, subcellular localization of RNA is considered an important mechanism of cellular polarization during the development of unicellular organisms, animal and plant tissues, and animal embryos. Localization of mRNA transcripts has been shown to spatially restrict gene expression and protein translation. Approximately 80% of transcripts are distributed asymmetrically in human cells, and mislocalization of transcripts can lead to diseases such as spinal muscular atrophy, Alzheimer's disease, and cancer. In recent years, a large number of machine-learning-based subcellular localization algorithms have been developed. mRNA localization corresponds to the localization of protein translation and thus contributes to the study of protein function. However, current studies on eukaryotic mRNA subcellular localization show significant limitations: they are often based on the extraction of a single kind of sequence information and are deficient in prediction and generalization capability. To better achieve subcellular localization of eukaryotic mRNA, models with better performance and more comprehensive functionality must be established and trained.
Disclosure of Invention
Addressing the prior art's reliance on single-sequence feature extraction and its insufficient prediction and generalization capability, the invention provides a training method for an mRNA subcellular localization model, a localization method, and a readable storage medium.
According to an embodiment of the present invention, the present invention provides a training method of an mRNA subcellular localization model, comprising the following steps:
S1, acquiring an mRNA subcellular position sequence sample set;
S2, performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets;
S3, identifying the feature sets respectively according to a plurality of base classifiers, and performing at least one layer of integration on the base classifiers to obtain an integrated classifier;
S4, obtaining a target mRNA subcellular localization model according to the plurality of feature extraction algorithms and the integrated classifier.
Optionally, step S1 includes the following steps:
S11, obtaining mRNA subcellular position sequence data as positive data and negative data;
S12, performing data processing on the positive data and the negative data to obtain the mRNA subcellular position sequence sample set.
Optionally, in step S2, the plurality of feature extraction algorithms includes any three or more of: an electron-ion interaction pseudo-trinucleotide (PseEIIP) algorithm, a trinucleotide composition (TNC) algorithm, a dinucleotide composition (DNC) algorithm, a composition of k-spaced nucleic acid pairs (CKSNAP) algorithm, a parallel-correlation pseudo-dinucleotide composition (PCPseDNC) algorithm, a parallel-correlation pseudo-trinucleotide composition (PCPseTNC) algorithm, a series-correlation pseudo-dinucleotide composition (SCPseDNC) algorithm, a series-correlation pseudo-trinucleotide composition (SCPseTNC) algorithm, and a dinucleotide-based auto-cross covariance (DACC) algorithm.
Optionally, step S3 includes the following steps:
S31, matching each feature set, according to the data characteristics of its subcellular positions and its time complexity, with a corresponding base classifier for identification;
S32, performing at least one layer of integration on the base classifiers to obtain a target weight parameter;
S33, obtaining an integrated classifier according to the base classifiers and the target weight parameter.
Optionally, the base classifier comprises a LightGBM algorithm.
Optionally, step S3 is: identifying the feature sets respectively according to the base classifiers, and performing two-layer integration on the base classifiers to obtain the target weight parameters of the integrated classifier.
Optionally, step S3 includes the following steps:
S31, identifying the feature sets respectively according to the base classifiers to obtain prediction results;
S32, grouping the base classifiers according to their corresponding feature sets to obtain base classifier groups;
S33, obtaining a target weight parameter according to a grid search and an evaluation algorithm;
S34, obtaining a two-layer integrated classifier according to the base classifiers, the base classifier groups and the target weight parameter.
Optionally, the evaluation algorithm includes the ACC, Recall, Precision and F1-score metrics, computed as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
When a sample's true label is positive, TP and FN denote the numbers of such samples predicted positive and negative, respectively; when a sample's true label is negative, TN and FP denote the numbers predicted negative and positive, respectively.
According to an embodiment of the present invention, the present invention also provides a method for subcellular localization of mRNA, comprising the steps of:
obtaining mRNA subcellular position sequence samples;
and identifying the mRNA subcellular position sequence sample by using the target mRNA subcellular localization model to obtain a localization prediction result.
According to an embodiment of the invention, the invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the method steps described above when executed by a processor.
The invention has the beneficial effects that:
the method comprises the steps of extracting features of an mRNA subcellular position sequence sample set through various feature extraction algorithms, identifying a plurality of feature sets by utilizing a plurality of base classifiers respectively, integrating at least one layer of the base classifiers, training and learning a two-layer model through the mRNA subcellular position sequence sample set, and training the structure of the two layers. After the target model is used for identifying the mRNA subcellular position sequence sample needing to be predicted, the positioning prediction result can be more accurate.
Drawings
FIG. 1 is a flowchart of a method for training a model for mRNA subcellular localization according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a model for mRNA subcellular localization according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for training a model for mRNA subcellular localization according to an embodiment of the present invention;
FIG. 4 is a graph comparing the performance of a single feature-based classifier, a single-layer integration model, and a two-layer integration model according to an embodiment of the present invention;
FIG. 5 is a graph illustrating a comparison of performance between a single feature-based classifier and a single-layer integration model according to an embodiment of the present invention;
FIG. 6 is a graph comparing the performance of the model of the present invention with other models of mRNA subcellular localization.
Detailed Description
Referring to FIGS. 1 to 3, the present invention provides a method for training an mRNA subcellular localization model, comprising the following steps: S1, acquiring an mRNA subcellular position sequence sample set;
S2, performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets;
S3, identifying the feature sets respectively according to a plurality of base classifiers, and performing at least one layer of integration on the base classifiers to obtain an integrated classifier;
S4, obtaining a target mRNA subcellular localization model according to the plurality of feature extraction algorithms and the integrated classifier.
Wherein step S1 obtains a sample set of mRNA subcellular location sequences.
Optionally, obtaining the sample set of mRNA subcellular position sequences comprises the following steps:
S11, obtaining mRNA subcellular position sequence data as positive data and negative data;
S12, performing data processing on the positive data and the negative data to obtain the mRNA subcellular position sequence sample set.
In an embodiment of the invention, the mRNA sequence dataset comprises positive data and negative data. The training dataset and independent dataset 1 are from the RNALocate database v2.0. 28,829 mRNA sequences, localized to single or multiple subcellular locations, were initially obtained. Considering localization accuracy and model design, the invention uses only single-location sequences. The sample counts for cytoplasm, endoplasmic reticulum, extracellular region, mitochondria, and nucleus were 6,964, 1,998, 1,131, 442, and 6,346, respectively.
In view of the homology bias due to mRNA redundancy, redundant sequences were removed with the NCBI BLASTCLUST program so that the retained sequences share less than 40% homology over more than 70% of their full length (BLASTCLUST options '-S 40' and '-L 0.7'). The resulting counts for cytoplasm, endoplasmic reticulum, extracellular region, mitochondria and nucleus are 6,376, 1,426, 855, 421 and 5,831, respectively; 1/6 of these data are randomly assigned to independent dataset 1 and the rest to the training dataset.
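As a concrete illustration of the split just described, a minimal sketch follows (the function name, variable names and random seed are ours, not the patent's):

```python
import numpy as np

def split_independent(samples, labels, frac=1/6, seed=42):
    """Randomly hold out `frac` of the samples as independent
    dataset 1 and keep the rest as the training dataset
    (a sketch of the 1/6 split described above)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_hold = int(len(samples) * frac)
    hold, train = idx[:n_hold], idx[n_hold:]
    return ([samples[i] for i in train], labels[train],
            [samples[i] for i in hold], labels[hold])
```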
Independent dataset 2 was derived from lncLocator and belongs to the lncRNA class. RNALocate contains 1,361 lncRNA samples, located in single or multiple subcellular compartments; the multi-location samples were removed. Considering the bias caused by redundant sequences, the identity threshold was set to 80%. Only cytoplasmic and nuclear sites were retained, so independent dataset 2 contains only cytoplasm and nucleus, with 292 and 146 samples, respectively.
In the embodiment of the invention, training is carried out on samples formed from the positive data and the negative data, so that identification by the trained target model is more accurate.
In step S2, feature extraction is performed on the mRNA subcellular position sequence sample set according to multiple feature extraction algorithms to obtain multiple feature sets.
Optionally, feature extraction is performed on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets, where the plurality of feature extraction algorithms includes any three or more of: an electron-ion interaction pseudo-trinucleotide (PseEIIP) algorithm, a trinucleotide composition (TNC) algorithm, a dinucleotide composition (DNC) algorithm, a composition of k-spaced nucleic acid pairs (CKSNAP) algorithm, a parallel-correlation pseudo-dinucleotide composition (PCPseDNC) algorithm, a parallel-correlation pseudo-trinucleotide composition (PCPseTNC) algorithm, a series-correlation pseudo-dinucleotide composition (SCPseDNC) algorithm, a series-correlation pseudo-trinucleotide composition (SCPseTNC) algorithm, and a dinucleotide-based auto-cross covariance (DACC) algorithm.
In an embodiment of the invention, the feature extraction algorithms comprise the pseudo electron-ion interaction potential of trinucleotides (PseEIIP), trinucleotide composition (TNC), dinucleotide composition (DNC), composition of k-spaced nucleic acid pairs (CKSNAP), parallel-correlation pseudo-dinucleotide composition (PCPseDNC), parallel-correlation pseudo-trinucleotide composition (PCPseTNC), series-correlation pseudo-dinucleotide composition (SCPseDNC), series-correlation pseudo-trinucleotide composition (SCPseTNC) and dinucleotide-based auto-cross covariance (DACC).
EIIP represents the electron-ion interaction potential of nucleotides A, G, C and T, with dimension equal to the sequence length. The average EIIP value of each trinucleotide in a sample is then used to construct the feature vector, giving PseEIIP. PseEIIP thus describes the average trinucleotide EIIP values as a 64-dimensional vector;
TNC represents the count of each trinucleotide over A, G, C and T as a 64-dimensional vector;
DNC represents the count of each dinucleotide over A, G, C and T as a 16-dimensional vector;
CKSNAP calculates the frequency of nucleic acid pairs separated by up to 5 arbitrary nucleotides, encoding a 96-dimensional vector;
PCPseDNC encodes the local and global sequence-order information of a nucleotide sequence and generates an 18-dimensional feature vector using 38 default physicochemical indices.
PCPseTNC normalizes the occurrence frequency of trinucleotides in the sequence and calculates their correlation along the sequence, yielding a 66-dimensional vector.
SCPseDNC develops a 28-dimensional feature vector describing 6 physicochemical properties.
SCPseTNC uses 2 indices to generate a 68-dimensional feature vector from the dinucleotide values at the corresponding positions.
DACC encoding describes a 72-dimensional feature vector by measuring the correlation of the same physicochemical index between two nucleotides separated by a given lag, the dimension being determined by the squared number of physicochemical indices and the maximum lag.
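For the two simplest encoders above, the k-mer counting can be sketched as follows (a minimal illustration assuming uppercase input; the function name is ours):

```python
from itertools import product

def kmer_composition(seq, k):
    """Frequency vector over all 4**k k-mers of the alphabet ACGT:
    k=2 gives the 16-dimensional DNC, k=3 the 64-dimensional TNC."""
    kmers = [''.join(p) for p in product('ACGT', repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        mer = seq[i:i + k]
        if mer in index:          # skip windows with ambiguous bases
            counts[index[mer]] += 1
    return [c / total for c in counts]

# Example: 64-dimensional TNC vector for one mRNA fragment
# (U is mapped to T so the RNA sequence fits the DNA alphabet)
tnc = kmer_composition("AUGGCUACGUAG".replace("U", "T"), k=3)
```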
In the embodiment of the invention, all nine algorithms are utilized, so that the feature extraction is more comprehensive and targeted.
In step S3, the feature sets are identified respectively according to the base classifiers, and at least one layer of integration is performed on the base classifiers to obtain the integrated classifier;
in step S4, the target mRNA subcellular localization model is obtained according to the plurality of feature extraction algorithms and the integrated classifier.
In the embodiment of the invention, after features of the training samples are extracted by the nine feature extraction algorithms above, the samples are fed into at least one layer of integrated base classifiers for repeated training, and score estimation is performed according to the margins between positive and negative samples. The optimal weight distribution of each layer can thereby be obtained, yielding a trained integrated classifier whose weights are set according to the target weight parameters, and hence the target mRNA subcellular localization model.
Optionally, step S3 includes the following steps:
S31, matching each feature set, according to the data characteristics of its subcellular positions and its time complexity, with a corresponding base classifier for identification;
S32, performing at least one layer of integration on the base classifiers to obtain a target weight parameter;
S33, obtaining an integrated classifier according to the base classifiers and the target weight parameter.
Optionally, the base classifier comprises a LightGBM algorithm.
Optionally, the feature sets are respectively identified according to a plurality of base classifiers, and two-layer integration is performed on the base classifiers to obtain target weight parameters of the integrated classifiers.
Optionally, step S3 includes the following steps:
S31, identifying the feature sets respectively according to the base classifiers to obtain prediction results;
S32, grouping the base classifiers according to their corresponding feature sets to obtain base classifier groups;
S33, obtaining a target weight parameter according to a grid search and an evaluation algorithm;
S34, obtaining a two-layer integrated classifier according to the base classifiers, the base classifier groups and the target weight parameter.
Optionally, the evaluation algorithm includes the ACC, Recall, Precision and F1-score metrics, computed as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
wherein Recall represents the probability that a true positive sample is predicted positive; Precision describes the proportion of true positives among all samples predicted positive; and F-score is the harmonic mean of precision and recall. When a sample's true label is positive, TP and FN denote the numbers of such samples predicted positive and negative, respectively; when a sample's true label is negative, TN and FP denote the numbers predicted negative and positive, respectively.
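The four formulas translate directly into code; a minimal sketch for the binary case follows:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """ACC, Recall, Precision and F1-score from the binary
    confusion counts defined above."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, recall, precision, f1
```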
The LightGBM algorithm is adopted in the embodiment of the invention in view of the data size, the training-efficiency requirements, and LightGBM's particular advantages in this setting. By reducing the number of data instances and performing feature selection, LightGBM's training efficiency is improved. Given that the number of training samples is 12,410 and that nine feature vectors (the PseEIIP, TNC, DNC, CKSNAP, PCPseDNC, PCPseTNC, SCPseDNC, SCPseTNC and DACC vectors described above) are used to train the single-feature classifiers, LightGBM is a natural choice. Furthermore, LightGBM shows stronger classification capability than other machine learning algorithms. Therefore, LightGBM is used for model training.
LightGBM has the advantage of supporting efficient parallelism, including feature parallelism and data parallelism. The main idea of feature parallelism is to find the optimal split points on different feature subsets and synchronize these split points between machines; data is kept locally to avoid communicating intermediate results. The task of histogram merging is divided among different machines, reducing communication and computation. Voting-based data parallelism further optimizes the communication cost of data parallelism, keeping communication at a nearly constant level; voting parallelism can achieve better speed-ups on large data samples.
A gradient-boosting decision tree (GBDT) is an ensemble model based on decision trees. In each iteration, the GBDT learns a new tree by fitting the negative gradient; finding the best split point is the most expensive and time-consuming step. LightGBM trains with one-side sampling: gradient-based one-side sampling (GOSS) retains large-gradient samples and randomly samples the small-gradient ones, and, to preserve the data distribution, the small-gradient samples are multiplied by a constant when computing the information gain. LightGBM also identifies mutually exclusive features with an optimal strategy called exclusive feature bundling (EFB) and combines them, thereby reducing the feature dimension. The behavior of the base classifiers generated at this stage is shown in FIG. 4.
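A minimal sketch of training one single-feature base classifier with the LightGBM scikit-learn interface is given below (the hyperparameter values are illustrative, not the patent's tuned settings):

```python
from lightgbm import LGBMClassifier

def train_base_classifier(X, y):
    """Fit one single-feature LightGBM base classifier. GOSS and
    EFB are internal to LightGBM; the hyperparameters below are
    illustrative defaults, not the patent's tuned values."""
    clf = LGBMClassifier(n_estimators=200, num_leaves=31,
                         learning_rate=0.1)
    clf.fit(X, y)
    return clf

# Usage (assuming `feature_sets` maps an encoder name such as 'TNC'
# to its feature matrix X, and `y` holds the location labels):
#   base_classifiers = {name: train_base_classifier(X, y)
#                       for name, X in feature_sets.items()}
```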
On this basis, base classifiers with high accuracy and generalization capability are screened out. The performance of the individual feature classifiers was evaluated on the independent datasets (FIG. 5). Overall, the single-feature models perform stably. Notably, PseEIIP performs best among the sequence-based feature models on independent dataset 1, with an accuracy of 0.601, a recall of 0.584, an ACC of 0.584, and an F-score of 0.569. Comparing sequence features with physicochemical-property features, the physicochemical-property classifiers outperform the sequence-feature classifiers: PCPseTNC is 0.003, 0.009, and 0.009 higher than PseEIIP in accuracy, recall, ACC, and F-score, while DNC is lower than PCPseDNC (accuracy by 0.013, recall by 0.017, ACC by 0.017, F-score by 0.016).
Ensemble learning is a machine learning method that accomplishes the learning task by constructing multiple machine learning classifiers, and it is widely applied to classification problems. Ensemble learning involves two difficulties: (1) generating the individual base classifiers; and (2) choosing a suitable strategy for combining them. Multiple preferred models can be obtained from a single feature or a concatenation of single features. The goal of ensemble learning is to combine several weakly supervised models into a model with strong supervision capability and high overall performance: even if one weak classifier makes a wrong prediction, the other weak classifiers can correct the error. Ensemble learning therefore improves the generalization capability of the trained model.
In this example, when the above base classifiers are combined into a one-layer integrated model, performance improves over the single-feature base classifiers, as shown in FIG. 5.
For independent dataset 1, the sequence-feature single-layer integration model (built by integrating the four single-feature classifiers obtained by combining the feature matrices extracted by PseEIIP, TNC, DNC and CKSNAP with LightGBM) has a precision of 0.617, a recall of 0.597, an ACC of 0.597 and an F-score of 0.577. The physicochemical-property single-layer integration model (built by integrating the single-feature classifiers obtained by combining the feature matrices extracted by PCPseDNC, PCPseTNC, SCPseDNC, SCPseTNC and DACC with LightGBM) has a precision of 0.608, a recall of 0.601, an ACC of 0.601 and an F-score of 0.577. Independent dataset 1 thus shows that single-layer integration performs better than the single-feature-based classifiers, and independent dataset 2 confirms this. On independent dataset 2, PCPseTNC is the better-performing single-feature classifier, with a precision of 0.595, a recall of 0.502, an ACC of 0.502 and an F-score of 0.544; there too, the single-layer integration models outperform the single-feature classifiers, consistent with the results on independent dataset 1. Interestingly, the best and worst single-feature classifiers are PseEIIP and DNC among the sequence-based features, and PCPseDNC and PCPseTNC among the physicochemical ones. The evidence suggests that the difference between independent datasets 1 and 2 is small for the single-feature models, which means the single-layer integrated models have relatively stable performance without significant overfitting (FIG. 5).
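A single-layer integration of this kind reduces to averaging the class-probability outputs of the base classifiers in a group; a minimal sketch follows (assuming scikit-learn-style classifiers such as the LightGBM models above):

```python
import numpy as np

def group_score(classifiers, feature_matrices):
    """Single-layer integration: average the predict_proba outputs
    of one group of base classifiers (e.g. the sequence-feature
    classifiers, or the physicochemical-property ones)."""
    probs = [clf.predict_proba(X)
             for clf, X in zip(classifiers, feature_matrices)]
    return np.mean(probs, axis=0)   # shape (n_samples, n_classes)
```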
Further, the single-layer integration models can be combined by weighted integration, with the weighting parameters determined by a grid search algorithm.
In the embodiment of the invention, a two-layer integration model can then be generated, giving different weights to the sequence features and the physicochemical features to achieve the desired effect. In the examples, the model was trained and tested using the training dataset and independent datasets 1 and 2, respectively. Single-feature base classifiers were trained with the nine feature encoding methods. Within each feature group, the prediction scores are averaged, representing the prediction contribution of each feature set in the first layer of the integrated model. Then, the feature groups at the higher level of the integrated model, namely the sequence-based feature group and the physicochemical-property-based group (the two base classifier groups, called feature groups because they are grouped by feature type), are given different weights. In this process, the weights of the sequence features and the physicochemical features are determined by grid search. The examples show the best performance when the ratio of the sequence-based feature group to the physicochemical-property-based feature group is 3:2.
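Under the assumption that the second layer is a weighted sum of the two group scores, the grid search over the weights can be sketched as follows (the step size and scoring by ACC are our assumptions, not stated in the patent):

```python
import numpy as np

def two_layer_predict(seq_score, phys_score, w_seq=0.6, w_phys=0.4):
    """Second integration layer: weighted sum of the two group
    scores; the 3:2 ratio reported above corresponds to 0.6/0.4."""
    return w_seq * seq_score + w_phys * phys_score

def grid_search_weights(seq_score, phys_score, y_true, step=0.1):
    """Grid search over the group weights, scored here by ACC."""
    best_w, best_acc = 0.5, -1.0
    for w in np.arange(0.0, 1.0 + 1e-9, step):
        pred = np.argmax(
            two_layer_predict(seq_score, phys_score, w, 1.0 - w),
            axis=1)
        acc = np.mean(pred == y_true)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```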
On independent test set 1, the sequence-based single-layer integration model outperformed the physicochemical-property-based single-layer integration model. With only single-layer integration models, therefore, one cannot determine in advance which model works better: in actual subcellular localization work, for an unknown subcellular position sequence, it cannot be determined which of the two single-layer integrated models will recognize the unknown sequence better. The inventors therefore considered constructing an integrated model of two or more layers with stronger overall capability. The performance of the two-layer and single-layer integration models on independent test set 1 confirms this modeling assumption. First, the recall, ACC and F-score of the two-layer integrated model are 0.004, 0.004 and 0.001 higher, respectively, than those of the best-performing single-layer integrated model, showing that the overall capability of the two-layer model is superior. In addition, in model performance evaluation, the smaller the gap between a model's precision and its recall, the better the model: the precision of the two-layer integration model is 0.016 higher than its recall, while the precision of the best-performing single-layer integration model is 0.02 higher than its recall, further illustrating the necessity of the two-layer integration model (FIG. 5).
To further illustrate the necessity of the two-layer integrated model design, the inventors introduced independent dataset 3. Independent test set 3 was derived from iLoc-mRNA and consists of brain mRNA subcellular localization data; again, only single-location samples were retained. To reduce data redundancy, CD-HIT-EST was run with the cutoff set to 80%. Finally, 131 samples were randomly drawn from the endoplasmic reticulum and nucleus classes to generate independent dataset 3. The performance on independent dataset 3 likewise demonstrates the need for a two-layer model. In this case, the physicochemical-property-based single-layer model performed better, with precision, recall, ACC and F-score of 0.938, 0.37, and 0.484; these values are 0.001, 0.016 and 0.022 lower, respectively, than those of the two-layer integration model. Here the precision of the two-layer model is 0.552 greater than its recall, while the precision of the single-layer integrated model is 0.572 greater than its recall. The embodiments thus show that the two-layer integration model is more effective than the single-layer integration model.
Compared with mRNALoc, the precision, recall, ACC and F-score of the two-layer integration model on independent dataset 1 were 0.617, 0.601, and 0.578. To assess the prediction accuracy and generalization capability of the two-layer integrated model, comparisons were made against mRNALoc; for fairness, the mRNALoc model was replicated. The results show that on independent dataset 1 the two-layer integrated model of the embodiment of the invention is superior to mRNALoc in precision, recall, ACC and F-score, being 0.07, 0.11 and 0.069 higher, respectively. On independent dataset 2, the precision and recall of mRNALoc vary significantly, indicating a bias in mRNALoc's predictive ability on new data, whereas the two-layer integration model is relatively stable across cross-validation and the independent datasets: its precision, recall, ACC and F-score were 0.052, 0.241, 0.24 and 0.191 higher than mRNALoc, respectively. From the results on independent datasets 1 and 2, the two-layer integrated model is significantly more versatile than the mRNALoc model. The embodiment of the invention was also compared against other existing tools. The comparison experiments show that the two-layer integrated model has accurate prediction capability and strong generalization capability, which is very important for a multi-class model (FIG. 6). The results show that the two-layer integrated model performed best on independent test set 1 compared with the other two models, being significantly higher than iLoc-mRNA and RNATracker, and its good performance was also demonstrated on the independent datasets. Taken together, these results indicate that the two-layer integration model is superior to current state-of-the-art approaches for subcellular localization (FIG. 6).
According to an embodiment of the present invention, the present invention also provides a method for subcellular localization of mRNA, comprising the steps of:
obtaining mRNA subcellular position sequence samples;
and identifying the mRNA subcellular position sequence sample by using the target mRNA subcellular localization model to obtain a localization prediction result.
In this example, the trained target model provided by the invention can be used to predict mRNA subcellular localization. One concrete embodiment comprises the following steps, with a sketch given after the list:
1) acquiring the sample set of mRNA subcellular position sequences to be identified;
2) performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms (for example, any combination, or all, of the nine methods above) to obtain a plurality of feature sets;
3) identifying the plurality of feature sets according to the integrated classifier to obtain the mRNA subcellular localization of the identified sequence.
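Putting the pieces together, a hedged end-to-end sketch of steps 1) to 3) might look like the following (all names are illustrative; `group_score` is the averaging helper sketched earlier, and the location list follows the five classes in the training data):

```python
import numpy as np

LOCATIONS = ['cytoplasm', 'endoplasmic reticulum',
             'extracellular region', 'mitochondria', 'nucleus']

def predict_localization(seq, encoders, seq_group, phys_group,
                         w_seq=0.6, w_phys=0.4):
    """End-to-end sketch: encode the sequence with each feature
    algorithm, score it with the two classifier groups, combine
    the group scores with the learned weights, and return the
    predicted subcellular location."""
    seq_feats = [np.array(enc(seq)).reshape(1, -1)
                 for enc in encoders['sequence']]
    phys_feats = [np.array(enc(seq)).reshape(1, -1)
                  for enc in encoders['physicochemical']]
    score = (w_seq * group_score(seq_group, seq_feats)
             + w_phys * group_score(phys_group, phys_feats))
    return LOCATIONS[int(np.argmax(score))]
```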
According to an embodiment of the invention, the invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the method steps described above when executed by a processor.
That is, the computer readable storage medium may run the above-described mRNA subcellular localization method or the training method of the mRNA subcellular localization model.
In describing the steps of the invention in the claims and the specification, the designations S1, S2, S3, S4, one, two, three, 1, 2, 3, 4, 5 do not denote an absolute chronological or sequential order, nor an absolute logical division between steps; those skilled in the art may reasonably adjust the order and division of the steps provided the purpose of the invention can still be achieved.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and that the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed via the processor create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the specified functions. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the specified functions.

While preferred embodiments of the present invention have been described, additional variations and modifications may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A training method of an mRNA subcellular localization model, characterized by comprising the following steps:
S1, acquiring an mRNA subcellular position sequence sample set;
S2, performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets;
S3, identifying the feature sets respectively according to a plurality of base classifiers, and performing at least one layer of integration on the base classifiers to obtain an integrated classifier;
S4, obtaining a target mRNA subcellular localization model according to the plurality of feature extraction algorithms and the integrated classifier.
2. The method for training the mRNA subcellular localization model according to claim 1, wherein the step S1 comprises the following steps:
S11, obtaining mRNA subcellular position sequence data as positive data, and a non-mRNA subcellular position sequence file as negative data;
S12, performing data processing on the positive data and the negative data to obtain the mRNA subcellular position sequence sample set.
3. The method for training the mRNA subcellular localization model according to claim 1, wherein in step S2 the plurality of feature extraction algorithms includes any three or more of: an electron-ion interaction pseudo-trinucleotide (PseEIIP) algorithm, a trinucleotide composition (TNC) algorithm, a dinucleotide composition (DNC) algorithm, a composition of k-spaced nucleic acid pairs (CKSNAP) algorithm, a parallel-correlation pseudo-dinucleotide composition (PCPseDNC) algorithm, a parallel-correlation pseudo-trinucleotide composition (PCPseTNC) algorithm, a series-correlation pseudo-dinucleotide composition (SCPseDNC) algorithm, a series-correlation pseudo-trinucleotide composition (SCPseTNC) algorithm, and a dinucleotide-based auto-cross covariance (DACC) algorithm.
4. The method for training the mRNA subcellular localization model according to claim 1, wherein the step S3 comprises the following steps:
S31, matching each feature set, according to the data characteristics of its subcellular positions and its time complexity, with a corresponding base classifier for identification;
S32, performing at least one layer of integration on the base classifiers to obtain a target weight parameter;
S33, obtaining an integrated classifier according to the base classifiers and the target weight parameter.
5. The method of claim 3, wherein the base classifier comprises a LightGBM algorithm.
6. The method for training the mRNA subcellular localization model according to any one of claims 1 to 4, wherein step S3 is: identifying the feature sets respectively according to the base classifiers, and performing two-layer integration on the base classifiers to obtain a two-layer integrated classifier.
7. The method for training the mRNA subcellular localization model according to claim 6, wherein the step S3 includes the following steps:
S31, identifying the feature sets respectively according to the base classifiers to obtain prediction results;
S32, grouping the base classifiers according to their corresponding feature sets to obtain base classifier groups;
S33, obtaining a target weight parameter according to a grid search and an evaluation algorithm;
S34, obtaining a two-layer integrated classifier according to the base classifiers, the base classifier groups and the target weight parameter.
8. The method for training the mRNA subcellular localization model according to claim 7, wherein the evaluation algorithm includes the ACC, Recall, Precision and F1-score metrics, computed as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
When a sample's true label is positive, TP and FN denote the numbers of such samples predicted positive and negative, respectively; when a sample's true label is negative, TN and FP denote the numbers predicted negative and positive, respectively.
9. A method for subcellular localization of mRNA comprising the steps of:
obtaining mRNA subcellular position sequence samples;
the mRNA subcellular localization model trained by the method of any one of claims 1-8 is used to identify the mRNA subcellular position sequence samples to obtain a localization prediction result.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 9.
CN202111138369.5A 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium Active CN113837293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111138369.5A CN113837293B (en) 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111138369.5A CN113837293B (en) 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium

Publications (2)

Publication Number Publication Date
CN113837293A true CN113837293A (en) 2021-12-24
CN113837293B CN113837293B (en) 2024-08-27

Family

ID=78970691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138369.5A Active CN113837293B (en) 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium

Country Status (1)

Country Link
CN (1) CN113837293B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953973A (en) * 2024-03-21 2024-04-30 电子科技大学长三角研究院(衢州) Specific biological sequence prediction method and system based on sequence homology

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201607521D0 (en) * 2016-04-29 2016-06-15 OncoImmunity AS Method
CN109390037A (en) * 2018-10-08 2019-02-26 齐齐哈尔大学 The full site recognition methods of mature miRNA based on SVM-AdaBoost
CN109872773A (en) * 2019-02-26 2019-06-11 哈尔滨工业大学 Mirco-RNA precursor recognition methods based on the fusion of Adaboost, BP neural network and random forest
CN109920477A (en) * 2019-02-26 2019-06-21 哈尔滨工业大学 The several species Pre-microRNA recognition methods merged based on Adaboost with BP neural network
CN110379464A (en) * 2019-07-29 2019-10-25 桂林电子科技大学 The prediction technique of DNA transcription terminator in a kind of bacterium
CN110415765A (en) * 2019-07-29 2019-11-05 桂林电子科技大学 A kind of prediction technique of long-chain non-coding RNA subcellular localization
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site
CN111816255A (en) * 2020-07-09 2020-10-23 江南大学 RNA-binding protein recognition by fusing multi-view and optimal multi-tag chain learning
CN112508116A (en) * 2020-12-15 2021-03-16 吉林大学 Classifier generation method and device, storage medium and electronic equipment
KR20210050362A (en) * 2019-10-28 2021-05-07 주식회사 모비스 Ensemble pruning method, ensemble model generation method for identifying programmable nucleases and apparatus for the same
CN112927217A (en) * 2021-03-23 2021-06-08 内蒙古大学 Thyroid nodule invasiveness prediction method based on target detection
CN113177927A (en) * 2021-05-17 2021-07-27 西安交通大学 Bone marrow cell classification and identification method and system based on multiple features and multiple classifiers

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201607521D0 (en) * 2016-04-29 2016-06-15 OncoImmunity AS Method
CN109390037A (en) * 2018-10-08 2019-02-26 齐齐哈尔大学 The full site recognition methods of mature miRNA based on SVM-AdaBoost
CN109872773A (en) * 2019-02-26 2019-06-11 哈尔滨工业大学 Mirco-RNA precursor recognition methods based on the fusion of Adaboost, BP neural network and random forest
CN109920477A (en) * 2019-02-26 2019-06-21 哈尔滨工业大学 The several species Pre-microRNA recognition methods merged based on Adaboost with BP neural network
CN110379464A (en) * 2019-07-29 2019-10-25 桂林电子科技大学 The prediction technique of DNA transcription terminator in a kind of bacterium
CN110415765A (en) * 2019-07-29 2019-11-05 桂林电子科技大学 A kind of prediction technique of long-chain non-coding RNA subcellular localization
KR20210050362A (en) * 2019-10-28 2021-05-07 주식회사 모비스 Ensemble pruning method, ensemble model generation method for identifying programmable nucleases and apparatus for the same
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site
CN111816255A (en) * 2020-07-09 2020-10-23 江南大学 RNA-binding protein recognition by fusing multi-view and optimal multi-tag chain learning
CN112508116A (en) * 2020-12-15 2021-03-16 吉林大学 Classifier generation method and device, storage medium and electronic equipment
CN112927217A (en) * 2021-03-23 2021-06-08 内蒙古大学 Thyroid nodule invasiveness prediction method based on target detection
CN113177927A (en) * 2021-05-17 2021-07-27 西安交通大学 Bone marrow cell classification and identification method and system based on multiple features and multiple classifiers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DUOLIN WANG, ZHAOYUE ZHANG, YUEXU JIANG, ZITING MAO, DONG WANG, HAO LIN, DONG XU: "DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism", NUCLEIC ACIDS RESEARCH, 27 January 2021 (2021-01-27) *
DAPENG LI, YING JU, ZHIJUN LIAO, QUAN ZOU: "A review of tumor-related computational microRNA-omics" (与肿瘤相关的计算microRNA组学研究综述), Journal of Bioinformatics (《生物信息学》), 31 December 2015 (2015-12-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953973A (en) * 2024-03-21 2024-04-30 电子科技大学长三角研究院(衢州) Specific biological sequence prediction method and system based on sequence homology

Also Published As

Publication number Publication date
CN113837293B (en) 2024-08-27

Similar Documents

Publication Publication Date Title
Aliniya et al. A novel combinatorial merge-split approach for automatic clustering using imperialist competitive algorithm
Srinivas et al. A hybrid CNN-KNN model for MRI brain tumor classification
CN112035620B (en) Question-answer management method, device, equipment and storage medium of medical query system
CN107291895B (en) Quick hierarchical document query method
CN112489723B (en) DNA binding protein prediction method based on local evolution information
CN114093422B (en) Prediction method and system for interaction between miRNA and gene based on multiple relationship graph rolling network
CN113257357B (en) Protein residue contact map prediction method
Fan et al. lncLocPred: predicting LncRNA subcellular localization using multiple sequence feature information
Babu et al. A comparative study of gene selection methods for cancer classification using microarray data
Beltran et al. Predicting protein-protein interactions based on biological information using extreme gradient boosting
Ye et al. Adaptive unsupervised feature learning for gene signature identification in non-small-cell lung cancer
CN113837293B (en) MRNA subcellular localization model training method, positioning method and readable storage medium
CN111048145B (en) Method, apparatus, device and storage medium for generating protein prediction model
CN116705192A (en) Drug virtual screening method and device based on deep learning
Yang et al. PseKNC and Adaboost-based method for DNA-binding proteins recognition
Elshazly et al. Lymph diseases diagnosis approach based on support vector machines with different kernel functions
CN115881211B (en) Protein sequence alignment method, protein sequence alignment device, computer equipment and storage medium
Wei et al. Feature distribution fitting with direction-driven weighting for few-shot images classification
CN113724779A (en) SNAREs protein identification method, system, storage medium and equipment based on machine learning technology
Navamajiti et al. McBel-Plnc: a deep learning model for multiclass multilabel classification of protein-lncRNA interactions
McClannahan et al. Classification of Long Noncoding RNA Elements Using Deep Convolutional Neural Networks and Siamese Networks
CN111383710A (en) Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN116910660B (en) Self-step semi-supervised integrated classifier training method and system for unbalanced data
Arango-Argoty et al. An adaptation of Pfam profiles to predict protein sub-cellular localization in Gram positive bacteria
CN113838520B (en) III type secretion system effector protein identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant