CN113837293A - mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium - Google Patents

mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium

Info

Publication number
CN113837293A
CN113837293A (application CN202111138369.5A)
Authority
CN
China
Prior art keywords
mrna
subcellular
algorithm
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111138369.5A
Other languages
Chinese (zh)
Other versions
CN113837293B (en)
Inventor
Quan Zou (邹权)
Jing Li (李静)
Junping Du (杜军平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111138369.5A priority Critical patent/CN113837293B/en
Publication of CN113837293A publication Critical patent/CN113837293A/en
Application granted granted Critical
Publication of CN113837293B publication Critical patent/CN113837293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a training method for an mRNA subcellular localization model, comprising the following steps: acquiring a sample set of mRNA subcellular position sequences; performing feature extraction on the sample set according to a plurality of feature extraction algorithms; identifying the resulting features with respective base classifiers; integrating the base classifiers over one or more layers; and obtaining a target mRNA subcellular localization model from the feature extraction algorithms and the integrated classifier. Through ensemble learning and training over multiple classifiers, the invention improves training efficiency, allows the model to reach a global optimum more easily during training, and gives the trained target model better prediction and generalization capability.

Description

mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to an mRNA subcellular localization model training method, an mRNA subcellular localization model localization method and a readable storage medium.
Background
Currently, subcellular localization of RNA is considered an important mechanism of cellular polarization during the development of unicellular organisms, animal and plant tissues, and animal embryos. Localization of mRNA transcripts has been shown to spatially restrict gene expression and protein translation. Approximately 80% of transcripts are distributed asymmetrically in human cells, and mislocalization of transcripts can lead to diseases such as spinal muscular atrophy, Alzheimer's disease, and cancer. In recent years, a large number of machine-learning-based subcellular localization algorithms have been developed. mRNA localization corresponds to the localization of protein translation and thus contributes to the study of protein function. However, current studies on eukaryotic mRNA subcellular localization show significant limitations: they are often based on the extraction of a single kind of sequence information and are deficient in prediction and generalization capability. To better achieve subcellular localization of eukaryotic mRNA, models with better performance and more comprehensive functionality must be established and trained.
Disclosure of Invention
Addressing the prior art's reliance on single-sequence feature extraction and its insufficient prediction and generalization capability, the invention provides a training method for an mRNA subcellular localization model, a localization method, and a readable storage medium.
According to an embodiment of the present invention, the present invention provides a training method of an mRNA subcellular localization model, comprising the following steps:
S1, acquiring an mRNA subcellular position sequence sample set;
S2, performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets;
S3, identifying the feature sets respectively according to a plurality of base classifiers, and performing at least one layer of integration on the base classifiers to obtain an integrated classifier;
S4, obtaining a target mRNA subcellular localization model according to the plurality of feature extraction algorithms and the integrated classifier.
Optionally, step S1 includes the following steps:
S11, obtaining mRNA subcellular position sequence data as positive data and negative data;
S12, performing data processing on the positive data and the negative data to obtain the mRNA subcellular position sequence sample set.
Optionally, in step S2, the plurality of feature extraction algorithms includes any three or more of: an electron-ion interaction pseudo-trinucleotide (PseEIIP) algorithm, a trinucleotide composition (TNC) algorithm, a dinucleotide composition (DNC) algorithm, a composition of k-spaced nucleic acid pairs (CKSNAP) algorithm, a parallel-correlation pseudo-dinucleotide composition (PCPseDNC) algorithm, a parallel-correlation pseudo-trinucleotide composition (PCPseTNC) algorithm, a series-correlation pseudo-dinucleotide composition (SCPseDNC) algorithm, a series-correlation pseudo-trinucleotide composition (SCPseTNC) algorithm, and a dinucleotide-based auto-cross covariance (DACC) algorithm.
Optionally, step S3 includes the following steps:
S31, matching each feature set, according to the data characteristics of its subcellular positions and its time complexity, with a corresponding base classifier for identification;
S32, performing at least one layer of integration on the base classifiers to obtain a target weight parameter;
S33, obtaining an integrated classifier according to the base classifiers and the target weight parameter.
Optionally, the base classifier comprises a LightGBM algorithm.
Optionally, step S3 is: identifying the feature sets respectively according to the base classifiers, and performing two-layer integration on the base classifiers to obtain the target weight parameters of the integrated classifier.
Optionally, step S3 includes the following steps:
S31, identifying the feature sets respectively according to the base classifiers to obtain prediction results;
S32, grouping the base classifiers according to their corresponding feature sets to obtain base classifier groups;
S33, obtaining a target weight parameter according to a grid search and an evaluation algorithm;
S34, obtaining a two-layer integrated classifier according to the base classifiers, the base classifier groups and the target weight parameter.
Optionally, the evaluation algorithm includes the ACC, Recall, Precision and F1-score metrics, computed as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
When a sample's true label is positive, TP and FN denote the numbers of such samples predicted positive and negative, respectively; when a sample's true label is negative, TN and FP denote the numbers predicted negative and positive, respectively.
According to an embodiment of the present invention, the present invention also provides a method for subcellular localization of mRNA, comprising the steps of:
obtaining mRNA subcellular position sequence samples;
and identifying the mRNA subcellular position sequence sample by using the target mRNA subcellular localization model to obtain a localization prediction result.
According to an embodiment of the invention, the invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the method steps described above when executed by a processor.
The invention has the beneficial effects that:
the method comprises the steps of extracting features of an mRNA subcellular position sequence sample set through various feature extraction algorithms, identifying a plurality of feature sets by utilizing a plurality of base classifiers respectively, integrating at least one layer of the base classifiers, training and learning a two-layer model through the mRNA subcellular position sequence sample set, and training the structure of the two layers. After the target model is used for identifying the mRNA subcellular position sequence sample needing to be predicted, the positioning prediction result can be more accurate.
Drawings
FIG. 1 is a flowchart of a method for training a model for mRNA subcellular localization according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a model for mRNA subcellular localization according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for training a model for mRNA subcellular localization according to an embodiment of the present invention;
FIG. 4 is a graph comparing the performance of a single feature-based classifier, a single-layer integration model, and a two-layer integration model according to an embodiment of the present invention;
FIG. 5 is a graph illustrating a comparison of performance between a single feature-based classifier and a single-layer integration model according to an embodiment of the present invention;
FIG. 6 is a graph comparing the performance of the model of the present invention with other models of mRNA subcellular localization.
Detailed Description
Referring to FIGS. 1 to 3, the present invention provides a method for training an mRNA subcellular localization model, comprising the following steps: S1, acquiring an mRNA subcellular position sequence sample set;
S2, performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets;
S3, identifying the feature sets respectively according to a plurality of base classifiers, and performing at least one layer of integration on the base classifiers to obtain an integrated classifier;
S4, obtaining a target mRNA subcellular localization model according to the plurality of feature extraction algorithms and the integrated classifier.
Wherein step S1 obtains a sample set of mRNA subcellular location sequences.
Optionally, obtaining the sample set of mRNA subcellular position sequences comprises the following steps:
S11, obtaining mRNA subcellular position sequence data as positive data and negative data;
S12, performing data processing on the positive data and the negative data to obtain the mRNA subcellular position sequence sample set.
In an embodiment of the invention, the mRNA sequence dataset comprises positive data and negative data. The training dataset and independent dataset 1 are from the RNALocate database v2.0. 28,829 mRNA sequences, localized to single or multiple subcellular locations, were initially obtained. Considering localization accuracy and model design, the invention uses only single-location sequences. The sample counts for cytoplasm, endoplasmic reticulum, extracellular region, mitochondria, and nucleus were 6,964, 1,998, 1,131, 442, and 6,346, respectively.
In view of the homology bias due to mRNA redundancy, redundant sequences were removed with the NCBI BLASTCLUST program so that the retained sequences share less than 40% homology over more than 70% of their full length (BLASTCLUST options '-S 40' and '-L 0.7'). The resulting counts for cytoplasm, endoplasmic reticulum, extracellular region, mitochondria and nucleus are 6,376, 1,426, 855, 421 and 5,831, respectively; 1/6 of these data are randomly assigned to independent dataset 1 and the rest to the training dataset.
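As a concrete illustration of the split just described, a minimal sketch follows (the function name, variable names and random seed are ours, not the patent's):

```python
import numpy as np

def split_independent(samples, labels, frac=1/6, seed=42):
    """Randomly hold out `frac` of the samples as independent
    dataset 1 and keep the rest as the training dataset
    (a sketch of the 1/6 split described above)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_hold = int(len(samples) * frac)
    hold, train = idx[:n_hold], idx[n_hold:]
    return ([samples[i] for i in train], labels[train],
            [samples[i] for i in hold], labels[hold])
```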
Independent dataset 2 was derived from lncLocator and belongs to the lncRNA class. RNALocate contains 1,361 lncRNA samples, located in single or multiple subcellular compartments; the multi-location samples were removed. Considering the bias caused by redundant sequences, the identity threshold was set to 80%. Only cytoplasmic and nuclear sites were retained, so independent dataset 2 contains only cytoplasm and nucleus, with 292 and 146 samples, respectively.
In the embodiment of the invention, training is carried out on samples formed from the positive data and the negative data, so that identification by the trained target model is more accurate.
In step S2, feature extraction is performed on the mRNA subcellular position sequence sample set according to multiple feature extraction algorithms to obtain multiple feature sets.
Optionally, feature extraction is performed on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets, where the plurality of feature extraction algorithms includes any three or more of: an electron-ion interaction pseudo-trinucleotide (PseEIIP) algorithm, a trinucleotide composition (TNC) algorithm, a dinucleotide composition (DNC) algorithm, a composition of k-spaced nucleic acid pairs (CKSNAP) algorithm, a parallel-correlation pseudo-dinucleotide composition (PCPseDNC) algorithm, a parallel-correlation pseudo-trinucleotide composition (PCPseTNC) algorithm, a series-correlation pseudo-dinucleotide composition (SCPseDNC) algorithm, a series-correlation pseudo-trinucleotide composition (SCPseTNC) algorithm, and a dinucleotide-based auto-cross covariance (DACC) algorithm.
In an embodiment of the invention, the feature extraction algorithms comprise the pseudo electron-ion interaction potential of trinucleotides (PseEIIP), trinucleotide composition (TNC), dinucleotide composition (DNC), composition of k-spaced nucleic acid pairs (CKSNAP), parallel-correlation pseudo-dinucleotide composition (PCPseDNC), parallel-correlation pseudo-trinucleotide composition (PCPseTNC), series-correlation pseudo-dinucleotide composition (SCPseDNC), series-correlation pseudo-trinucleotide composition (SCPseTNC) and dinucleotide-based auto-cross covariance (DACC).
EIIP represents the electron-ion interaction potential of nucleotides A, G, C and T, with dimension equal to the sequence length. The average EIIP value of each trinucleotide in a sample is then used to construct the feature vector, giving PseEIIP. PseEIIP thus describes the average trinucleotide EIIP values as a 64-dimensional vector;
TNC represents the count of each trinucleotide over A, G, C and T as a 64-dimensional vector;
DNC represents the count of each dinucleotide over A, G, C and T as a 16-dimensional vector;
CKSNAP calculates the frequency of nucleic acid pairs separated by up to 5 arbitrary nucleotides, encoding a 96-dimensional vector;
PCPseDNC encodes the local and global sequence-order information of a nucleotide sequence and generates an 18-dimensional feature vector using 38 default physicochemical indices.
PCPseTNC normalizes the occurrence frequency of trinucleotides in the sequence and calculates their correlation along the sequence, yielding a 66-dimensional vector.
SCPseDNC develops a 28-dimensional feature vector describing 6 physicochemical properties.
SCPseTNC uses 2 indices to generate a 68-dimensional feature vector from the dinucleotide values at the corresponding positions.
DACC encoding describes a 72-dimensional feature vector by measuring the correlation of the same physicochemical index between two nucleotides separated by a given lag, the dimension being determined by the squared number of physicochemical indices and the maximum lag.
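For the two simplest encoders above, the k-mer counting can be sketched as follows (a minimal illustration assuming uppercase input; the function name is ours):

```python
from itertools import product

def kmer_composition(seq, k):
    """Frequency vector over all 4**k k-mers of the alphabet ACGT:
    k=2 gives the 16-dimensional DNC, k=3 the 64-dimensional TNC."""
    kmers = [''.join(p) for p in product('ACGT', repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        mer = seq[i:i + k]
        if mer in index:          # skip windows with ambiguous bases
            counts[index[mer]] += 1
    return [c / total for c in counts]

# Example: 64-dimensional TNC vector for one mRNA fragment
# (U is mapped to T so the RNA sequence fits the DNA alphabet)
tnc = kmer_composition("AUGGCUACGUAG".replace("U", "T"), k=3)
```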
In the embodiment of the invention, all nine algorithms are utilized, so that the feature extraction is more comprehensive and targeted.
In step S3, the feature sets are identified respectively according to the base classifiers, and at least one layer of integration is performed on the base classifiers to obtain the integrated classifier;
in step S4, the target mRNA subcellular localization model is obtained according to the plurality of feature extraction algorithms and the integrated classifier.
In the embodiment of the invention, after features of the training samples are extracted by the nine feature extraction algorithms above, the samples are fed into at least one layer of integrated base classifiers for repeated training, and score estimation is performed according to the margins between positive and negative samples. The optimal weight distribution of each layer can thereby be obtained, yielding a trained integrated classifier whose weights are set according to the target weight parameters, and hence the target mRNA subcellular localization model.
Optionally, step S3 includes the following steps:
S31, matching each feature set, according to the data characteristics of its subcellular positions and its time complexity, with a corresponding base classifier for identification;
S32, performing at least one layer of integration on the base classifiers to obtain a target weight parameter;
S33, obtaining an integrated classifier according to the base classifiers and the target weight parameter.
Optionally, the base classifier comprises a LightGBM algorithm.
Optionally, the feature sets are respectively identified according to a plurality of base classifiers, and two-layer integration is performed on the base classifiers to obtain target weight parameters of the integrated classifiers.
Optionally, step S3 includes the following steps:
S31, identifying the feature sets respectively according to the base classifiers to obtain prediction results;
S32, grouping the base classifiers according to their corresponding feature sets to obtain base classifier groups;
S33, obtaining a target weight parameter according to a grid search and an evaluation algorithm;
S34, obtaining a two-layer integrated classifier according to the base classifiers, the base classifier groups and the target weight parameter.
Optionally, the evaluation algorithm includes the ACC, Recall, Precision and F1-score metrics, computed as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
wherein Recall represents the probability that a true positive sample is predicted positive; Precision describes the proportion of true positives among all samples predicted positive; and F-score is the harmonic mean of precision and recall. When a sample's true label is positive, TP and FN denote the numbers of such samples predicted positive and negative, respectively; when a sample's true label is negative, TN and FP denote the numbers predicted negative and positive, respectively.
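The four formulas translate directly into code; a minimal sketch for the binary case follows:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """ACC, Recall, Precision and F1-score from the binary
    confusion counts defined above."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, recall, precision, f1
```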
The LightGBM algorithm is adopted in the embodiment of the invention in view of the data size, the training-efficiency requirements, and LightGBM's particular advantages in this setting. By reducing the number of data instances and performing feature selection, LightGBM's training efficiency is improved. Given that the number of training samples is 12,410 and that nine feature vectors (the PseEIIP, TNC, DNC, CKSNAP, PCPseDNC, PCPseTNC, SCPseDNC, SCPseTNC and DACC vectors described above) are used to train the single-feature classifiers, LightGBM is a natural choice. Furthermore, LightGBM shows stronger classification capability than other machine learning algorithms. Therefore, LightGBM is used for model training.
LightGBM has the advantage of supporting efficient parallelism, including feature parallelism and data parallelism. The main idea of feature parallelism is to find the optimal split points on different feature subsets and synchronize these split points between machines; data is kept locally to avoid communicating intermediate results. The task of histogram merging is divided among different machines, reducing communication and computation. Voting-based data parallelism further optimizes the communication cost of data parallelism, keeping communication at a nearly constant level; voting parallelism can achieve better speed-ups on large data samples.
A gradient-boosting decision tree (GBDT) is an ensemble model based on decision trees. In each iteration, the GBDT learns a new tree by fitting the negative gradient; finding the best split point is the most expensive and time-consuming step. LightGBM trains with one-side sampling: gradient-based one-side sampling (GOSS) retains large-gradient samples and randomly samples the small-gradient ones, and, to preserve the data distribution, the small-gradient samples are multiplied by a constant when computing the information gain. LightGBM also identifies mutually exclusive features with an optimal strategy called exclusive feature bundling (EFB) and combines them, thereby reducing the feature dimension. The behavior of the base classifiers generated at this stage is shown in FIG. 4.
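A minimal sketch of training one single-feature base classifier with the LightGBM scikit-learn interface is given below (the hyperparameter values are illustrative, not the patent's tuned settings):

```python
from lightgbm import LGBMClassifier

def train_base_classifier(X, y):
    """Fit one single-feature LightGBM base classifier. GOSS and
    EFB are internal to LightGBM; the hyperparameters below are
    illustrative defaults, not the patent's tuned values."""
    clf = LGBMClassifier(n_estimators=200, num_leaves=31,
                         learning_rate=0.1)
    clf.fit(X, y)
    return clf

# Usage (assuming `feature_sets` maps an encoder name such as 'TNC'
# to its feature matrix X, and `y` holds the location labels):
#   base_classifiers = {name: train_base_classifier(X, y)
#                       for name, X in feature_sets.items()}
```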
On this basis, base classifiers with high accuracy and generalization capability are screened out. The performance of the individual feature classifiers was evaluated on the independent datasets (FIG. 5). Overall, the single-feature models perform stably. Notably, PseEIIP performs best among the sequence-based feature models on independent dataset 1, with an accuracy of 0.601, a recall of 0.584, an ACC of 0.584, and an F-score of 0.569. Comparing sequence features with physicochemical-property features, the physicochemical-property classifiers outperform the sequence-feature classifiers: PCPseTNC is 0.003, 0.009, and 0.009 higher than PseEIIP in accuracy, recall, ACC, and F-score, while DNC is lower than PCPseDNC (accuracy by 0.013, recall by 0.017, ACC by 0.017, F-score by 0.016).
Ensemble learning is a machine learning method that accomplishes the learning task by constructing multiple machine learning classifiers, and it is widely applied to classification problems. Ensemble learning involves two difficulties: (1) generating the individual base classifiers; and (2) choosing a suitable strategy for combining them. Multiple preferred models can be obtained from a single feature or a concatenation of single features. The goal of ensemble learning is to combine several weakly supervised models into a model with strong supervision capability and high overall performance: even if one weak classifier makes a wrong prediction, the other weak classifiers can correct the error. Ensemble learning therefore improves the generalization capability of the trained model.
In this example, when the above base classifiers are combined into a one-layer integrated model, performance improves over the single-feature base classifiers, as shown in FIG. 5.
For independent dataset 1, the sequence-feature single-layer integration model (built by integrating the four single-feature classifiers obtained by combining the feature matrices extracted by PseEIIP, TNC, DNC and CKSNAP with LightGBM) has a precision of 0.617, a recall of 0.597, an ACC of 0.597 and an F-score of 0.577. The physicochemical-property single-layer integration model (built by integrating the single-feature classifiers obtained by combining the feature matrices extracted by PCPseDNC, PCPseTNC, SCPseDNC, SCPseTNC and DACC with LightGBM) has a precision of 0.608, a recall of 0.601, an ACC of 0.601 and an F-score of 0.577. Independent dataset 1 thus shows that single-layer integration performs better than the single-feature-based classifiers, and independent dataset 2 confirms this. On independent dataset 2, PCPseTNC is the better-performing single-feature classifier, with a precision of 0.595, a recall of 0.502, an ACC of 0.502 and an F-score of 0.544; there too, the single-layer integration models outperform the single-feature classifiers, consistent with the results on independent dataset 1. Interestingly, the best and worst single-feature classifiers are PseEIIP and DNC among the sequence-based features, and PCPseDNC and PCPseTNC among the physicochemical ones. The evidence suggests that the difference between independent datasets 1 and 2 is small for the single-feature models, which means the single-layer integrated models have relatively stable performance without significant overfitting (FIG. 5).
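A single-layer integration of this kind reduces to averaging the class-probability outputs of the base classifiers in a group; a minimal sketch follows (assuming scikit-learn-style classifiers such as the LightGBM models above):

```python
import numpy as np

def group_score(classifiers, feature_matrices):
    """Single-layer integration: average the predict_proba outputs
    of one group of base classifiers (e.g. the sequence-feature
    classifiers, or the physicochemical-property ones)."""
    probs = [clf.predict_proba(X)
             for clf, X in zip(classifiers, feature_matrices)]
    return np.mean(probs, axis=0)   # shape (n_samples, n_classes)
```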
Further, the single-layer integration models can be combined by weighted integration, with the weighting parameters determined by a grid search algorithm.
In the embodiment of the invention, a two-layer integration model can then be generated, giving different weights to the sequence features and the physicochemical features to achieve the desired effect. In the examples, the model was trained and tested using the training dataset and independent datasets 1 and 2, respectively. Single-feature base classifiers were trained with the nine feature encoding methods. Within each feature group, the prediction scores are averaged, representing the prediction contribution of each feature set in the first layer of the integrated model. Then, the feature groups at the higher level of the integrated model, namely the sequence-based feature group and the physicochemical-property-based group (the two base classifier groups, called feature groups because they are grouped by feature type), are given different weights. In this process, the weights of the sequence features and the physicochemical features are determined by grid search. The examples show the best performance when the ratio of the sequence-based feature group to the physicochemical-property-based feature group is 3:2.
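Under the assumption that the second layer is a weighted sum of the two group scores, the grid search over the weights can be sketched as follows (the step size and scoring by ACC are our assumptions, not stated in the patent):

```python
import numpy as np

def two_layer_predict(seq_score, phys_score, w_seq=0.6, w_phys=0.4):
    """Second integration layer: weighted sum of the two group
    scores; the 3:2 ratio reported above corresponds to 0.6/0.4."""
    return w_seq * seq_score + w_phys * phys_score

def grid_search_weights(seq_score, phys_score, y_true, step=0.1):
    """Grid search over the group weights, scored here by ACC."""
    best_w, best_acc = 0.5, -1.0
    for w in np.arange(0.0, 1.0 + 1e-9, step):
        pred = np.argmax(
            two_layer_predict(seq_score, phys_score, w, 1.0 - w),
            axis=1)
        acc = np.mean(pred == y_true)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```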
On independent test set 1, the sequence-based single-layer integration model outperformed the physicochemical-property-based single-layer integration model. With only single-layer integration models, therefore, one cannot determine in advance which model works better: in actual subcellular localization work, for an unknown subcellular position sequence, it cannot be determined which of the two single-layer integrated models will recognize the unknown sequence better. The inventors therefore considered constructing an integrated model of two or more layers with stronger overall capability. The performance of the two-layer and single-layer integration models on independent test set 1 confirms this modeling assumption. First, the recall, ACC and F-score of the two-layer integrated model are 0.004, 0.004 and 0.001 higher, respectively, than those of the best-performing single-layer integrated model, showing that the overall capability of the two-layer model is superior. In addition, in model performance evaluation, the smaller the gap between a model's precision and its recall, the better the model: the precision of the two-layer integration model is 0.016 higher than its recall, while the precision of the best-performing single-layer integration model is 0.02 higher than its recall, further illustrating the necessity of the two-layer integration model (FIG. 5).
To further illustrate the necessity of the two-layer integrated model design, the inventors introduced independent dataset 3. Independent test set 3 was derived from iLoc-mRNA and consists of brain mRNA subcellular localization data; again, only single-location samples were retained. To reduce data redundancy, CD-HIT-EST was run with the cutoff set to 80%. Finally, 131 samples were randomly drawn from the endoplasmic reticulum and nucleus classes to generate independent dataset 3. The performance on independent dataset 3 likewise demonstrates the need for a two-layer model. In this case, the physicochemical-property-based single-layer model performed better, with precision, recall, ACC and F-score of 0.938, 0.37, and 0.484; these values are 0.001, 0.016 and 0.022 lower, respectively, than those of the two-layer integration model. Here the precision of the two-layer model is 0.552 greater than its recall, while the precision of the single-layer integrated model is 0.572 greater than its recall. The embodiments thus show that the two-layer integration model is more effective than the single-layer integration model.
Compared with mRNALoc, the precision, recall, ACC and F-score of the two-layer integration model on independent dataset 1 were 0.617, 0.601, and 0.578. To assess the prediction accuracy and generalization capability of the two-layer integrated model, comparisons were made against mRNALoc; for fairness, the mRNALoc model was replicated. The results show that on independent dataset 1 the two-layer integrated model of the embodiment of the invention is superior to mRNALoc in precision, recall, ACC and F-score, being 0.07, 0.11 and 0.069 higher, respectively. On independent dataset 2, the precision and recall of mRNALoc vary significantly, indicating a bias in mRNALoc's predictive ability on new data, whereas the two-layer integration model is relatively stable across cross-validation and the independent datasets: its precision, recall, ACC and F-score were 0.052, 0.241, 0.24 and 0.191 higher than mRNALoc, respectively. From the results on independent datasets 1 and 2, the two-layer integrated model is significantly more versatile than the mRNALoc model. The embodiment of the invention was also compared against other existing tools. The comparison experiments show that the two-layer integrated model has accurate prediction capability and strong generalization capability, which is very important for a multi-class model (FIG. 6). The results show that the two-layer integrated model performed best on independent test set 1 compared with the other two models, being significantly higher than iLoc-mRNA and RNATracker, and its good performance was also demonstrated on the independent datasets. Taken together, these results indicate that the two-layer integration model is superior to current state-of-the-art approaches for subcellular localization (FIG. 6).
According to an embodiment of the present invention, the present invention also provides a method for subcellular localization of mRNA, comprising the steps of:
obtaining mRNA subcellular position sequence samples;
and identifying the mRNA subcellular position sequence sample by using the target mRNA subcellular localization model to obtain a localization prediction result.
In this example, the trained target model provided by the invention can be used to predict mRNA subcellular localization. One concrete embodiment comprises the following steps, with a sketch given after the list:
1) acquiring the sample set of mRNA subcellular position sequences to be identified;
2) performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms (for example, any combination, or all, of the nine methods above) to obtain a plurality of feature sets;
3) identifying the plurality of feature sets according to the integrated classifier to obtain the mRNA subcellular localization of the identified sequence.
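Putting the pieces together, a hedged end-to-end sketch of steps 1) to 3) might look like the following (all names are illustrative; `group_score` is the averaging helper sketched earlier, and the location list follows the five classes in the training data):

```python
import numpy as np

LOCATIONS = ['cytoplasm', 'endoplasmic reticulum',
             'extracellular region', 'mitochondria', 'nucleus']

def predict_localization(seq, encoders, seq_group, phys_group,
                         w_seq=0.6, w_phys=0.4):
    """End-to-end sketch: encode the sequence with each feature
    algorithm, score it with the two classifier groups, combine
    the group scores with the learned weights, and return the
    predicted subcellular location."""
    seq_feats = [np.array(enc(seq)).reshape(1, -1)
                 for enc in encoders['sequence']]
    phys_feats = [np.array(enc(seq)).reshape(1, -1)
                  for enc in encoders['physicochemical']]
    score = (w_seq * group_score(seq_group, seq_feats)
             + w_phys * group_score(phys_group, phys_feats))
    return LOCATIONS[int(np.argmax(score))]
```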
According to an embodiment of the invention, the invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the method steps described above when executed by a processor.
That is, the computer readable storage medium may run the above-described mRNA subcellular localization method or the training method of the mRNA subcellular localization model.
In describing the steps of the invention in the claims and the specification, the designations S1, S2, S3, S4, one, two, three, 1, 2, 3, 4, 5 do not denote an absolute chronological or sequential order, nor an absolute logical division between steps; those skilled in the art may reasonably adjust the order and division of the steps provided the purpose of the invention can still be achieved.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and that the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed via the processor create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the specified functions. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the specified functions.

While preferred embodiments of the present invention have been described, additional variations and modifications may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A training method of an mRNA subcellular localization model, characterized by comprising the following steps:
S1, acquiring an mRNA subcellular position sequence sample set;
S2, performing feature extraction on the mRNA subcellular position sequence sample set according to a plurality of feature extraction algorithms to obtain a plurality of feature sets;
S3, identifying the feature sets respectively according to a plurality of base classifiers, and performing at least one layer of integration on the base classifiers to obtain an integrated classifier;
S4, obtaining a target mRNA subcellular localization model according to the plurality of feature extraction algorithms and the integrated classifier.
2. The method for training the mRNA subcellular localization model according to claim 1, wherein the step S1 comprises the following steps:
S11, obtaining mRNA subcellular position sequence data as positive data, and a non-mRNA subcellular position sequence file as negative data;
S12, performing data processing on the positive data and the negative data to obtain the mRNA subcellular position sequence sample set.
3. The method for training the mRNA subcellular localization model according to claim 1, wherein in step S2 the plurality of feature extraction algorithms includes any three or more of: an electron-ion interaction pseudo-trinucleotide (PseEIIP) algorithm, a trinucleotide composition (TNC) algorithm, a dinucleotide composition (DNC) algorithm, a composition of k-spaced nucleic acid pairs (CKSNAP) algorithm, a parallel-correlation pseudo-dinucleotide composition (PCPseDNC) algorithm, a parallel-correlation pseudo-trinucleotide composition (PCPseTNC) algorithm, a series-correlation pseudo-dinucleotide composition (SCPseDNC) algorithm, a series-correlation pseudo-trinucleotide composition (SCPseTNC) algorithm, and a dinucleotide-based auto-cross covariance (DACC) algorithm.
4. The method for training the mRNA subcellular localization model according to claim 1, wherein the step S3 comprises the following steps:
S31, matching each feature set, according to the data characteristics of its subcellular positions and its time complexity, with a corresponding base classifier for identification;
S32, performing at least one layer of integration on the base classifiers to obtain a target weight parameter;
S33, obtaining an integrated classifier according to the base classifiers and the target weight parameter.
5. The method of claim 3, wherein the base classifier comprises a LightGBM algorithm.
6. The method for training the mRNA subcellular localization model according to any one of claims 1 to 4, wherein step S3 is: identifying the feature sets respectively according to the base classifiers, and performing two-layer integration on the base classifiers to obtain a two-layer integrated classifier.
7. The method for training the mRNA subcellular localization model according to claim 6, wherein the step S3 includes the following steps:
S31, identifying the feature sets respectively according to the base classifiers to obtain prediction results;
S32, grouping the base classifiers according to their corresponding feature sets to obtain base classifier groups;
S33, obtaining a target weight parameter according to a grid search and an evaluation algorithm;
S34, obtaining a two-layer integrated classifier according to the base classifiers, the base classifier groups and the target weight parameter.
8. The method for training the mRNA subcellular localization model according to claim 7, wherein the evaluation algorithm includes the ACC, Recall, Precision and F1-score metrics, computed as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
When a sample's true label is positive, TP and FN denote the numbers of such samples predicted positive and negative, respectively; when a sample's true label is negative, TN and FP denote the numbers predicted negative and positive, respectively.
9. A method for subcellular localization of mRNA comprising the steps of:
obtaining mRNA subcellular position sequence samples;
the mRNA subcellular localization model trained by the method of any one of claims 1-8 is used to identify the mRNA subcellular position sequence samples to obtain a localization prediction result.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 9.
CN202111138369.5A 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium Active CN113837293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111138369.5A CN113837293B (en) 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111138369.5A CN113837293B (en) 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium

Publications (2)

Publication Number Publication Date
CN113837293A true CN113837293A (en) 2021-12-24
CN113837293B CN113837293B (en) 2024-08-27

Family

ID=78970691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138369.5A Active CN113837293B (en) 2021-09-27 2021-09-27 MRNA subcellular localization model training method, positioning method and readable storage medium

Country Status (1)

Country Link
CN (1) CN113837293B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953973A (en) * 2024-03-21 2024-04-30 电子科技大学长三角研究院(衢州) Specific biological sequence prediction method and system based on sequence homology

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201607521D0 (en) * 2016-04-29 2016-06-15 OncoImmunity AS Method
CN109390037A (en) * 2018-10-08 2019-02-26 齐齐哈尔大学 The full site recognition methods of mature miRNA based on SVM-AdaBoost
CN109872773A (en) * 2019-02-26 2019-06-11 哈尔滨工业大学 Mirco-RNA precursor recognition methods based on the fusion of Adaboost, BP neural network and random forest
CN109920477A (en) * 2019-02-26 2019-06-21 哈尔滨工业大学 The several species Pre-microRNA recognition methods merged based on Adaboost with BP neural network
CN110379464A (en) * 2019-07-29 2019-10-25 桂林电子科技大学 The prediction technique of DNA transcription terminator in a kind of bacterium
CN110415765A (en) * 2019-07-29 2019-11-05 桂林电子科技大学 A kind of prediction technique of long-chain non-coding RNA subcellular localization
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site
CN111816255A (en) * 2020-07-09 2020-10-23 江南大学 RNA-binding protein recognition by fusing multi-view and optimal multi-tag chain learning
CN112508116A (en) * 2020-12-15 2021-03-16 吉林大学 Classifier generation method and device, storage medium and electronic equipment
KR20210050362A (en) * 2019-10-28 2021-05-07 주식회사 모비스 Ensemble pruning method, ensemble model generation method for identifying programmable nucleases and apparatus for the same
CN112927217A (en) * 2021-03-23 2021-06-08 内蒙古大学 Thyroid nodule invasiveness prediction method based on target detection
CN113177927A (en) * 2021-05-17 2021-07-27 西安交通大学 Bone marrow cell classification and identification method and system based on multiple features and multiple classifiers

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201607521D0 (en) * 2016-04-29 2016-06-15 OncoImmunity AS Method
CN109390037A (en) * 2018-10-08 2019-02-26 齐齐哈尔大学 The full site recognition methods of mature miRNA based on SVM-AdaBoost
CN109872773A (en) * 2019-02-26 2019-06-11 哈尔滨工业大学 Mirco-RNA precursor recognition methods based on the fusion of Adaboost, BP neural network and random forest
CN109920477A (en) * 2019-02-26 2019-06-21 哈尔滨工业大学 The several species Pre-microRNA recognition methods merged based on Adaboost with BP neural network
CN110379464A (en) * 2019-07-29 2019-10-25 桂林电子科技大学 The prediction technique of DNA transcription terminator in a kind of bacterium
CN110415765A (en) * 2019-07-29 2019-11-05 桂林电子科技大学 A kind of prediction technique of long-chain non-coding RNA subcellular localization
KR20210050362A (en) * 2019-10-28 2021-05-07 주식회사 모비스 Ensemble pruning method, ensemble model generation method for identifying programmable nucleases and apparatus for the same
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based N in RNA6Method for predicting methyladenosine modification site
CN111816255A (en) * 2020-07-09 2020-10-23 江南大学 RNA-binding protein recognition by fusing multi-view and optimal multi-tag chain learning
CN112508116A (en) * 2020-12-15 2021-03-16 吉林大学 Classifier generation method and device, storage medium and electronic equipment
CN112927217A (en) * 2021-03-23 2021-06-08 内蒙古大学 Thyroid nodule invasiveness prediction method based on target detection
CN113177927A (en) * 2021-05-17 2021-07-27 西安交通大学 Bone marrow cell classification and identification method and system based on multiple features and multiple classifiers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DUOLIN WANG, ZHAOYUE ZHANG, YUEXU JIANG, ZITING MAO, DONG WANG, HAO LIN, DONG XU: "DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism", NUCLEIC ACIDS RESEARCH, 27 January 2021 (2021-01-27) *
DAPENG LI, YING JU, ZHIJUN LIAO, QUAN ZOU: "A review of tumor-related computational microRNA-omics" (与肿瘤相关的计算microRNA组学研究综述), Journal of Bioinformatics (《生物信息学》), 31 December 2015 (2015-12-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953973A (en) * 2024-03-21 2024-04-30 电子科技大学长三角研究院(衢州) Specific biological sequence prediction method and system based on sequence homology

Also Published As

Publication number Publication date
CN113837293B (en) 2024-08-27

Similar Documents

Publication Publication Date Title
Aliniya et al. A novel combinatorial merge-split approach for automatic clustering using imperialist competitive algorithm
Srinivas et al. A hybrid CNN-KNN model for MRI brain tumor classification
CN112035620B (en) Question-answer management method, device, equipment and storage medium of medical query system
CN107291895B (en) Quick hierarchical document query method
CN112489723B (en) DNA binding protein prediction method based on local evolution information
CN114093422B (en) Prediction method and system for interaction between miRNA and gene based on multiple relationship graph rolling network
CN113257357B (en) Protein residue contact map prediction method
Fan et al. lncLocPred: predicting LncRNA subcellular localization using multiple sequence feature information
Babu et al. A comparative study of gene selection methods for cancer classification using microarray data
Beltran et al. Predicting protein-protein interactions based on biological information using extreme gradient boosting
Ye et al. Adaptive unsupervised feature learning for gene signature identification in non-small-cell lung cancer
CN113837293B (en) MRNA subcellular localization model training method, positioning method and readable storage medium
CN111048145B (en) Method, apparatus, device and storage medium for generating protein prediction model
CN116705192A (en) Drug virtual screening method and device based on deep learning
Yang et al. PseKNC and Adaboost-based method for DNA-binding proteins recognition
Elshazly et al. Lymph diseases diagnosis approach based on support vector machines with different kernel functions
CN115881211B (en) Protein sequence alignment method, protein sequence alignment device, computer equipment and storage medium
Wei et al. Feature distribution fitting with direction-driven weighting for few-shot images classification
CN113724779A (en) SNAREs protein identification method, system, storage medium and equipment based on machine learning technology
Navamajiti et al. McBel-Plnc: a deep learning model for multiclass multilabel classification of protein-lncRNA interactions
McClannahan et al. Classification of Long Noncoding RNA Elements Using Deep Convolutional Neural Networks and Siamese Networks
CN111383710A (en) Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN116910660B (en) Self-step semi-supervised integrated classifier training method and system for unbalanced data
Arango-Argoty et al. An adaptation of Pfam profiles to predict protein sub-cellular localization in Gram positive bacteria
CN113838520B (en) III type secretion system effector protein identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant