CN111755070A - Cascade decision system-based CircRNA function prediction method - Google Patents

Publication number
CN111755070A
Authority
CN
China
Prior art keywords
circrna
algorithm
lightgbm
data
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910246724.7A
Other languages
Chinese (zh)
Inventor
邓怡云
朱勉春
戴宪华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-03-29
Filing date
2019-03-29
Publication date
2020-10-09
Application filed by National Sun Yat Sen University

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Abstract

To overcome the deficiencies of the prior art, the present invention predicts the function of circRNA using a proposed cascade decision system combined with the multi-class classification model of the LightGBM method. The technical scheme mainly comprises the following steps: (1) The circRNAs of a large data sample are input as a .bed file. (2) The circRNA .bed file is mapped according to related information such as the start site to obtain a circRNA sequence (.fasta) file. (3) A feature extraction and fusion method is provided and used to extract the circRNA features. (4) A Class A decision system is provided for predicting the function of coding-type circRNA. (5) The remaining circRNAs are predicted using the LightGBM algorithm. (6) Within the multi-class model of the LightGBM algorithm, the core algorithms GOSS and EFB perform sampling of the sample data and of the features, respectively, and a histogram-based algorithm maps continuous features into discrete buckets, discretizing the continuous variables. (7) The optimal model parameters are obtained by tuning parameters such as the maximum tree depth, the minimum number of records per leaf, and the proportion of data used in each iteration.

Description

Cascade decision system-based CircRNA function prediction method
Technical Field
The invention relates to the technical field of bioinformatics, in particular to the field of function prediction of CircRNA.
Background
CircRNA has multiple biological functions: it is rich in miRNA binding sites and can act as a miRNA sponge in cells; it can modulate the activity of proteins by binding to them; and some circRNAs can even be translated into proteins. In recent years it has also become an important potential biomarker. To determine the specific function of a newly discovered circRNA expressed in an organism, a large number of experiments are currently needed to test its possible functions one by one before a final result is obtained. The more reliable experimental methods consume too much time and equipment cost, which hinders large-scale identification of circRNA function. As a result, the important role that the specific functions of certain circRNAs play in clinical medicine cannot be explored further.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present invention predicts the function of circRNA using a proposed cascade decision system combined with the multi-class classification model of the LightGBM method. The method makes full use of the large body of data on circRNAs whose function types have already been discovered and trains a model by machine learning. To predict the function of a newly discovered circRNA with this model, the user only needs to input the relevant DNA or RNA sequence information. Experimental verification shows an accuracy of 85% or more. The method therefore greatly reduces the cost in experimental time and equipment wear, achieving twice the result with half the effort in experimental projects.
The technical scheme adopted by the invention to solve the above problems mainly comprises the following steps:
S1. The circRNAs of a large data sample are input as a .bed file, which contains the chromosome number, the sequence start site, and the plus/minus strand marker.
S2. The circRNA .bed file is mapped to the whole human genome (hg19 version) according to relevant information such as the start site, yielding a file of the specific circRNA sequences (.fasta).
S3. A feature extraction and fusion method is provided, in which different features of circRNAs expressing specific functions are extracted; the features include connection number, RBP binding sites, and miRNA binding sites.
S4. A cascade decision system is provided: the Class A decision system predicts coding-type circRNA function. The ORF length, ORF ratio, and IRES from S3 are analyzed individually. Every circRNA that meets the conditions is predicted to be a "coding" circRNA.
S5. CircRNAs not predicted to be "coding" in S4 are passed on to the next stage and predicted by the LightGBM algorithm.
S6. Within the multi-class model of the LightGBM algorithm, the core algorithms GOSS and EFB perform sampling of the sample data and of the features, respectively, which greatly reduces the training time of the model without changing the data distribution and without loss of learner accuracy. In addition, a histogram-based algorithm maps continuous features into discrete buckets (bins), and a histogram built from these bins discretizes the continuous variables.
S7. Finally, parameters such as the maximum tree depth max_depth, the minimum number of records a leaf may hold min_data_in_leaf, and the proportion of data used in each iteration bagging_fraction are tuned to obtain the optimal parameters of the model.
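Steps S1 and S2 can be illustrated with a minimal Python sketch. It assumes a six-column BED layout (chrom, start, end, name, score, strand) with 0-based half-open coordinates, and it uses a tiny in-memory dictionary in place of the hg19 genome; the function names and the toy data are hypothetical and not taken from the patent. The sketch also slices the plain genomic interval, whereas a real circRNA sequence may require exon-level splicing.

```python
# Minimal sketch of steps S1-S2: BED records -> FASTA text.
# Assumptions: 6-column BED (chrom, start, end, name, score, strand),
# 0-based half-open coordinates, genome held in a plain dict.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_bed_line(line):
    """Split one BED line into (chrom, start, end, name, strand)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
    strand = fields[5] if len(fields) > 5 else "+"
    return chrom, start, end, name, strand


def extract_sequence(genome, chrom, start, end, strand):
    """Slice the genomic interval; reverse-complement minus-strand records."""
    seq = genome[chrom][start:end]
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


def bed_to_fasta(bed_lines, genome):
    """Return FASTA text with one record per non-empty, non-comment BED line."""
    records = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end, name, strand = parse_bed_line(line)
        seq = extract_sequence(genome, chrom, start, end, strand)
        records.append(f">{name}|{chrom}:{start}-{end}({strand})\n{seq}")
    return "\n".join(records) + "\n"


# Toy genome standing in for hg19 (hypothetical data).
toy_genome = {"chr1": "ACGTACGTAAGGCCTT"}
toy_bed = [
    "chr1\t2\t8\tcirc_demo_1\t0\t+",
    "chr1\t8\t12\tcirc_demo_2\t0\t-",
]
fasta_text = bed_to_fasta(toy_bed, toy_genome)
```

For real data, the dictionary would be replaced by an indexed reader over the hg19 reference; that choice is left open here.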
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a multi-feature fusion algorithm obtained through theoretical derivation. Within the LightGBM method, GOSS samples the sample data and EFB samples the features, and the optimal model parameters are obtained by tuning the maximum tree depth, the minimum number of records per leaf, and similar parameters.
The invention uses the circRNA sequence and related information upstream and downstream of the sequence to extract multiple features, and combines them with a multi-feature fusion algorithm to form the feature input.
The method provided by the invention can be applied to predicting the function of newly discovered circRNAs. It offers substantial improvements in accuracy, computation speed, and algorithm stability, and is well suited to practical circRNA function prediction work.
Drawings
FIG. 1 is a flow chart of the invention.
FIG. 2 shows the Class A cascade decision system.
FIG. 3 shows the core algorithms of LightGBM.
FIG. 4 shows the tuning procedure for LightGBM parameters, addressing problems such as overfitting.
FIG. 5 shows the confusion matrix obtained with the optimal parameters.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the following embodiments and accompanying drawings.
FIG. 1 shows a flow chart of the circRNA function prediction method based on the cascade decision system and LightGBM in this embodiment. The technical scheme adopted by the invention to solve the above problems mainly comprises the following steps:
S1. The circRNAs of a large data sample are input as a .bed file, which contains the chromosome number, the sequence start site, and the plus/minus strand marker.
S2. The circRNA .bed file is mapped to the whole human genome (hg19 version) according to relevant information such as the start site, yielding a file of the specific circRNA sequences (.fasta).
S3. A feature extraction and fusion method is provided, in which different features of circRNAs expressing specific functions are extracted; the features include connection number, RBP binding sites, and miRNA binding sites.
S4. A cascade decision system is provided: the Class A decision system predicts coding-type circRNA function. The ORF length, ORF ratio, and IRES from S3 are analyzed individually. Every circRNA that meets the conditions is predicted to be a "coding" circRNA.
S5. CircRNAs not predicted to be "coding" in S4 are passed on to the next stage and predicted by the LightGBM algorithm.
S6. Within the multi-class model of the LightGBM algorithm, the core algorithms GOSS and EFB perform sampling of the sample data and of the features, respectively, which greatly reduces the training time of the model without changing the data distribution and without loss of learner accuracy. In addition, a histogram-based algorithm maps continuous features into discrete buckets (bins), and a histogram built from these bins discretizes the continuous variables.
S7. Finally, parameters such as the maximum tree depth max_depth, the minimum number of records a leaf may hold min_data_in_leaf, and the proportion of data used in each iteration bagging_fraction are tuned to obtain the optimal parameters of the model.
FIG. 2 shows the Class A cascade decision system, i.e., the prediction of the "coding" function of a newly discovered circRNA.
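A Class A decision of this kind might be sketched as follows. The patent names the three inputs (ORF length, ORF ratio, IRES) but does not state thresholds, so min_orf_len and min_orf_ratio below are hypothetical placeholders, and the IRES evidence is passed in as a ready-made boolean. The ORF scan is linear over the three forward frames and ignores ORFs that span the back-splice junction, which is a simplification.

```python
# Sketch of a Class A ("coding") decision on ORF length, ORF ratio and IRES.
# All thresholds are hypothetical; the source does not state them.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in nt of the longest ATG...stop ORF over the three forward frames."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def is_coding(seq, has_ires, min_orf_len=30, min_orf_ratio=0.3):
    """Predict 'coding' only if every condition holds."""
    orf_len = longest_orf_length(seq)
    orf_ratio = orf_len / len(seq) if seq else 0.0
    return bool(orf_len >= min_orf_len and orf_ratio >= min_orf_ratio and has_ires)


def cascade_predict(seq, has_ires, fallback):
    """Class A decision first; everything else goes to the fallback classifier."""
    return "coding" if is_coding(seq, has_ires) else fallback(seq)


# Toy sequence: 2 nt, ATG, ten GCT codons, TAA, 2 nt.
demo_seq = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
demo_label = cascade_predict(demo_seq, True, lambda s: "other")
```

In the full method the fallback would be the trained LightGBM multi-class model; a lambda stands in for it here.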
FIG. 3 shows a flow chart of the core algorithms of LightGBM.
S1. The main algorithms of LightGBM are GOSS and EFB, which reduce the data volume and the feature dimension and so speed up computation in a large-sample setting. The GOSS algorithm proceeds as follows:
Input: the training data of the large circRNA sample, the number of iterations d, the sampling rate a (0 < a < 1) for large-gradient data, the sampling rate b (0 < b < 1) for small-gradient data, the chosen loss function, and the type of weak learner.
The specific process comprises the following steps:
(1) Sort the samples in descending order of the absolute value of their gradients.
(2) Select the first a × 100% of the sorted samples from step (1) to form the subset of large-gradient sample points.
(3) From the remaining (1 − a) × 100% of the samples, randomly select sample points amounting to b × (1 − a) × 100% of all samples to form the set of small-gradient sample points.
(4) Merge the large-gradient samples with the sampled small-gradient samples.
(5) Multiply the small-gradient samples by a weight coefficient.
(6) Train a new weak learner on the sampled data.
(7) Repeat steps (1) to (6) until the specified number of iterations is reached or the model converges.
Output: a trained strong learner.
The GOSS algorithm works by sampling the data, and the resulting randomness increases the diversity of the weak learners, which helps improve the generalization ability of the trained model.
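Steps (1) to (5) of one GOSS round can be sketched in plain Python as below. The sketch follows the sampling rates as described above, where b is applied to the remaining samples. The text does not give the weight coefficient, so the value 1/b used here is an assumption chosen so that the sampled small-gradient subset keeps the total weight of the samples it represents; the original LightGBM paper defines b relative to the full data set and uses (1 − a)/b instead. The function name and toy gradients are hypothetical.

```python
import random


def goss_sample(gradients, a, b, rng):
    """Return (indices, weights) for one GOSS sampling round.

    a: fraction of all samples kept as large-gradient samples.
    b: fraction of the remaining samples drawn at random.
    """
    n = len(gradients)
    # (1) Sort sample indices by absolute gradient, descending.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    # (2) Keep the top a * 100% as the large-gradient subset.
    top_n = int(a * n)
    large, rest = order[:top_n], order[top_n:]
    # (3) Randomly draw a fraction b of the remaining samples.
    small = rng.sample(rest, int(b * len(rest)))
    # (4) Merge both subsets.
    indices = large + small
    # (5) Up-weight the small-gradient samples (1/b is an assumption, see lead-in).
    weights = [1.0] * len(large) + [1.0 / b] * len(small)
    return indices, weights


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_gradients = [0.9, -0.1, 0.05, -0.8, 0.2, 0.01, -0.3, 0.02, 0.6, -0.04]
sample_idx, sample_w = goss_sample(toy_gradients, a=0.2, b=0.5, rng=rng)
```

A weak learner would then be fitted on the rows in sample_idx using sample_w as per-row weights, and the round repeated as in steps (6) and (7).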
S2. The EFB algorithm.
Input: the total feature set F produced by the feature fusion algorithm, the maximum conflict number K, and a graph G.
The specific process comprises the following steps:
(1) Construct a graph with weighted edges, where each weight corresponds to the total conflict between the two features.
(2) Sort the features in descending order by their count of non-zero values.
(3) Examine every feature in the ordered list and either assign it to an existing bundle with a small conflict (as limited by K) or create a new bundle for it.
Output: the set of feature bundles, bundles.
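The greedy bundling above can be sketched as follows. The sketch assumes that the conflict between two features is the number of rows in which both are non-zero, and that a feature joins a bundle only while the bundle's accumulated conflict stays at or below K; the feature names and values are hypothetical toy data.

```python
def conflict_count(f1, f2):
    """Number of rows where both features are non-zero."""
    return sum(1 for x, y in zip(f1, f2) if x != 0 and y != 0)


def greedy_bundling(features, max_conflict):
    """features: dict name -> list of values. Returns a list of bundles (name lists)."""
    names = list(features)
    # (1) Weighted graph: edge weight = conflict between the two features.
    graph = {
        (p, q): conflict_count(features[p], features[q])
        for p in names for q in names if p != q
    }
    # (2) Order features by their count of non-zero values, descending.
    order = sorted(
        names,
        key=lambda name: sum(1 for v in features[name] if v != 0),
        reverse=True,
    )
    # (3) Assign each feature to the first bundle that stays within the
    #     conflict budget, or open a new bundle.
    bundles, bundle_conflicts = [], []
    for name in order:
        for k, bundle in enumerate(bundles):
            added = sum(graph[(name, member)] for member in bundle)
            if bundle_conflicts[k] + added <= max_conflict:
                bundle.append(name)
                bundle_conflicts[k] += added
                break
        else:
            bundles.append([name])
            bundle_conflicts.append(0)
    return bundles


toy_features = {
    "f1": [1, 0, 0, 2, 0, 0],
    "f2": [0, 3, 0, 0, 1, 0],
    "f3": [1, 1, 0, 0, 0, 0],
    "f4": [0, 0, 5, 0, 0, 4],
}
strict_bundles = greedy_bundling(toy_features, max_conflict=0)
loose_bundles = greedy_bundling(toy_features, max_conflict=2)
```

With K = 0 only mutually exclusive features share a bundle; raising K trades a small amount of accuracy for fewer bundles.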
S3. The histogram algorithm. Its main process is to discretize continuous feature values into k integers and construct a histogram of width k. While traversing the data, the discretized values serve as indices for accumulating statistics in the histogram. After a single pass over the data, the histogram holds the required statistics, and the optimal split point is then found by traversing the discrete values of the histogram. The histogram algorithm has the following advantages:
(1) The cost of computing the split gain is lower than in other approaches, such as the pre-sorted algorithm in XGBoost.
(2) Model training is further accelerated by histogram subtraction.
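The histogram procedure and the subtraction trick can be sketched in plain Python. Two simplifications are assumed here and are not taken from the patent: bins have equal width, and the split gain is the variance-style score G_L²/n_L + G_R²/n_R − G²/n computed from gradient sums and counts. All names and data are illustrative.

```python
def bin_index(value, lo, hi, k):
    """Map a continuous value in [lo, hi] to one of k equal-width bins."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * k), k - 1)


def build_histogram(values, gradients, lo, hi, k):
    """One pass over the data: accumulate [gradient sum, count] per bin."""
    hist = [[0.0, 0] for _ in range(k)]
    for v, g in zip(values, gradients):
        b = bin_index(v, lo, hi, k)
        hist[b][0] += g
        hist[b][1] += 1
    return hist


def best_split(hist):
    """Scan bin boundaries; return (gain, index of the last bin on the left)."""
    total_g = sum(h[0] for h in hist)
    total_n = sum(h[1] for h in hist)
    best = (0.0, None)
    left_g, left_n = 0.0, 0
    for b in range(len(hist) - 1):
        left_g += hist[b][0]
        left_n += hist[b][1]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best[0]:
            best = (gain, b)
    return best


def subtract_histograms(parent, child):
    """Sibling histogram = parent histogram - child histogram."""
    return [[p[0] - c[0], p[1] - c[1]] for p, c in zip(parent, child)]


values = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
lo, hi, k = min(values), max(values), 4
parent_hist = build_histogram(values, grads, lo, hi, k)
gain, split_bin = best_split(parent_hist)
left_hist = build_histogram(values[:4], grads[:4], lo, hi, k)
right_hist = subtract_histograms(parent_hist, left_hist)
```

The last two lines show why subtraction saves work: only one child needs a pass over its rows, and the other child's histogram follows from the parent's.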
See FIG. 3C for the main characteristics of LightGBM:
S1. LightGBM grows trees with a leaf-wise strategy. That is, among all current leaves, the leaf with the largest split gain is chosen and split, and this process is repeated. Compared with level-wise growth, leaf-wise growth can reduce more error and reach better accuracy for the same number of splits. However, when the number of samples is not large enough, leaf-wise growth may cause overfitting. LightGBM can therefore limit the depth of the tree with the max_depth parameter to reduce the likelihood of overfitting.
S2. In feature parallelism, which reduces the feature dimension handled by each worker and speeds up computation, LightGBM does not partition the sample data vertically; instead, every worker holds all the data, so every worker knows how to partition the data. The main flow of feature parallelism in LightGBM is as follows:
(1) Each worker finds the best split point {feature, threshold} on its local feature set.
(2) The workers communicate their local best splits with each other to obtain the overall best split.
(3) The best split is performed.
S3. LightGBM reduces the communication overhead of data parallelism: it uses Reduce Scatter to merge the histograms of different, non-overlapping features across workers. Each worker then finds the local best split on its locally merged histograms and synchronizes to obtain the global best split.
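The feature-parallel flow in S2 can be imitated in a single process, with each "worker" represented by the list of feature names assigned to it. This is only a sketch of the control flow, not LightGBM's distributed implementation; the gain function is the same variance-style score assumed for illustration, and the feature names and data are hypothetical.

```python
def split_gain(values, targets, threshold):
    """Variance-style gain of splitting rows at value <= threshold."""
    left = [t for v, t in zip(values, targets) if v <= threshold]
    right = [t for v, t in zip(values, targets) if v > threshold]
    if not left or not right:
        return 0.0
    total = sum(targets)
    return (sum(left) ** 2 / len(left)
            + sum(right) ** 2 / len(right)
            - total ** 2 / len(targets))


def local_best_split(columns, targets, feature_ids):
    """Step (1): best (gain, feature, threshold) over one worker's features."""
    best = (0.0, None, None)
    for f in feature_ids:
        for threshold in sorted(set(columns[f]))[:-1]:
            gain = split_gain(columns[f], targets, threshold)
            if gain > best[0]:
                best = (gain, f, threshold)
    return best


def feature_parallel_split(columns, targets, assignment):
    """assignment: one list of feature names per simulated worker."""
    local = [local_best_split(columns, targets, ids) for ids in assignment]
    # Step (2): exchange local results and keep the globally best split.
    global_best = max(local, key=lambda s: s[0])
    _, feature, threshold = global_best
    if feature is None:
        return global_best, list(range(len(targets))), []
    # Step (3): every worker holds all rows, so each can apply the split itself.
    left_rows = [i for i, v in enumerate(columns[feature]) if v <= threshold]
    right_rows = [i for i, v in enumerate(columns[feature]) if v > threshold]
    return global_best, left_rows, right_rows


columns = {
    "rbp_sites": [1, 2, 3, 4, 5, 6],
    "mirna_sites": [5, 1, 4, 2, 6, 3],
}
targets = [-1, -1, -1, 1, 1, 1]
best, left_rows, right_rows = feature_parallel_split(
    columns, targets, assignment=[["rbp_sites"], ["mirna_sites"]]
)
```

Because every worker holds the full rows, only the small (gain, feature, threshold) tuples need to be exchanged, which is the point of step (2).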
Referring to FIG. 4, a parameter tuning method for when LightGBM overfits during training is shown. Machine learning models run into various problems during training. To obtain the best parameters and the best results, the model's parameters are tuned; the adjustments for specific problems are as follows:
S1. To obtain a faster training speed, adjust the following parameters:
(1) use bagging by setting the bagging_fraction and bagging_freq parameters;
(2) use feature sub-sampling by setting the feature_fraction parameter;
(3) decrease max_bin;
(4) use save_binary to speed up data loading in later training runs.
S2. To obtain better accuracy, adjust the following parameters:
(1) use a larger max_bin, num_iterations, and num_leaves;
(2) use a smaller learning rate.
S3. To deal with overfitting when it occurs:
(1) use a smaller max_bin and num_leaves;
(2) use bagging by setting bagging_fraction and bagging_freq;
(3) use feature sub-sampling by setting feature_fraction;
(4) use more training data;
(5) use regularization via lambda_l1, lambda_l2, and min_split_gain;
(6) try max_depth to avoid growing an overly deep tree.
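One way to organize such tuning is to enumerate candidate parameter sets and score each one. The sketch below only builds the candidates; the parameter names are standard LightGBM parameters, but every value, including the number of classes, is a placeholder rather than a setting reported in the patent.

```python
from itertools import product

# Fixed parameters; all values are placeholders for illustration.
base_params = {
    "objective": "multiclass",
    "num_class": 4,            # hypothetical number of circRNA function classes
    "learning_rate": 0.05,
    "feature_fraction": 0.8,
    "bagging_freq": 1,
    "lambda_l1": 0.0,
    "lambda_l2": 1.0,
    "min_split_gain": 0.0,
}

# Parameters named in the tuning step, with hypothetical candidate values.
search_space = {
    "max_depth": [4, 6, 8],
    "min_data_in_leaf": [10, 20],
    "bagging_fraction": [0.7, 0.9],
}


def candidate_params(base, space):
    """Yield one parameter dict per combination in the search space."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        params = dict(base)
        params.update(zip(keys, values))
        # A tree of depth d has at most 2**d leaves; cap num_leaves accordingly.
        params["num_leaves"] = min(31, 2 ** params["max_depth"])
        yield params


candidates = list(candidate_params(base_params, search_space))
# Each candidate could then be scored with cross-validation, for example
# lightgbm.cv(params, train_set, ...), and the best-scoring set kept.
```

A grid is the simplest choice; random or Bayesian search would follow the same pattern of generating dictionaries and scoring them.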
Referring to FIG. 5, the confusion matrix obtained with the optimal feature combination and the optimal parameters when classifying with LightGBM is shown.

Claims (2)

1. A circRNA function prediction method based on a cascade decision system, characterized in that:
the method comprises a cascade decision system and the LightGBM algorithm and performs classification prediction of the known functions of circRNA; a multi-class model of the LightGBM algorithm is trained on a large, experimentally verified data sample that has been processed by a multi-feature fusion method, and the resulting model can readily be used to predict the function of new circRNAs.
2. The cascade decision system-based circRNA function prediction method of claim 1, comprising the following steps:
S1. The circRNAs of a large data sample are input as a .bed file, which contains the chromosome number, the sequence start site, and the plus/minus strand marker.
S2. The circRNA .bed file is mapped to the whole human genome (hg19 version) according to relevant information such as the start site, yielding a file of the specific circRNA sequences (.fasta).
S3. A feature extraction and fusion method is provided, in which different features of circRNAs expressing specific functions are extracted; the features include connection number, RBP binding sites, and miRNA binding sites.
S4. A cascade decision system is provided: the Class A decision system predicts coding-type circRNA function. The ORF length, ORF ratio, and IRES from S3 are analyzed individually. Every circRNA that meets the conditions is predicted to be a "coding" circRNA.
S5. CircRNAs not predicted to be "coding" in S4 are passed on to the next stage and predicted by the LightGBM algorithm.
S6. Within the multi-class model of the LightGBM algorithm, the core algorithms GOSS and EFB perform sampling of the sample data and of the features, respectively, which greatly reduces the training time of the model without changing the data distribution and without loss of learner accuracy. In addition, a histogram-based algorithm maps continuous features into discrete buckets (bins), and a histogram built from these bins discretizes the continuous variables.
S7. Finally, parameters such as the maximum tree depth max_depth, the minimum number of records a leaf may hold min_data_in_leaf, and the proportion of data used in each iteration bagging_fraction are tuned to obtain the optimal parameters of the model.
CN201910246724.7A 2019-03-29 2019-03-29 Cascade decision system-based CircRNA function prediction method Pending CN111755070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910246724.7A CN111755070A (en) 2019-03-29 2019-03-29 Cascade decision system-based CircRNA function prediction method

Publications (1)

Publication Number Publication Date
CN111755070A true CN111755070A (en) 2020-10-09

Family

ID=72671199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910246724.7A Pending CN111755070A (en) 2019-03-29 2019-03-29 Cascade decision system-based CircRNA function prediction method

Country Status (1)

Country Link
CN (1) CN111755070A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination