CN111755070A

CN111755070A - Cascade decision system-based CircRNA function prediction method

Info

Publication number: CN111755070A
Application number: CN201910246724.7A
Authority: CN
Inventors: 邓怡云; 朱勉春; 戴宪华
Original assignee: National Sun Yat Sen University
Current assignee: Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-10-09

Abstract

To overcome the deficiencies of the prior art, the present invention aims to predict the function of circRNA by combining the proposed cascade decision system with the multi-classification model of the LightGBM method. The technical scheme adopted by the invention mainly comprises the following steps: (1) The circRNAs of a large data sample are input as a (.bed) file. (2) The circRNA (.bed) file is mapped according to information such as the start site to obtain a circRNA sequence information (.fasta) file. (3) A feature extraction and fusion method is provided and used to extract the circRNA features. (4) A class A decision system is provided for predicting the function of coding-type circRNAs. (5) The remaining circRNAs are predicted using the LightGBM algorithm. (6) Within the multi-classification model of the LightGBM algorithm, the core algorithms GOSS and EFB are used to sample the data instances and the features respectively, and a histogram-based algorithm maps continuous features into discrete buckets, discretizing the continuous variables. (7) The optimal parameters of the model are obtained by tuning parameters such as the maximum depth of the tree, the minimum number of records per leaf, and the proportion of data used in each iteration.

Description

Cascade decision system-based CircRNA function prediction method

Technical Field

The invention relates to the technical field of bioinformatics, in particular to the field of function prediction of CircRNA.

Background

CircRNAs have multiple biological functions: they carry abundant miRNA binding sites and act as miRNA sponges in cells; they modulate the activity of proteins by binding to them; and some circRNAs can even be translated into proteins. In recent years they have also become important potential biomarkers. To determine the specific function of a newly discovered circRNA expressed in an organism, a large number of experiments are currently needed to identify circRNA functions one by one. Experimental methods with higher credibility consume too much time and equipment cost, which hinders large-scale identification of circRNA function. As a result, the important role that the specific functions of certain circRNAs play in clinical medicine cannot be explored further.

Disclosure of Invention

To overcome the deficiencies of the prior art, the present invention aims to predict the function of circRNA by combining the proposed cascade decision system with the multi-classification model of the LightGBM method. The method makes full use of the large body of data on circRNAs whose function types have already been discovered and trains a model with a machine learning method. The model then predicts the function of a newly discovered circRNA from nothing more than the sequence information of the DNA or RNA whose function is to be identified. Experimental validation shows an accuracy of 85% or higher, which greatly reduces the time and equipment cost of experiments and lets experimental projects achieve twice the result with half the effort.

The technical scheme adopted by the invention to solve the above problems mainly comprises the following steps:

S1. The circRNAs of a large data sample are input as a (.bed) file, which contains the chromosome number, the sequence start site, and the strand (plus or minus) marker.

S2. The circRNA (.bed) file is mapped to the whole human genome (hg19 version) according to information such as the start site, yielding a file of specific circRNA sequence information (.fasta).

S3. A feature extraction and fusion method is provided, which extracts different features of circRNAs that express specific functions, including the connection number, RBP binding sites, and miRNA binding sites.

S4. A cascade decision system is provided: the class A decision system is used to predict the function of coding-type circRNAs. The ORF length, ORF ratio, and IRES features extracted in S3 are evaluated one by one, and every circRNA that meets the conditions is predicted to be a "coding" circRNA.

S5. The remaining circRNAs, which were not predicted to be "coding" circRNAs in S4, are passed to the LightGBM algorithm for the next prediction step.

S6. Within the multi-classification model of the LightGBM algorithm, the core algorithms GOSS and EFB are used to sample the data instances and the features respectively, which greatly reduces the training cost of the model without losing the accuracy of the learner and without changing the data distribution. In addition, a histogram-based algorithm maps continuous features into discrete buckets (bins) and then builds a histogram from the bins, discretizing the continuous variables.

S7. Finally, parameters such as the maximum depth of the tree (max_depth), the minimum number of records a leaf may hold (min_data_in_leaf), and the proportion of data used in each iteration (bagging_fraction) are tuned to obtain the optimal parameters of the model.
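Steps S1 and S2 above can be illustrated with a minimal sketch. It assumes tab-separated BED6-style records (chromosome, start, end, name, score, strand) and takes the whole interval as the sequence, ignoring any exon block structure. The function names `parse_bed` and `bed_to_fasta`, the record names, and the toy genome are hypothetical; a real run would load the hg19 reference instead of the in-memory string used here.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Complement table used to reverse-complement minus-strand records.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_bed(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, str, str]]:
    """Yield (chrom, start, end, name, strand) from BED6-style lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        yield chrom, start, end, name, strand


def bed_to_fasta(bed_lines: Iterable[str], genome: Dict[str, str]) -> str:
    """Map BED intervals onto `genome` and return FASTA text.

    BED coordinates are 0-based and half-open, so genome[chrom][start:end]
    is exactly the interval sequence.
    """
    records = []
    for chrom, start, end, name, strand in parse_bed(bed_lines):
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        records.append(f">{name}({strand})\n{seq}")
    return "\n".join(records) + "\n"


# Toy stand-in for hg19 (assumption for illustration only).
toy_genome = {"chr1": "ACGTACGTAGCTAGCTAACG"}
toy_bed = [
    "chr1\t2\t10\tcirc_demo_1\t0\t+",
    "chr1\t4\t12\tcirc_demo_2\t0\t-",
]
fasta_text = bed_to_fasta(toy_bed, toy_genome)
print(fasta_text)
```

The minus-strand record is reverse-complemented so that every sequence is written 5' to 3' on its own strand.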

Compared with the prior art, the invention has the beneficial effects that:

The algorithm used by the invention provides a multi-feature fusion algorithm through theoretical derivation, samples the data instances with GOSS and reduces the number of features with EFB in the LightGBM method, and obtains the optimal model parameters by tuning the maximum depth of the tree, the minimum number of records per leaf, and similar parameters.

The invention uses the circRNA sequence and related information upstream and downstream of the sequence to extract multiple features, and combines these features with a multi-feature fusion algorithm to form the input feature information.

The method provided by the invention can be applied to predicting the function of newly discovered circRNAs. It offers considerable improvements in accuracy, computation speed, and algorithm stability, and is therefore better suited to practical circRNA function prediction.

Drawings

FIG. 1 flow chart of the invention

FIG. 2 Class A cascade decision system

FIG. 3 LightGBM core Algorithm map

FIG. 4 Parameter tuning procedure for LightGBM to address overfitting and similar problems

FIG. 5 optimal parameter confusion matrix map

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the following embodiments and accompanying drawings.

Referring to FIG. 1, a flow chart of the circRNA function prediction method based on the cascade decision system and LightGBM in this embodiment is shown. The technical scheme adopted by the invention mainly comprises the following steps:

FIG. 2 is a diagram of the class A cascade decision system, which predicts whether a newly discovered circRNA has the "coding" function.
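The class A cascade can be sketched as a chain of checks followed by a fallback. This is only an illustration: the source does not state the cut-off values or the exact order of the checks, so the thresholds `MIN_ORF_LENGTH` and `MIN_ORF_RATIO`, the requirement that all three checks pass, and the names `CodingFeatures`, `class_a_decision`, and `predict_function` are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CodingFeatures:
    orf_length: int    # length of the longest ORF, in nucleotides
    orf_ratio: float   # ORF length divided by circRNA length
    has_ires: bool     # whether an IRES element was detected


# Hypothetical thresholds; the source does not give the actual cut-offs.
MIN_ORF_LENGTH = 300
MIN_ORF_RATIO = 0.5


def class_a_decision(f: CodingFeatures) -> bool:
    """Return True only when every stage of the cascade passes."""
    if f.orf_length < MIN_ORF_LENGTH:
        return False
    if f.orf_ratio < MIN_ORF_RATIO:
        return False
    return f.has_ires


def predict_function(f: CodingFeatures,
                     fallback: Callable[[CodingFeatures], str]) -> str:
    """Label as "coding" if the cascade accepts, else defer to the fallback."""
    if class_a_decision(f):
        return "coding"
    return fallback(f)


# The fallback stands in for the trained LightGBM multi-class model.
demo_fallback = lambda f: "non-coding (to be classified by LightGBM)"
label_a = predict_function(CodingFeatures(450, 0.7, True), demo_fallback)
label_b = predict_function(CodingFeatures(450, 0.7, False), demo_fallback)
print(label_a)
print(label_b)
```

Ordering the cheap checks first means a circRNA that fails on ORF length never needs the more expensive IRES analysis.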

Referring to fig. 3, a flow chart of a core algorithm of the LightGBM is shown.

S1. The main algorithms of LightGBM are GOSS and EFB, which reduce the data volume and the feature dimension and thereby speed up computation on large sample data. The flow of the GOSS algorithm is as follows:

Input: training data of the circRNA large sample, number of iterations d, sampling rate a (0 < a < 1) for large-gradient data, sampling rate b (0 < b < 1) for small-gradient data, the chosen loss function, and the type of weak learner.

The specific process comprises the following steps:

(1) Sort the samples in descending order of the absolute value of their gradients;

(2) Select the first a × 100% of the samples from the sorted result of step (1) to generate the subset of large-gradient sample points;

(3) Randomly select b × (1 − a) × 100% of the samples from the remaining (1 − a) × 100% of the samples to generate the set of small-gradient sample points;

(4) Merge the large-gradient samples with the sampled small-gradient samples;

(5) Multiply the small-gradient samples by a weight coefficient;

(6) Learn a new weak learner using the sampled data;

(7) Repeat steps (1) to (6) until the specified number of iterations is reached or the model converges.

Output: a trained strong learner.

The GOSS algorithm samples the data, and the resulting randomness increases the diversity of the weak learners, which helps improve the generalization ability of the trained model.
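One GOSS sampling round, steps (1) to (5) above, can be sketched with NumPy as follows. The function name `goss_sample` is hypothetical. The source does not specify the weight coefficient; this sketch assumes b is the fraction drawn from the remaining samples and uses 1/b so that the sampled small-gradient points stand in for the whole remainder. (The LightGBM paper defines b relative to the full data set and uses (1 − a)/b instead.)

```python
import numpy as np


def goss_sample(gradients, a, b, rng):
    """Return (indices, weights) for one GOSS sampling round."""
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))             # step (1): sort by |gradient|
    top_n = int(round(a * n))
    large_idx = order[:top_n]                           # step (2): keep top a * 100%
    rest = order[top_n:]
    rand_n = int(round(b * len(rest)))
    small_idx = rng.choice(rest, size=rand_n, replace=False)  # step (3)
    indices = np.concatenate([large_idx, small_idx])    # step (4): merge
    weights = np.ones(len(indices))
    weights[top_n:] = 1.0 / b                           # step (5): assumed coefficient
    return indices, weights


rng = np.random.default_rng(0)
grads = rng.normal(size=1000)
idx, w = goss_sample(grads, a=0.2, b=0.1, rng=rng)
print(len(idx), w[:3], w[-3:])
```

The weak learner of step (6) would then be fitted on `idx` only, with `w` applied to the gradient statistics of the selected samples.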

S2. EFB algorithm. Input: the total feature set F produced by the feature fusion algorithm, the maximum conflict count K, and a graph G.

The specific process comprises the following steps:

(1) Construct a graph with weighted edges, where each weight corresponds to the total conflict between two features;

(2) Sort the features in descending order of their non-zero value counts;

(3) Examine each feature in the sorted list and either assign it to an existing bundle with a small conflict (controlled by K) or create a new bundle for it.

Output: the set of feature bundles (bundles).
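The greedy bundling above can be sketched with NumPy. The sketch assumes that the "conflict" between two features is the number of rows in which both are non-zero, and it derives the weighted graph as a conflict matrix from the data rather than taking it as input. The name `greedy_bundling` and the toy matrix are illustrative only.

```python
import numpy as np


def greedy_bundling(X, max_conflict):
    """Group columns of X into bundles whose total conflict stays <= max_conflict."""
    nonzero = (X != 0).astype(int)
    # Step (1): conflicts[i, j] = number of rows where features i and j
    # are both non-zero (the edge weights of the graph).
    conflicts = nonzero.T @ nonzero
    # Step (2): features in descending order of non-zero count.
    order = np.argsort(-nonzero.sum(axis=0), kind="stable")
    bundles, bundle_conflict = [], []
    # Step (3): put each feature in the first bundle that can absorb it.
    for f in order:
        for k, bundle in enumerate(bundles):
            added = int(conflicts[f, bundle].sum())
            if bundle_conflict[k] + added <= max_conflict:
                bundle.append(int(f))
                bundle_conflict[k] += added
                break
        else:
            bundles.append([int(f)])
            bundle_conflict.append(0)
    return bundles


X_demo = np.array([
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [4, 0, 0, 5],
    [0, 0, 6, 0],
    [0, 7, 0, 0],
])
bundles_strict = greedy_bundling(X_demo, max_conflict=0)
bundles_loose = greedy_bundling(X_demo, max_conflict=2)
print(bundles_strict)
print(bundles_loose)
```

With no conflict allowed, features 0 and 3 (both non-zero in rows 0 and 2) must land in different bundles; raising the limit to 2 lets all four features share one bundle.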

S3. Histogram algorithm. The histogram algorithm discretizes continuous feature values into k integers and constructs a histogram with k bins. While traversing the data, the discretized values are used as indexes to accumulate statistics in the histogram. After one pass over the data, the histogram holds the required statistics, and the optimal split point is then found by traversing the discrete values of the histogram. The histogram algorithm has the following advantages:

(1) The cost of computing the split gain is reduced relative to other algorithms, such as the pre-sorted algorithm in XGBoost.

(2) The training of the model is further accelerated by histogram subtraction.
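The histogram procedure and the subtraction trick can be sketched for a single feature as follows. The sketch assumes fixed bin edges and a simple squared-loss style gain, G_L²/n_L + G_R²/n_R − G²/n; LightGBM's actual binning and gain formula (with hessians and regularization) are more involved. The function names and the toy data are hypothetical.

```python
import numpy as np


def build_histogram(bins, grad, k):
    """Accumulate per-bin gradient sums and sample counts in one pass."""
    grad_sum = np.bincount(bins, weights=grad, minlength=k)
    count = np.bincount(bins, minlength=k)
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan the bin boundaries and return (best_bin, best_gain).

    Samples with bin index <= best_bin go to the left child.
    """
    total_g, total_n = grad_sum.sum(), count.sum()
    best_bin, best_gain = -1, 0.0
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_g = total_g - left_g
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


k = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
grad = np.array([-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
bins = np.digitize(x, [2.5, 5.0, 7.5])        # discretize into k integer bins

parent_g, parent_n = build_histogram(bins, grad, k)
split_bin, split_gain = best_split(parent_g, parent_n)

# Histogram subtraction: build only the left child's histogram and obtain
# the right child's as parent minus left.
left = bins <= split_bin
left_g, left_n = build_histogram(bins[left], grad[left], k)
right_g, right_n = parent_g - left_g, parent_n - left_n
print(split_bin, split_gain, right_g, right_n)
```

The scan touches k bins instead of every distinct feature value, and the subtraction means only one child per split needs a pass over its data.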

See FIG. 3C, which shows the main characteristics of LightGBM:

S1. LightGBM grows trees with the leaf-wise strategy: among all current leaves, the leaf with the largest split gain is selected and split, and this process is repeated. Compared with level-wise growth, leaf-wise growth reduces more error and achieves better accuracy for the same number of splits. However, when the number of samples is not large enough, leaf-wise growth may cause overfitting. LightGBM can therefore limit the depth of the tree with the parameter max_depth to reduce the likelihood of overfitting.

S2. When feature parallelism is used to reduce the feature dimension and speed up computation, LightGBM no longer partitions the sample data vertically; instead, every worker holds the full data, so each worker knows how to partition the data. The main flow of feature parallelism in LightGBM is as follows:

(1) Each worker finds the best split point {feature, threshold} on its local feature set;

(2) The local best splits are communicated among the workers and merged to obtain the global best split;

(3) The best split is performed.
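The three feature-parallel steps can be simulated in a single process. This is a sketch under simplifying assumptions: the "workers" are just disjoint lists of feature indices, the gain is the squared-loss style G_L²/n_L + G_R²/n_R − G²/n, and the names `local_best_split` and `feature_parallel_split` are hypothetical.

```python
import numpy as np


def local_best_split(X, grad, feature_ids):
    """Step (1): best (gain, feature, threshold) over one worker's features."""
    best = (0.0, -1, None)
    total, n = grad.sum(), len(grad)
    for f in feature_ids:
        # Dropping the largest value keeps both children non-empty.
        for t in np.unique(X[:, f])[:-1]:
            left = X[:, f] <= t
            n_l = int(left.sum())
            g_l = grad[left].sum()
            g_r = total - g_l
            gain = g_l**2 / n_l + g_r**2 / (n - n_l) - total**2 / n
            if gain > best[0]:
                best = (float(gain), int(f), float(t))
    return best


def feature_parallel_split(X, grad, n_workers):
    """Steps (1)-(2): every worker sees all rows but only its own features."""
    feature_sets = np.array_split(np.arange(X.shape[1]), n_workers)
    local = [local_best_split(X, grad, fs) for fs in feature_sets]
    return max(local, key=lambda s: s[0])      # merge local bests


X_demo = np.array([
    [1.0, 5.0, 0.0, 3.0],
    [3.0, 4.0, 0.0, 1.0],
    [2.0, 6.0, 1.0, 2.0],
    [4.0, 3.0, 1.0, 0.0],
])
grad_demo = np.array([-1.0, -1.0, 1.0, 1.0])

gain, feature, threshold = feature_parallel_split(X_demo, grad_demo, n_workers=2)
# Step (3): because each worker holds all rows, it can apply the split locally.
left_mask = X_demo[:, feature] <= threshold
print(gain, feature, threshold, left_mask)
```

Only the small (gain, feature, threshold) tuples need to be exchanged between workers, which is why holding the full data on every worker keeps communication low.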

S3. LightGBM reduces the cost of data parallelism by reducing its communication overhead: it uses the Reduce Scatter approach to merge the histograms of different, non-overlapping features across the workers. Each worker then finds the local best split from its locally merged histograms, and the results are synchronized to obtain the global best split.
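The Reduce Scatter step can be mimicked in a single process to show what each worker ends up holding. The sketch assumes every worker has built a local histogram array of shape (features, bins) over its own rows; the function name `reduce_scatter` and the random toy histograms are hypothetical, and a real implementation would use a collective communication primitive instead of a Python loop.

```python
import numpy as np


def reduce_scatter(local_hists):
    """Sum the workers' histograms, leaving worker w with feature block w only.

    local_hists[w] has shape (n_features, n_bins) and covers worker w's rows.
    Returns (merged, blocks): merged[w] is the summed histogram for the
    features listed in blocks[w].
    """
    n_workers = len(local_hists)
    n_features = local_hists[0].shape[0]
    blocks = np.array_split(np.arange(n_features), n_workers)
    merged = [sum(h[block] for h in local_hists) for block in blocks]
    return merged, blocks


rng = np.random.default_rng(0)
local_hists = [rng.integers(0, 5, size=(4, 3)) for _ in range(2)]
merged, blocks = reduce_scatter(local_hists)
print([m.shape for m in merged], [b.tolist() for b in blocks])
```

Each worker receives the merged histograms of only its own block of features, so the data exchanged per worker shrinks compared with merging every global histogram everywhere.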

Referring to FIG. 4, a parameter tuning method for cases such as overfitting of LightGBM during training is shown. Problems arise during the training of a machine learning model, and the model's parameters are tuned to obtain the best parameters and the best effect. The tuning methods and steps for specific problems are as follows:

S1. To obtain a faster training speed, the following parameters are adjusted:

(1) Use bagging by setting the bagging_fraction and bagging_freq parameters;

(2) Use feature sub-sampling by setting the feature_fraction parameter;

(3) Decrease max_bin;

(4) Use save_binary to speed up data loading in later learning.

S2. To obtain better accuracy, the following parameters are adjusted:

(1) Use larger max_bin, num_iterations, and num_leaves;

(2) Use a smaller learning rate.

S3. When overfitting occurs, it is handled as follows:

(1) Use smaller max_bin and num_leaves;

(2) Use bagging by setting bagging_fraction and bagging_freq;

(3) Use feature sub-sampling by setting feature_fraction;

(4) Use more training data;

(6) Apply regularization with lambda_l1, lambda_l2, and min_split_gain;

(7) Set max_depth to avoid growing an overly deep tree.
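The tuning steps above can be expressed as a baseline parameter dictionary plus a small grid over the three parameters named in step S7. This is only a sketch: every numeric value, including the number of classes, is a hypothetical starting point rather than a setting reported in the source, and the helper `candidate_params` is illustrative. The keys are LightGBM parameter names (`min_gain_to_split` is the primary name for `min_split_gain`).

```python
from itertools import product

# Baseline configuration; all numeric values are assumptions for illustration.
base_params = {
    "objective": "multiclass",
    "num_class": 4,               # assumed number of circRNA function classes
    "learning_rate": 0.05,        # smaller rate for better accuracy
    "num_leaves": 31,
    "max_bin": 255,               # decrease for speed or against overfitting
    "feature_fraction": 0.8,      # feature sub-sampling
    "bagging_freq": 1,            # bagging is active only when this is > 0
    "lambda_l1": 0.1,             # L1 regularization
    "lambda_l2": 0.1,             # L2 regularization
    "min_gain_to_split": 0.0,
}

# Grid over the parameters named in step S7.
search_space = {
    "max_depth": [4, 6, 8],
    "min_data_in_leaf": [20, 50],
    "bagging_fraction": [0.7, 0.9],
}


def candidate_params(base, space):
    """Yield one full parameter dict per point of the grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        params = dict(base, **dict(zip(keys, values)))
        # Keep num_leaves within what a tree of this depth can hold.
        params["num_leaves"] = min(params["num_leaves"], 2 ** params["max_depth"])
        yield params


candidates = list(candidate_params(base_params, search_space))
print(len(candidates))
print(candidates[0])
```

Each candidate dictionary could then be passed to a LightGBM training call and scored on held-out data, keeping the combination with the best validation result; that evaluation loop is omitted here.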

Referring to FIG. 5, the confusion matrix obtained with the optimal feature combination and the optimal parameters when classifying with LightGBM is shown.

Claims

1. A circRNA function prediction method based on a cascade decision system, characterized in that:

the method comprises a cascade decision system and the LightGBM algorithm and performs classification prediction of known circRNA functions; a multi-classification model of the LightGBM algorithm is trained on a large data sample that has been processed by a multi-feature fusion method and verified by original experiments, and the resulting model can readily be used to predict the function of new circRNAs.

2. The cascade decision system-based CircRNA function prediction method of claim 1, comprising the following steps: