CN109409522B

CN109409522B - Biological network reasoning algorithm based on ensemble learning

Info

Publication number: CN109409522B
Application number: CN201810998945.5A
Authority: CN
Inventors: 张建明; 李文超; 张蔚; 张峰; 沈新新
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2022-04-12
Anticipated expiration: 2038-08-29
Also published as: CN109409522A

Abstract

The invention discloses a biological network reasoning algorithm based on ensemble learning. The algorithm comprises: for a biological network reasoning problem involving n genes, taking a time-series microarray expression data set of n genes × m time points as the training sample and computing the training set corresponding to each gene; computing the feature importance score of each regulator and target gene pair with a random forest algorithm and with a gradient boosting tree algorithm, yielding a first and a second feature importance scoring table, respectively; computing the accuracy ACC₁ of the first feature importance scoring table and the accuracy ACC₂ of the second feature importance scoring table; fusing the first and second feature importance scoring tables by weighting according to the E-alpha weighting rule to obtain a final total scoring table; and obtaining the final biological network structure from the total scoring table. The invention effectively improves the accuracy and stability of biological network reasoning.

Description

Biological network reasoning algorithm based on ensemble learning

Technical Field

The invention relates to the technical field of biological network reasoning, in particular to a biological network reasoning algorithm based on ensemble learning.

Background

In systems biology, biological network modeling has long been a subject of attention. It covers gene networks, protein interaction networks, metabolic networks, signal transduction networks and the like; given sufficient supporting information, a mathematical model can be established to capture part of the dynamic characteristics of such a network. Structural inference for a gene network not only helps in understanding the mechanism of action of genes and the flow of genetic information, but also helps in predicting the response of a specific gene network to external interventions such as drugs, thereby promoting the research and development of new drugs. In addition, in the emerging fields of synthetic biology and related gene therapy, the design and construction of synthetic gene networks also require accurate mathematical models, so computational modeling is becoming increasingly important, and structural inference is a key step of network modeling. However, biological network reasoning is difficult: a gene network may have thousands of nodes, its topology is complex, and its model parameters are numerous. With the development of omics data acquisition technology, particularly high-throughput technology, the shortage of information that restricts network reasoning has gradually eased and the problem of information sources has been solved to a certain extent, but the accuracy and stability of network reasoning remain insufficient.

Disclosure of Invention

To solve the above technical problems, the invention provides a biological network reasoning algorithm based on ensemble learning. When a biological network is inferred, an expression data set is used as the information source, feature importance is scored separately with a random forest algorithm and a gradient boosting tree algorithm, and the final scoring table and network structure are then obtained with the E-alpha weighted fusion strategy, a fusion strategy based on known (prior) information, thereby improving the accuracy and stability of network reasoning.

In order to solve the problems, the invention adopts the following technical scheme:

The invention relates to a biological network reasoning algorithm based on ensemble learning, which comprises the following steps:

S1: for a biological network reasoning problem involving n genes, taking a time-series microarray expression data set of n genes × m time points as the training sample, and computing the training set corresponding to each gene;

the training set corresponding to the j-th gene is computed as follows:

the microarray expression data set is partitioned with the j-th gene as the target gene: the expression time-series values of the j-th gene are used as the output, and the expression time-series values of all other genes are used as the input of the regulatory factors; an input sample and an output sample are constructed accordingly and together form the training set, whose entries are the expression values of the j-th gene in the k-th experiment, where 1 ≤ j ≤ n;

S2: computing the feature importance score of each regulator and target gene pair with a random forest algorithm, producing a corresponding first feature importance scoring table, and sorting the feature importance scores in descending order, so that directed regulatory edges with higher scores rank higher in the first feature importance scoring table;

computing the feature importance score of each regulator and target gene pair with a gradient boosting tree algorithm, producing a corresponding second feature importance scoring table, and sorting the feature importance scores in descending order, so that directed regulatory edges with higher scores rank higher in the second feature importance scoring table;

S3: based on the first and second feature importance scoring tables, selecting a threshold according to the complexity of the biological network, retaining the directed regulatory edges whose scores exceed the threshold, and reconstructing the biological network; taking the experimentally verified directed regulatory edges of the biological network as the gold standard, and computing the accuracy ACC₁ of the first feature importance scoring table and the accuracy ACC₂ of the second feature importance scoring table;

S4: from the accuracy ACC₁ of the first feature importance scoring table and the accuracy ACC₂ of the second feature importance scoring table, computing the weight ω₁ of the first feature importance scoring table and the weight ω₂ of the second feature importance scoring table with the E-alpha weighting rule;

S5: fusing the first and second feature importance scoring tables, weighted by ω₁ and ω₂ respectively, to obtain the final total scoring table;

S6: obtaining the final biological network structure from the total scoring table, and computing the accuracy of the inferred final biological network structure.
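Step S1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression data are held as a list of rows with one row per experiment and one column per gene, gene indices are 0-based, and the function names `build_training_set` and `build_all_training_sets` are hypothetical rather than taken from the patent.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # assumed layout: rows = experiments (m), columns = genes (n)


def build_training_set(expr: Matrix, j: int) -> Tuple[Matrix, List[float]]:
    """Split the expression matrix into inputs and output for target gene j."""
    # Inputs: expression values of every gene except j (the candidate regulators).
    inputs = [row[:j] + row[j + 1:] for row in expr]
    # Output: expression values of target gene j across all experiments.
    output = [row[j] for row in expr]
    return inputs, output


def build_all_training_sets(expr: Matrix) -> List[Tuple[Matrix, List[float]]]:
    """One (inputs, output) training set per gene, as in step S1."""
    n_genes = len(expr[0])
    return [build_training_set(expr, j) for j in range(n_genes)]


# Toy data: 4 experiments x 3 genes; the values are made up for illustration.
expr = [
    [0.1, 0.5, 0.9],
    [0.2, 0.4, 0.8],
    [0.3, 0.6, 0.7],
    [0.4, 0.3, 0.6],
]
inputs, output = build_training_set(expr, 1)
training_sets = build_all_training_sets(expr)
```

Each of the n training sets can then be passed to a regressor that predicts the target gene from the remaining genes.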

The algorithm combines a directed graph with tree regression to infer a gene network. When a directed graph is used to describe the gene network, nodes represent genes or transcription factors, and the directed regulatory edges between nodes represent regulatory relationships.

Considering the topological characteristics of gene network structure, such as sparsity and high cohesion, and under the assumption of a linear model, the structural inference problem of a gene network can be solved by regression methods from machine learning, typically linear regression and tree regression. Tree regression is effective because it scores the feature importance of regulator and target gene pairs. Among tree regression methods, the random forest algorithm, which belongs to the Bagging family, attempts to select an optimal feature subset and has been applied to many gene regulatory network modeling problems. For example, a random-forest-based network reasoning method decomposes a network of p genes into p independent subproblems, in each of which the expression value of a target gene is predicted from the expression data of the regulatory factors. Although random forests work well, tree regression itself has limitations, such as choosing the tree depth and obtaining an optimal tree, and small parameter perturbations can strongly affect its results. In addition, finding the optimal tree is an NP-hard problem, so global optimality is not guaranteed; a heuristic search algorithm can find a locally optimal solution, but the computation is relatively complex and convergence is relatively slow.

Tree regression methods fall into two categories according to their principle, Boosting and Bagging: in feature selection, Boosting methods mainly focus on reducing model bias, whereas Bagging methods reduce model variance. Because the two have different emphases, their inference results differ to some extent. Many network reasoning methods are available, and the network structures they infer by feature selection from the same expression data set differ; this diversity of results makes it possible to improve the accuracy of the final network structure. The motivation for model fusion is to extract the correctly inferred or highly credible results from the various reasoning methods and then construct a highly credible network topology. The results obtained by different methods are assumed to carry complementary information, that is, each method holds some unique information, so the algorithm provides the E-alpha fusion rule to extract relatively accurate predicted edges from each method and construct a more accurate and reliable model. Model fusion strategies can be divided into voting, averaging, and learning methods. The main difference between averaging and learning methods lies in how the model weights are chosen: taking into account the differing reliability of the individual models, a learning method uses a new learner and given labels to select the weight of each model. The difficulty is that the network reasoning or reconstruction problem lacks such labels and is generally regarded as an unsupervised problem. For the structural reasoning problem, however, except when the regulatory network is completely unknown, a considerable part of the network topology is already known, that is, the existence of some regulatory edges has been verified by experiments such as gene knockout. The experimentally verified directed regulatory edges can serve as prior information to guide the weight assignment in multi-model fusion.

Preferably, the accuracy ACC₁ of the first feature importance scoring table is computed in step S3 as:

ACC₁ = β₁ × ar₁,

where β₁ is the accuracy correction factor, ar₁ is the accuracy of the regulatory edges in the first feature importance scoring table that carry known network information, num(inference) denotes the number of inferred directed regulatory edges, and num(exp) denotes the number of directed regulatory edges confirmed by experimental verification;

when the rank of a directed regulatory edge in the sorted edge set E₁ is consistent with its rank in the known-network sorted set S₁, the event E₁ == S₁ holds and I(E₁ == S₁) = 1; otherwise I(E₁ == S₁) = 0;

the accuracy ACC₂ of the second feature importance scoring table is computed in step S3 as:

ACC₂ = β₂ × ar₂,

where β₂ is the accuracy correction factor, ar₂ is the accuracy of the regulatory edges in the second feature importance scoring table that carry known network information, num(inference) denotes the number of inferred directed regulatory edges, and num(exp) denotes the number of directed regulatory edges confirmed by experimental verification;

when the rank of a directed regulatory edge in the sorted edge set E₂ is consistent with its rank in the known-network sorted set S₂, the event E₂ == S₂ holds and I(E₂ == S₂) = 1; otherwise I(E₂ == S₂) = 0.
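The thresholding in step S3 and the product ACC = β × ar can be illustrated as follows. The expressions for β and ar are not reproduced in this text, so the sketch makes two loudly labeled assumptions: ar is approximated by the share of experimentally verified edges that survive the threshold, and β is simply supplied by the caller. All function names are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def retain_edges(scores: Dict[Edge, float], threshold: float) -> Set[Edge]:
    """Keep the directed edges whose importance score exceeds the threshold."""
    return {edge for edge, score in scores.items() if score > threshold}


def known_edge_accuracy(retained: Set[Edge], verified: Set[Edge]) -> float:
    """ASSUMED stand-in for ar: fraction of verified edges that were retained."""
    if not verified:
        return 0.0
    return len(retained & verified) / len(verified)


def table_accuracy(scores: Dict[Edge, float], verified: Set[Edge],
                   threshold: float, beta: float = 1.0) -> float:
    """ACC = beta * ar; beta is a caller-supplied correction factor (assumption)."""
    ar = known_edge_accuracy(retain_edges(scores, threshold), verified)
    return beta * ar


# Toy scoring table and verified edges; all values are made up.
scores = {("G1", "G2"): 0.9, ("G2", "G3"): 0.6, ("G3", "G1"): 0.2, ("G1", "G3"): 0.1}
verified = {("G1", "G2"), ("G3", "G1")}
acc = table_accuracy(scores, verified, threshold=0.5)
```

With the threshold at 0.5 only one of the two verified edges is retained, so the assumed ar is 0.5; lowering the threshold retains both.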

Preferably, in step S4, the weight ω₁ of the first feature importance scoring table and the weight ω₂ of the second feature importance scoring table are computed with the E-alpha weighting rule, where α is a dynamic adjustment factor.
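The weight formula itself is not reproduced in this text. The embodiment later describes the rule as multiplying the score by α, applying an exponential, and normalizing, which suggests normalized exponential weights ω_i = exp(α·ACC_i) / Σ_j exp(α·ACC_j). The sketch below implements that reading as an assumption; the function name `e_alpha_weights` is hypothetical.

```python
import math
from typing import List


def e_alpha_weights(accs: List[float], alpha: float) -> List[float]:
    """Assumed E-alpha rule: w_i = exp(alpha * ACC_i) / sum_j exp(alpha * ACC_j)."""
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow
    # when alpha is large.
    top = max(accs)
    exps = [math.exp(alpha * (a - top)) for a in accs]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up accuracies for the two scoring tables.
w_mild = e_alpha_weights([0.6, 0.7], alpha=1.0)    # close to an even split
w_sharp = e_alpha_weights([0.6, 0.7], alpha=50.0)  # leans toward the better table
```

A small α keeps the two tables near equal weight, while a large α pushes almost all weight to the table with the higher accuracy.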

Preferably, step S5 comprises the following steps:

S51: normalizing the feature importance scores in the first feature importance scoring table, and normalizing the feature importance scores in the second feature importance scoring table;

S52: multiplying each normalized feature importance score in the first feature importance scoring table by the weight ω₁ to obtain a new first feature importance scoring table, and multiplying each normalized feature importance score in the second feature importance scoring table by the weight ω₂ to obtain a new second feature importance scoring table;

S53: combining the data in the new first and second feature importance scoring tables to obtain the final total scoring table, and sorting the total scoring table in descending order of feature importance score.
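Steps S51 to S53 can be sketched as below. Two choices are assumptions, since the text does not fix them: the normalization is min-max scaling to [0, 1], and "combining" the two weighted tables is read as summing the weighted scores of the same edge. The names `normalize` and `fuse` are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def normalize(scores: Dict[Edge, float]) -> Dict[Edge, float]:
    """S51 (assumed form): min-max scale the scores of one table to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {edge: 0.0 for edge in scores}
    return {edge: (s - lo) / span for edge, s in scores.items()}


def fuse(t1: Dict[Edge, float], t2: Dict[Edge, float],
         w1: float, w2: float) -> List[Tuple[Edge, float]]:
    """S52 and S53: weight each normalized table, add per edge, sort descending."""
    n1, n2 = normalize(t1), normalize(t2)
    edges = set(n1) | set(n2)
    total = {e: w1 * n1.get(e, 0.0) + w2 * n2.get(e, 0.0) for e in edges}
    return sorted(total.items(), key=lambda item: item[1], reverse=True)


# Made-up scores on different scales, as two learners might produce.
rf_table = {("A", "B"): 10.0, ("B", "C"): 5.0, ("C", "A"): 0.0}
gbdt_table = {("A", "B"): 0.2, ("B", "C"): 0.4, ("C", "A"): 0.0}
fused = fuse(rf_table, gbdt_table, w1=0.6, w2=0.4)
```

Normalizing first matters because the two learners may report importances on different scales; without it the weights ω₁ and ω₂ would not be comparable.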

Preferably, step S6 comprises the following step:

obtaining the final biological network structure from the total scoring table, plotting the ROC curve, and expressing the accuracy of the inferred final biological network structure by the area under the ROC curve (AUC);

where i denotes the index of a directed regulatory edge, L denotes the total number of directed regulatory edges in the gold-standard biological network, FP denotes the number of false positive predictions, TP denotes the number of true positive predictions, TN denotes the number of true negatives, and FN denotes the number of false negatives; a true positive (TP) event means that an inferred directed regulatory edge really exists in the gold-standard biological network, and a false negative (FN) event means that a regulatory relationship absent from the inferred network does exist in the gold-standard biological network.
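The AUC expression is not reproduced in this text, so the following is a generic sketch of the area under the ROC curve for a ranked edge list rather than the patent's own formula. It assumes the ranked list covers every candidate edge exactly once and contains no tied scores; `roc_auc` is a hypothetical name.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def roc_auc(ranked: List[Edge], gold: Set[Edge]) -> float:
    """Area under the ROC curve for edges ranked from most to least confident."""
    positives = sum(1 for edge in ranked if edge in gold)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one gold edge and one non-gold edge")
    tp = 0
    auc = 0.0
    for edge in ranked:
        if edge in gold:
            tp += 1  # true positive: the curve moves up
        else:
            # False positive: the curve moves right by 1/negatives, adding a
            # strip whose height is the current true positive rate.
            auc += (tp / positives) * (1.0 / negatives)
    return auc


# Toy ranking: two of the four candidate edges are in the gold standard.
ranked = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
gold = {("A", "B"), ("C", "A")}
area = roc_auc(ranked, gold)
```

An AUC of 1.0 means every gold-standard edge is ranked above every non-edge, and 0.5 corresponds to a random ranking.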

The beneficial effects of the invention are as follows: when a biological network is inferred, an expression data set is used as the information source, feature importance is scored separately with a random forest algorithm and a gradient boosting tree algorithm, and the final scoring table and network structure are then obtained with the E-alpha weighted fusion strategy, a fusion strategy based on known (prior) information, thereby improving the accuracy and stability of network inference.

Drawings

FIG. 1 is a flow diagram of the bio-network inference of the present invention;

FIG. 2 is an ROC curve of the structural inference of the DREAM4 data set size10-network1 provided by the embodiment;

FIG. 3 is an ROC curve of the data set size100-network2 structural inference of DREAM4 provided by the embodiment;

FIG. 4 is a ROC curve for structural reasoning of the net2 of the DREAM5 data set provided by the embodiment.

Detailed Description

The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.

Example: the biological network reasoning algorithm based on ensemble learning of this embodiment, as shown in FIG. 1, comprises the following steps:

S1: for a biological network reasoning problem involving n genes, taking a time-series microarray expression data set of n genes × m time points, stored as a two-dimensional matrix, as the training sample, and computing the training set corresponding to each gene;

the microarray expression data set is partitioned with the j-th gene as the target gene: the expression time-series values of the j-th gene are used as the output, and the expression time-series values of all other genes are used as the input of the regulatory factors; an input sample and an output sample are constructed accordingly and together form the training set, whose entries are the expression values of the j-th gene in the k-th experiment, where 1 ≤ j ≤ n and n is an integer;

each resulting scoring table has four columns: the regulator name, the target gene name, the feature score, and the label; a feature importance score above the set threshold indicates that the corresponding directed regulatory edge exists;

S3: based on the first and second feature importance scoring tables, selecting a threshold according to the complexity of the biological network, retaining the directed regulatory edges whose scores exceed the threshold, and reconstructing the biological network; taking the experimentally verified directed regulatory edges of the biological network as the gold standard, and computing the accuracy ACC₁ of the first feature importance scoring table and the accuracy ACC₂ of the second feature importance scoring table;

The gradient boosting tree algorithm uses classification and regression trees (CART) as base learners. In this embodiment, the parameters to be set for it include the number of base learners (n_estimators), the learning rate (learning_rate), the loss function (loss_function), the maximum depth (max_depth), and the subsampling rate (subsample). Once the feature selection algorithm has produced a scoring table, setting a suitable threshold essentially determines the network structure, so in this embodiment the feature selection algorithm is equivalent to the network reasoning algorithm.
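The parameter names listed above match constructor arguments of scikit-learn's `GradientBoostingRegressor`, although the text does not say which library was used. The sketch below therefore assumes scikit-learn and NumPy, uses synthetic data in which regulator G1 drives the target G5, leaves the loss function at the library default, and picks parameter values purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic training set for one target gene: 60 experiments, 4 regulators.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=60)  # target depends on regulator G1

# Bagging-style learner (first scoring table).
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Boosting-style learner (second scoring table); loss left at its default.
gbdt = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=0,
).fit(X, y)

regulators = ["G1", "G2", "G3", "G4"]
target = "G5"


def score_table(importances):
    """Rows of (regulator, target, score), sorted by descending score."""
    rows = [(reg, target, float(s)) for reg, s in zip(regulators, importances)]
    return sorted(rows, key=lambda row: row[2], reverse=True)


rf_table = score_table(rf.feature_importances_)
gbdt_table = score_table(gbdt.feature_importances_)
```

Repeating this for every target gene and concatenating the rows would give one scoring table per learner over all regulator and target gene pairs.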

The accuracy ACC₁ of the first feature importance scoring table is computed in step S3 as:

ACC₁ = β₁ × ar₁,

where β₁ is the accuracy correction factor, ar₁ is the accuracy of the directed regulatory edges in the first feature importance scoring table that carry known network information, num(inference) denotes the number of inferred directed regulatory edges, and num(exp) denotes the number of directed regulatory edges confirmed by experimental verification.

The accuracy ACC₂ of the second feature importance scoring table is computed in step S3 as:

ACC₂ = β₂ × ar₂,

where β₂ is the accuracy correction factor, ar₂ is the accuracy of the directed regulatory edges in the second feature importance scoring table that carry known network information, num(inference) denotes the number of inferred directed regulatory edges, and num(exp) denotes the number of directed regulatory edges confirmed by experimental verification.

In step S4, the weight ω₁ of the first feature importance scoring table and the weight ω₂ of the second feature importance scoring table are computed with the E-alpha weighting rule, where α is a dynamic adjustment factor. The optimal α is found by an optimization method so that the accuracy of model training is highest.
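The text does not name the optimization method used to find α. One simple possibility is a grid search that keeps the α giving the highest AUC against the known edges, sketched below. The assumptions are: normalized exponential weights, score tables that are already normalized and share the same edges, a rank-based AUC, and hypothetical function names and toy numbers throughout.

```python
import math


def weights(accs, alpha):
    """Assumed E-alpha weights: normalized exp(alpha * ACC)."""
    top = max(accs)
    exps = [math.exp(alpha * (a - top)) for a in accs]
    total = sum(exps)
    return [e / total for e in exps]


def auc(scores, gold):
    """Rank-based AUC: chance that a gold edge outscores a non-gold edge."""
    pos = [s for edge, s in scores.items() if edge in gold]
    neg = [s for edge, s in scores.items() if edge not in gold]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def search_alpha(t1, t2, acc1, acc2, gold, grid):
    """Return (alpha, auc) for the first grid value with the highest AUC."""
    best = None
    for alpha in grid:
        w1, w2 = weights([acc1, acc2], alpha)
        fused = {edge: w1 * t1[edge] + w2 * t2[edge] for edge in t1}
        value = auc(fused, gold)
        if best is None or value > best[1]:
            best = (alpha, value)
    return best


# Made-up tables: t2 ranks the gold edges e1 and e2 better than t1 does.
t1 = {"e1": 0.9, "e2": 0.1, "e3": 0.8, "e4": 0.0}
t2 = {"e1": 0.8, "e2": 0.7, "e3": 0.3, "e4": 0.1}
gold = {"e1", "e2"}
best = search_alpha(t1, t2, acc1=0.5, acc2=1.0, gold=gold,
                    grid=[0.01, 1.0, 10.0, 100.0])
```

In this toy case small values of α leave the fused ranking with one misordered pair, while α = 10 shifts enough weight to the more accurate table to rank both gold edges on top.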

Step S5 includes the following steps:

The final model structure can be determined from the final total scoring table, and the accuracy of the inference result can be evaluated with the precision-recall (PR) curve and the receiver operating characteristic (ROC) curve. The PR curve is formed from the precision P and the recall R. In a classification problem, precision is the proportion of true positives among the examples predicted as positive, whereas recall is the proportion of actual positives that are correctly identified; the two are a pair of conflicting measures.
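As an illustration of the two measures, the sketch below computes precision and recall when the top k edges of a ranked list are predicted to exist, and sweeps k to obtain the points of a PR curve. It assumes each edge appears once in the ranking; the function names are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def precision_recall(ranked: List[Edge], gold: Set[Edge], k: int) -> Tuple[float, float]:
    """Precision and recall when the top-k ranked edges are predicted present."""
    predicted = set(ranked[:k])
    tp = len(predicted & gold)                 # predicted edges that truly exist
    precision = tp / k if k else 0.0           # TP / (TP + FP)
    recall = tp / len(gold) if gold else 0.0   # TP / (TP + FN)
    return precision, recall


def pr_curve(ranked: List[Edge], gold: Set[Edge]) -> List[Tuple[float, float]]:
    """One (precision, recall) point for every cut-off k = 1..len(ranked)."""
    return [precision_recall(ranked, gold, k) for k in range(1, len(ranked) + 1)]


# Toy ranking: two of the four candidate edges are in the gold standard.
ranked = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
gold = {("A", "B"), ("C", "A")}
curve = pr_curve(ranked, gold)
```

Lowering the cut-off threshold (larger k) can only raise recall, but it usually lowers precision, which is why the two measures conflict.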

Step S6 includes the following steps:

where i denotes the index of a directed regulatory edge, a directed regulatory edge being a directed connection between genes; L denotes the total number of directed regulatory edges in the gold-standard biological network; FP denotes the number of false positive predictions, TP the number of true positive predictions, TN the number of true negatives, and FN the number of false negatives. A true positive (TP) event means that an inferred directed regulatory edge really exists in the gold-standard biological network, and a false negative (FN) event means that a regulatory relationship absent from the inferred network does exist in the gold-standard biological network.

The gold standard is the true structure of the biological network. Because the training set in a standard test case is generated by dynamic simulation from a known network structure, the network reconstructed by the ensemble learning algorithm can be compared with the true network to determine which directed regulatory edges are correctly predicted and which are incorrectly predicted.

False positives (FP) and true positives (TP) are concepts from classification; the higher the true positive rate, the more accurate the prediction of directed regulatory edges. Reconstructing a biological network amounts to deciding whether a link exists between each pair of nodes; since the label takes only two values, label = 1 or label = 0, the task can be regarded as a binary classification problem.

In this embodiment, microarray time-series expression data provided by the DREAM Challenge platform are used for network inference. The expression data have the format n genes × m observation points. The inferred network structure is compared with the gold-standard network, the performance of the algorithm is evaluated with ROC and PR curves, and the method with the larger area under the curve is considered more effective.

The virtual networks come from the DREAM4 and DREAM5 platforms. The networks in DREAM4 come in two sizes, size10 and size100, while the number of genes involved in DREAM5 exceeds 1500. The gene networks in DREAM4 are sub-networks extracted from the gene regulatory networks of model organisms such as Escherichia coli and yeast, and the extracted sub-networks exhibit topological properties such as sparsity and clustering. Given a known network structure, the GeneNetWeaver software produces the corresponding time-series expression data by mechanistic simulation. The simulated expression data are fed to the ensemble learning method as the training set to infer the directed edges between nodes, the resulting network structure is compared with the true network structure, and the relevant performance indicators are calculated.

Table 1 reports network structure inference tests on the five size10 networks of the DREAM4 data set. The regulatory relationships of the gene network are reconstructed from time-series expression data, and the accuracy of the network topology obtained with the gradient boosting tree (GBDT), random forest (RF), average weighting (MEAN), and the E-alpha weighting rule (E-alpha) is compared. Network inference accuracy is measured quantitatively by the area under the ROC curve (AUC).

Size10        Net1    Net2    Net3    Net4    Net5
Edges         15      16      15      13      12
GBDT-AUC      0.669   0.707   0.598   0.392   0.652
RF-AUC        0.628   0.722   0.627   0.263   0.705
MEAN-AUC      0.646   0.723   0.609   0.363   0.684
E-alpha-AUC   0.675   0.726   0.627   0.395   0.705

TABLE 1

Table 1 compares the AUC values obtained for the DREAM4 size10 gene regulatory networks. The AUC results show that E-alpha matches or exceeds both the single methods and the average fusion strategy on all five networks. The reason is that the E-alpha weighting term multiplies the score by the α factor and then applies an exponential, and α is chosen during optimization as the value giving the largest AUC. If one method performs better, the fused result leans more toward that method. The ROC curves for average weighting (MEAN) and the E-alpha weighting rule (E-alpha) are shown in FIG. 2.

Table 2 reports tests on the five size100 networks of the DREAM4 data set. The regulatory relationships of the gene network are reconstructed from time-series expression data, and the accuracy of the network topology obtained with the gradient boosting tree (GBDT), random forest (RF), average weighting (MEAN), linear regression (TIGRESS), and the E-alpha weighting rule (E-alpha) is compared. The ROC curves for average weighting (MEAN) and the E-alpha weighting rule (E-alpha) are shown in FIG. 3.

Size100       Net1    Net2    Net3    Net4    Net5
Edges         176     249     195     211     193
GBDT-AUC      0.754   0.698   0.762   0.823   0.780
RF-AUC        0.759   0.733   0.767   0.798   0.795
MEAN-AUC      0.760   0.723   0.770   0.810   0.795
TIGRESS       0.750   0.700   0.760   0.770   0.750
E-alpha-AUC   0.770   0.737   0.771   0.822   0.797

TABLE 2

Table 3 reports network inference tests on the net1 and net2 data sets of DREAM5. The regulatory relationships of the gene network are reconstructed from time-series expression data, and the accuracy of the network topology obtained with the gradient boosting tree (GBDT), random forest (RF), average weighting (MEAN), linear regression (TIGRESS), and the E-alpha weighting rule (E-alpha) is compared.

TABLE 3

Net2 of the DREAM5 data set is used for the network inference test; the ROC curves for average weighting (MEAN) and the E-alpha weighting rule (E-alpha) are shown in FIG. 4.

The results in Tables 1, 2 and 3 show that the accuracy of the E-alpha strategy is slightly higher than that of the average fusion strategy. Although the E-alpha weighting rule is slightly less accurate than the gradient boosting tree on size100-net4 (0.822 versus 0.823), it is overall more accurate than any single structure inference algorithm and the average weighting method.

Further analysis of the E-alpha weighting rule shows that the difference between GBDT-AUC and RF-AUC influences the accuracy of the final model after weighted combination. This is because the E-alpha algorithm multiplies the score by the coefficient α and then applies an exponential. The larger α is, the larger the differences between the arguments of the exponential, and the larger the differences between the values after the exp operation. Taking DREAM5 net2 as an example, the α found by the search is 0.01, so the values of exp(α × score) are nearly equal and the normalized weights of the two learners are both approximately 0.5. For DREAM5 net1, the best α found by the optimization is 300, and after exp(α × score) is computed the algorithm leans strongly toward the higher-scoring learner. The E-alpha weighting rule can therefore strike a balance between picking the best model and taking the mean, distributing the model weights reasonably so that the final network inference accuracy is highest.
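The effect of α described above can be checked with a short worked calculation. It assumes the normalized exponential form of the weights and two hypothetical scores, s₁ = 0.70 and s₂ = 0.75, chosen only for illustration; they are not values from the patent.

```latex
\omega_2 = \frac{e^{\alpha s_2}}{e^{\alpha s_1} + e^{\alpha s_2}}
         = \frac{1}{1 + e^{-\alpha (s_2 - s_1)}},
\qquad s_2 - s_1 = 0.05

\alpha = 0.01:\quad \omega_2 = \frac{1}{1 + e^{-0.0005}} \approx 0.500125,
\qquad \omega_1 = 1 - \omega_2 \approx 0.499875

\alpha = 300:\quad \omega_2 = \frac{1}{1 + e^{-15}} \approx 0.9999997,
\qquad \omega_1 = 1 - \omega_2 \approx 3 \times 10^{-7}
```

With α = 0.01 the rule behaves like plain averaging, and with α = 300 it effectively selects the higher-scoring learner, matching the two DREAM5 cases discussed above.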

Claims

1. A biological network reasoning algorithm based on ensemble learning is characterized by comprising the following steps:

And outputting the samples

And obtain a training set

Wherein

S4: from the accuracy ACC₁ of the first feature importance scoring table and the accuracy ACC₂ of the second feature importance scoring table, computing the weight ω₁ of the first feature importance scoring table and the weight ω₂ of the second feature importance scoring table with the E-alpha weighting rule, wherein α is a dynamic adjustment factor;

2. The ensemble learning-based bio-network inference algorithm according to claim 1, wherein the accuracy ACC₁ of the first feature importance scoring table is computed in step S3 as:

ACC₁ = β₁ × ar₁,

wherein β₁ is the accuracy correction factor, ar₁ is the accuracy of the regulatory edges in the first feature importance scoring table that carry known network information, num(inference) denotes the number of inferred directed regulatory edges, and num(exp) denotes the number of directed regulatory edges confirmed by experimental verification;

ACC₂ = β₂ × ar₂,

3. The ensemble learning-based bio-network inference algorithm according to claim 1 or 2, wherein the step S5 comprises the steps of:

4. The ensemble learning-based bio-network inference algorithm according to claim 1 or 2, wherein the step S6 comprises the steps of: