CN109409522B - Biological network reasoning algorithm based on ensemble learning - Google Patents

Info

Publication number
CN109409522B
Authority
CN
China
Prior art keywords
feature importance
scoring table
network
regulation
accuracy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810998945.5A
Other languages
Chinese (zh)
Other versions
CN109409522A (en
Inventor
张建明
李文超
张蔚
张峰
沈新新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201810998945.5A
Publication of CN109409522A
Application granted
Publication of CN109409522B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/002 Biomolecular computers, i.e. using biomolecules, proteins, cells

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a biological network reasoning algorithm based on ensemble learning. The algorithm comprises: for a biological network inference problem involving n genes, taking a microarray time-series expression data set of n genes × m observation points as the training sample and constructing the training set corresponding to each gene; computing the feature importance score of each regulatory factor and target gene pair with a random forest algorithm and with a gradient boosting tree algorithm, yielding a first feature importance scoring table and a second feature importance scoring table, respectively; computing the accuracy ACC_1 of the first feature importance scoring table and the accuracy ACC_2 of the second feature importance scoring table; fusing the first and second feature importance scoring tables by weighting under an E-alpha weighting rule to obtain a final total scoring table; and obtaining the final biological network structure from the total scoring table. The invention effectively improves the accuracy and stability of biological network inference.

Description

Biological network reasoning algorithm based on ensemble learning
Technical Field
The invention relates to the technical field of biological network reasoning, in particular to a biological network reasoning algorithm based on ensemble learning.
Background
In systems biology, biological network modeling has long attracted attention. It covers gene networks, protein interaction networks, metabolic networks, signal transduction networks and the like; given sufficient supporting information, a mathematical model can be built to capture part of a network's dynamic behavior. Structural inference for gene networks not only helps in understanding gene action mechanisms and the flow of genetic information, but also helps predict how a specific gene network responds to external influences such as drugs, thereby supporting the development of new drugs. In addition, in the emerging fields of synthetic biology and related gene therapy, designing and constructing synthetic gene networks also requires accurate mathematical models, so computational modeling is becoming increasingly important, and structural inference is a key step in network modeling. Biological network inference is nevertheless difficult: a gene network may have thousands of nodes, its topology is complex, and its model parameters are numerous. With the development of omics data acquisition technologies, particularly high-throughput technologies, the shortage of information that once constrained network inference has gradually eased and the problem of information sources has been solved to some extent, but the accuracy and stability of network inference remain insufficient.
Disclosure of Invention
To solve the above technical problem, the invention provides a biological network reasoning algorithm based on ensemble learning. When inferring a biological network, the algorithm uses an expression data set as the information source, scores feature importance separately with a random forest algorithm and a gradient boosting tree algorithm, and then obtains the final scoring table and network structure with the E-alpha weighted fusion strategy, a fusion strategy based on known (prior) information, thereby improving the accuracy and stability of network inference.
To solve the above problems, the invention adopts the following technical solution:
The biological network reasoning algorithm based on ensemble learning of the invention comprises the following steps:
S1: for a biological network inference problem involving n genes, taking a microarray time-series expression data set of n genes × m observation points as the training sample, and constructing the training set corresponding to each gene;
the training set corresponding to the jth gene is constructed as follows:
partitioning the microarray expression data set with the jth gene as the target gene, taking the expression time-series values of the jth gene as the output and the expression time-series values of all other genes, acting as regulatory factors, as the input, and constructing the input samples
Figure BDA0001782025790000021
and the output samples
Figure BDA0001782025790000022
to obtain the training set
Figure BDA0001782025790000023
where
Figure BDA0001782025790000024
denotes the expression value of the jth gene in the kth experiment, with 1 ≤ j ≤ n;
S2: computing the feature importance score of each regulatory factor and target gene pair with the random forest algorithm to produce the first feature importance scoring table, sorted in descending order of score, so that directed regulatory edges with higher scores rank higher in the first feature importance scoring table;
computing the feature importance score of each regulatory factor and target gene pair with the gradient boosting tree algorithm to produce the second feature importance scoring table, sorted in descending order of score, so that directed regulatory edges with higher scores rank higher in the second feature importance scoring table;
S3: based on the first and second feature importance scoring tables, selecting a threshold according to the complexity of the biological network, retaining the directed regulatory edges whose scores exceed the threshold, and reconstructing the biological network; taking the experimentally verified directed regulatory edges of the biological network as the gold standard, computing the accuracy ACC_1 of the first feature importance scoring table and the accuracy ACC_2 of the second feature importance scoring table;
S4: from ACC_1 and ACC_2, computing the weight ω_1 of the first feature importance scoring table and the weight ω_2 of the second feature importance scoring table using the E-alpha weighting rule;
S5: using the weights ω_1 and ω_2, fusing the first and second feature importance scoring tables by weighting to obtain the final total scoring table;
S6: obtaining the final biological network structure from the total scoring table, and computing the accuracy of the inferred final biological network structure.
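Step S1 can be illustrated with a minimal Python sketch. It assumes the expression data are held as a list of n rows (one per gene) with m values each, and that inputs and outputs are paired at the same observation point; the patent's exact sample definitions are given in equation images not reproduced here, so this pairing and the name `build_training_sets` are illustrative assumptions only.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # expr[g][k]: expression of gene g at observation k


def build_training_sets(expr: Matrix) -> Dict[int, Tuple[Matrix, List[float]]]:
    """For each target gene j, return (inputs, outputs).

    inputs[k]  holds the expression of every other gene at observation k
               (the candidate regulatory factors).
    outputs[k] holds the expression of gene j at observation k.
    Assumption: same-observation pairing, with no time lag.
    """
    n = len(expr)
    m = len(expr[0])
    training_sets = {}
    for j in range(n):
        regulators = [g for g in range(n) if g != j]
        inputs = [[expr[g][k] for g in regulators] for k in range(m)]
        outputs = [expr[j][k] for k in range(m)]
        training_sets[j] = (inputs, outputs)
    return training_sets


# Toy data: 3 genes x 4 observation points (made-up values).
expr = [
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 0.9, 0.8, 0.7],
    [0.5, 0.5, 0.6, 0.6],
]
sets = build_training_sets(expr)
X1, y1 = sets[1]  # training set for the second gene as target
```

Each of the n training sets defines one independent regression subproblem, which is what lets the scoring step treat every target gene separately.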
The algorithm combines a directed graph with tree regression to infer a gene network. When a directed graph is used to describe the gene network, nodes represent genes or transcription factors, and directed regulatory edges between nodes represent regulatory relationships.
Given the topological characteristics of gene network structure, such as sparsity and high cohesion, and under the assumption of a linear model, the structural inference problem for a gene network can be solved by regression methods from machine learning, typically linear regression and tree regression. Tree regression is effective because it scores the feature importance of each regulatory factor and target gene pair. Among tree regression methods, the random forest algorithm, a Bagging method, tries to select an optimal feature subset and has been applied to many gene regulatory network modeling problems. For example, a random-forest-based network inference method decomposes a network of p genes into p independent subproblems, each predicting the expression value of a target gene from the expression data of its regulatory factors. Although random forests perform well, tree regression itself has limitations, such as choosing the tree depth and finding an optimal tree, and small perturbations of the parameters can strongly affect its results. Moreover, finding the optimal tree is an NP-hard problem, so global optimality is not guaranteed; a heuristic search algorithm can find a locally optimal solution, but the computation is relatively complex and convergence is relatively slow.
Tree regression methods fall into two categories by principle, Boosting and Bagging. When performing feature selection, Boosting mainly reduces model bias, whereas Bagging reduces model variance. Because the two focus on different things, their inference results differ to some extent. Many network inference methods are available, and the network structures they infer by feature selection on the same expression data set differ; this diversity of results makes it possible to improve the accuracy of the final network structure. The motivation for model fusion is to extract the correct or highly credible results from the various inference methods and to build a highly credible network topology from them. The results of different methods are assumed to carry complementary information, that is, each method contributes some unique information, so the algorithm proposes an E-alpha fusion rule that extracts the relatively accurate predicted edges from each method and builds a more accurate and reliable model. Model fusion strategies can be divided into voting, averaging and learning methods. Averaging and learning methods differ mainly in how the model weights are chosen: a learning method accounts for differences in the credibility of the individual models by using a new learner and given labels to select the weight of each model. The difficulty is that network inference or reconstruction lacks such labels and is generally regarded as an unsupervised problem.
In structural inference problems, except where the regulatory network is completely unknown, the topology of a considerable number of networks is partly known; that is, the existence of some regulatory edges has been verified by experiments such as gene knockout. These experimentally verified directed regulatory edges can serve as prior information to guide weight assignment in multi-model fusion.
Preferably, the accuracy ACC_1 of the first feature importance scoring table is computed in step S3 as:
ACC_1 = β_1 × ar_1
Figure BDA0001782025790000051
Figure BDA0001782025790000052
where β_1 is the accuracy correction factor, ar_1 is the accuracy of the regulatory edges in the first feature importance scoring table that carry known network information, num(inference) is the number of inferred directed regulatory edges, and num(exp) is the number of experimentally verified directed regulatory edges;
when the rank of a directed regulatory edge in the sorted set E_1 of directed regulatory edges agrees with its rank in the sorted set S_1 of the known network, the event E_1 == S_1 holds and I(E_1 == S_1) = 1; otherwise I(E_1 == S_1) = 0;
the accuracy ACC_2 of the second feature importance scoring table is computed in step S3 as:
ACC_2 = β_2 × ar_2
Figure BDA0001782025790000053
Figure BDA0001782025790000054
where β_2 is the accuracy correction factor, ar_2 is the accuracy of the regulatory edges in the second feature importance scoring table that carry known network information, num(inference) is the number of inferred directed regulatory edges, and num(exp) is the number of experimentally verified directed regulatory edges;
when the rank of a directed regulatory edge in the sorted set E_2 of directed regulatory edges agrees with its rank in the sorted set S_2 of the known network, the event E_2 == S_2 holds and I(E_2 == S_2) = 1; otherwise I(E_2 == S_2) = 0.
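The thresholding and indicator counting behind step S3 can be sketched in Python as follows. This is a simplified illustration, not the patent's formula: the expressions for β and ar are in equation images not reproduced here, so the sketch omits the correction factor and replaces the rank-consistency indicator with a plain membership test. The function names and the toy table are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]        # (regulatory factor, target gene)
Row = Tuple[str, str, float]  # (regulatory factor, target gene, score)


def retained_edges(table: Iterable[Row], threshold: float) -> Set[Edge]:
    """Keep the directed edges whose score exceeds the threshold."""
    return {(reg, tgt) for reg, tgt, score in table if score > threshold}


def known_edge_hit_rate(table: List[Row], known: Set[Edge], threshold: float) -> float:
    """Share of experimentally verified edges that survive thresholding.

    Assumption: a simplified stand-in for ar; the correction factor beta
    is not modelled here.
    """
    if not known:
        raise ValueError("at least one verified edge is required")
    kept = retained_edges(table, threshold)
    hits = sum(1 for edge in known if edge in kept)  # sum of membership indicators
    return hits / len(known)


# Toy scoring table and verified edges (made-up values).
toy_table = [("G1", "G2", 0.9), ("G3", "G2", 0.4), ("G2", "G1", 0.1)]
verified = {("G1", "G2"), ("G2", "G1")}
rate = known_edge_hit_rate(toy_table, verified, threshold=0.3)
```

Running the same function on both scoring tables gives two comparable numbers, which is the role ACC_1 and ACC_2 play when the weights are assigned in step S4.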
Preferably, in step S4 the weight ω_1 of the first feature importance scoring table and the weight ω_2 of the second feature importance scoring table are computed as:
Figure BDA0001782025790000061
Figure BDA0001782025790000062
where α is a dynamic adjustment factor.
Preferably, step S5 comprises the following steps:
S51: normalizing the feature importance scores in the first feature importance scoring table, and normalizing the feature importance scores in the second feature importance scoring table;
S52: multiplying each normalized feature importance score in the first feature importance scoring table by the weight ω_1 to obtain a new first feature importance scoring table, and multiplying each normalized feature importance score in the second feature importance scoring table by the weight ω_2 to obtain a new second feature importance scoring table;
S53: merging the data of the new first and new second feature importance scoring tables into the final total scoring table, and sorting the total scoring table in descending order of feature importance score.
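Steps S51 to S53 can be sketched in Python as below. Two points are assumptions rather than statements from the text: the normalization is taken to be min-max scaling, and "merging" is taken to mean summing the two weighted scores of the same edge. The weights are passed in as plain numbers, and the names `min_max_normalize` and `fuse` are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]        # (regulatory factor, target gene)
Row = Tuple[str, str, float]  # (regulatory factor, target gene, score)


def min_max_normalize(table: List[Row]) -> Dict[Edge, float]:
    """S51 (assumed form): scale the scores of one table to [0, 1]."""
    scores = [s for _, _, s in table]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return {(r, t): ((s - lo) / span if span else 0.0) for r, t, s in table}


def fuse(table1: List[Row], table2: List[Row], w1: float, w2: float) -> List[Tuple[Edge, float]]:
    """S52 and S53: weight each normalized table, merge, sort descending."""
    n1 = min_max_normalize(table1)
    n2 = min_max_normalize(table2)
    edges = set(n1) | set(n2)
    # Assumed merge rule: add the weighted scores of the same edge.
    total = {e: w1 * n1.get(e, 0.0) + w2 * n2.get(e, 0.0) for e in edges}
    return sorted(total.items(), key=lambda kv: kv[1], reverse=True)


# Toy tables with made-up scores on different scales.
rf_rows = [("G1", "G2", 10.0), ("G2", "G3", 5.0), ("G3", "G1", 0.0)]
gb_rows = [("G1", "G2", 0.2), ("G2", "G3", 0.8), ("G3", "G1", 0.2)]
total_table = fuse(rf_rows, gb_rows, w1=0.6, w2=0.4)
```

Normalizing before weighting matters because the two algorithms report importance on different scales; without it the weights would not control each table's actual contribution.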
Preferably, step S6 comprises the following steps:
obtaining the final biological network structure from the total scoring table, plotting the ROC curve, and expressing the accuracy of the inferred final biological network structure by the area under the ROC curve (AUC);
Figure BDA0001782025790000063
Figure BDA0001782025790000071
Figure BDA0001782025790000072
where i is the index of a directed regulatory edge, L is the total number of directed regulatory edges in the gold-standard biological network, FP is the number of false positive predictions, TP is the number of true positive predictions, TN is the number of true negatives, and FN is the number of false negatives; a true positive (TP) means that an inferred directed regulatory edge really exists in the gold-standard biological network, and a false negative (FN) means that a regulatory relationship absent from the inferred network does exist in the gold-standard biological network.
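The ROC and AUC evaluation in step S6 can be sketched in Python with the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The patent's own formulas are in equation images not reproduced here, so this is a generic sketch; it assumes the ranking lists every candidate edge exactly once and ignores tied scores. The function names and the toy ranking are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (regulatory factor, target gene)


def roc_points(ranked: Sequence[Edge], gold: Set[Edge]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) after each cutoff of a descending-score ranking."""
    positives = sum(1 for e in ranked if e in gold)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative edge")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for edge in ranked:
        if edge in gold:
            tp += 1  # inferred edge exists in the gold standard
        else:
            fp += 1  # inferred edge is absent from the gold standard
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy ranking (highest score first) and gold-standard edges.
ranked = [("G1", "G2"), ("G3", "G2"), ("G2", "G3"), ("G3", "G1")]
gold = {("G1", "G2"), ("G2", "G3")}
curve = roc_points(ranked, gold)
area = auc(curve)
```

In the toy ranking, three of the four (true edge, false edge) pairs are ordered correctly, which matches the computed area of 0.75.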
The beneficial effects of the invention are as follows: when inferring a biological network, the algorithm uses an expression data set as the information source, scores feature importance separately with a random forest algorithm and a gradient boosting tree algorithm, and then obtains the final scoring table and network structure with the E-alpha weighted fusion strategy, a fusion strategy based on known (prior) information, thereby improving the accuracy and stability of network inference.
Drawings
FIG. 1 is a flow chart of the biological network inference of the invention;
FIG. 2 is the ROC curve for structural inference on size10-network1 of the DREAM4 data set, as provided by the embodiment;
FIG. 3 is the ROC curve for structural inference on size100-network2 of the DREAM4 data set, as provided by the embodiment;
FIG. 4 is the ROC curve for structural inference on net2 of the DREAM5 data set, as provided by the embodiment.
Detailed Description
The technical solution of the invention is described in further detail below through an embodiment and with reference to the accompanying drawings.
Embodiment: as shown in FIG. 1, the biological network reasoning algorithm based on ensemble learning of this embodiment comprises the following steps:
S1: for a biological network inference problem involving n genes, taking a microarray time-series expression data set of n genes × m observation points, stored as a two-dimensional matrix, as the training sample, and constructing the training set corresponding to each gene;
the training set corresponding to the jth gene is constructed as follows:
partitioning the microarray expression data set with the jth gene as the target gene, taking the expression time-series values of the jth gene as the output and the expression time-series values of all other genes, acting as regulatory factors, as the input, and constructing the input samples
Figure BDA0001782025790000081
and the output samples
Figure BDA0001782025790000082
to obtain the training set
Figure BDA0001782025790000083
where
Figure BDA0001782025790000084
denotes the expression value of the jth gene in the kth experiment, with 1 ≤ j ≤ n, where n is an integer;
S2: computing the feature importance score of each regulatory factor and target gene pair with the random forest algorithm to produce the first feature importance scoring table, sorted in descending order of score, so that directed regulatory edges with higher scores rank higher in the first feature importance scoring table;
computing the feature importance score of each regulatory factor and target gene pair with the gradient boosting tree algorithm to produce the second feature importance scoring table, sorted in descending order of score, so that directed regulatory edges with higher scores rank higher in the second feature importance scoring table;
each scoring table has four columns: the name of the regulatory factor, the name of the target gene, the feature score, and the label; a feature importance score above the set threshold indicates that the corresponding directed regulatory edge exists;
S3: based on the first and second feature importance scoring tables, selecting a threshold according to the complexity of the biological network, retaining the directed regulatory edges whose scores exceed the threshold, and reconstructing the biological network; taking the experimentally verified directed regulatory edges of the biological network as the gold standard, computing the accuracy ACC_1 of the first feature importance scoring table and the accuracy ACC_2 of the second feature importance scoring table;
S4: from ACC_1 and ACC_2, computing the weight ω_1 of the first feature importance scoring table and the weight ω_2 of the second feature importance scoring table using the E-alpha weighting rule;
S5: using the weights ω_1 and ω_2, fusing the first and second feature importance scoring tables by weighting to obtain the final total scoring table;
S6: obtaining the final biological network structure from the total scoring table, and computing the accuracy of the inferred final biological network structure.
The gradient boosting tree algorithm uses a classification and regression tree (CART) as its base learner. The parameters to be set for it in this embodiment are the number of base learners (n_estimators), the learning rate (learning_rate), the loss function (loss_function), the maximum depth (max_depth), and the subsampling rate (subsample). Once the feature selection algorithm has produced the scoring table, setting a suitable threshold essentially determines the network structure, so in this embodiment the feature selection algorithm is equivalent to the network inference algorithm.
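The scoring step can be sketched with scikit-learn, assuming its `RandomForestRegressor` and `GradientBoostingRegressor` stand in for the two algorithms; the text does not name a library, so this choice is an assumption. The helper `score_table`, the toy data, and the hyperparameter values are hypothetical, and the label column of the scoring table is omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor


def score_table(expr, gene_names, make_model):
    """Build a descending scoring table of (regulatory factor, target, score).

    expr has shape (n_genes, m_observations). One regression model is
    fitted per target gene, and its feature importances become edge scores.
    """
    n = expr.shape[0]
    rows = []
    for j in range(n):
        regulators = [g for g in range(n) if g != j]
        X = expr[regulators].T  # shape (m, n - 1): one column per regulator
        y = expr[j]             # expression of the target gene
        model = make_model().fit(X, y)
        for g, importance in zip(regulators, model.feature_importances_):
            rows.append((gene_names[g], gene_names[j], float(importance)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy data: 4 genes x 30 observations; G2 is made to depend on G1.
rng = np.random.default_rng(0)
expr = rng.random((4, 30))
expr[1] = 0.8 * expr[0] + 0.05 * rng.random(30)
names = ["G1", "G2", "G3", "G4"]

rf_table = score_table(
    expr, names, lambda: RandomForestRegressor(n_estimators=50, random_state=0)
)
gb_table = score_table(
    expr,
    names,
    lambda: GradientBoostingRegressor(
        n_estimators=50, learning_rate=0.1, max_depth=3, subsample=0.8, random_state=0
    ),
)
```

Passing a model factory keeps the per-gene loop identical for both algorithms, so the two tables differ only in the learner that produced the importances.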
The accuracy ACC_1 of the first feature importance scoring table is computed in step S3 as:
ACC_1 = β_1 × ar_1
Figure BDA0001782025790000101
Figure BDA0001782025790000102
where β_1 is the accuracy correction factor, ar_1 is the accuracy of the directed regulatory edges in the first feature importance scoring table that carry known network information, num(inference) is the number of inferred directed regulatory edges, and num(exp) is the number of experimentally verified directed regulatory edges;
when the rank of a directed regulatory edge in the sorted set E_1 of directed regulatory edges agrees with its rank in the sorted set S_1 of the known network, the event E_1 == S_1 holds and I(E_1 == S_1) = 1; otherwise I(E_1 == S_1) = 0;
the accuracy ACC_2 of the second feature importance scoring table is computed in step S3 as:
ACC_2 = β_2 × ar_2
Figure BDA0001782025790000103
Figure BDA0001782025790000104
where β_2 is the accuracy correction factor, ar_2 is the accuracy of the directed regulatory edges in the second feature importance scoring table that carry known network information, num(inference) is the number of inferred directed regulatory edges, and num(exp) is the number of experimentally verified directed regulatory edges;
when the rank of a directed regulatory edge in the sorted set E_2 of directed regulatory edges agrees with its rank in the sorted set S_2 of the known network, the event E_2 == S_2 holds and I(E_2 == S_2) = 1; otherwise I(E_2 == S_2) = 0.
In step S4, the weight ω_1 of the first feature importance scoring table and the weight ω_2 of the second feature importance scoring table are computed as:
Figure BDA0001782025790000105
Figure BDA0001782025790000111
where α is a dynamic adjustment factor. The optimal α is found by an optimization method so that the accuracy of model training is highest.
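The search for α can be sketched in Python as a simple grid search. The weight formulas themselves are in equation images not reproduced here, so the softmax-style rule below, ω_i = exp(α·ACC_i) / Σ_j exp(α·ACC_j), is a purely hypothetical stand-in and not the patent's E-alpha rule; the evaluation callback and the candidate grid are likewise illustrative assumptions.

```python
import math
from typing import Callable, Sequence, Tuple


def softmax_weights(acc1: float, acc2: float, alpha: float) -> Tuple[float, float]:
    """HYPOTHETICAL stand-in for the weighting rule (not the patent's formula)."""
    e1 = math.exp(alpha * acc1)
    e2 = math.exp(alpha * acc2)
    return e1 / (e1 + e2), e2 / (e1 + e2)


def search_alpha(
    acc1: float,
    acc2: float,
    candidates: Sequence[float],
    evaluate: Callable[[float, float], float],
) -> Tuple[float, float]:
    """Grid search: return (best_alpha, best_score).

    evaluate(w1, w2) should return the accuracy of the network obtained
    by fusing the two scoring tables with those weights.
    """
    best = None
    for alpha in candidates:
        w1, w2 = softmax_weights(acc1, acc2, alpha)
        score = evaluate(w1, w2)
        if best is None or score > best[1]:
            best = (alpha, score)
    return best


def toy_evaluate(w1: float, w2: float) -> float:
    """Placeholder objective that peaks when w1 is 0.7 (made up for the demo)."""
    return -abs(w1 - 0.7)


best_alpha, best_score = search_alpha(0.8, 0.6, [0.0, 2.0, 4.0, 8.0], toy_evaluate)
```

In practice `evaluate` would fuse the two tables and score the result against the experimentally verified edges, so that α is tuned on the known part of the network only.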
Step S5 comprises the following steps:
S51: normalizing the feature importance scores in the first feature importance scoring table, and normalizing the feature importance scores in the second feature importance scoring table;
S52: multiplying each normalized feature importance score in the first feature importance scoring table by the weight ω_1 to obtain a new first feature importance scoring table, and multiplying each normalized feature importance score in the second feature importance scoring table by the weight ω_2 to obtain a new second feature importance scoring table;
S53: merging the data of the new first and new second feature importance scoring tables into the final total scoring table, and sorting the total scoring table in descending order of feature importance score.
The final model structure is determined from the final total scoring table, and the accuracy of the inference result can be evaluated with the precision-recall (PR) curve and the receiver operating characteristic (ROC) curve. The PR curve plots precision P against recall R. In a classification problem, precision is the proportion of true positives among the examples predicted positive, whereas recall is the proportion of actual positives that are correctly retrieved; the two measures trade off against each other.
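The points of a PR curve can be computed from a ranked edge list with the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The Python sketch below assumes the ranking is already sorted by descending score and ignores ties; the function name and toy data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (regulatory factor, target gene)


def precision_recall_points(ranked: Sequence[Edge], gold: Set[Edge]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after keeping the top k edges, k = 1..len(ranked)."""
    if not gold:
        raise ValueError("the gold standard must contain at least one edge")
    tp = 0
    points = []
    for k, edge in enumerate(ranked, start=1):
        if edge in gold:
            tp += 1
        # Of the k edges kept so far, tp are correct; len(gold) edges are true overall.
        points.append((tp / len(gold), tp / k))
    return points


# Toy ranking (highest score first) and gold-standard edges.
ranked = [("G1", "G2"), ("G3", "G2"), ("G2", "G3"), ("G3", "G1")]
gold = {("G1", "G2"), ("G2", "G3")}
pr_curve = precision_recall_points(ranked, gold)
```

Lowering the cutoff can only raise recall, while precision typically falls as more false edges are admitted, which is the trade-off the paragraph describes.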
Step S6 comprises the following steps:
obtaining the final biological network structure from the total scoring table, plotting the ROC curve, and expressing the accuracy of the inferred final biological network structure by the area under the ROC curve (AUC);
Figure BDA0001782025790000121
Figure BDA0001782025790000122
Figure BDA0001782025790000123
where i is the index of a directed regulatory edge, a directed regulatory edge being a directed connection between genes, L is the total number of directed regulatory edges in the gold-standard biological network, FP is the number of false positive predictions, TP is the number of true positive predictions, TN is the number of true negatives, and FN is the number of false negatives; a true positive (TP) means that an inferred directed regulatory edge really exists in the gold-standard biological network, and a false negative (FN) means that a regulatory relationship absent from the inferred network does exist in the gold-standard biological network.
The gold standard is the true structure of the biological network. Because the training set in the standard test cases is generated by dynamic simulation from a known network structure, the network reconstructed by the ensemble learning algorithm can be compared with the true network to determine which directed regulatory edges are predicted correctly and which are predicted incorrectly.
False positive (FP) and true positive (TP) are concepts from classification; the higher the true positive rate, the more accurate the prediction of directed regulatory edges. Reconstructing a biological network amounts to deciding whether a link exists between each pair of nodes; the label takes only two values, label = 1 or label = 0, so reconstruction can be treated as a binary classification problem.
The algorithm combines a directed graph and a tree regression method to reason a gene network, when the directed graph is adopted to describe the gene network, nodes represent genes or transcription factors, and directed regulation edges between the nodes represent regulation relations.
Considering the topological characteristics of the gene network structure, such as sparsity and high cohesion, under the assumption of a linear model, the structural inference problem of the gene network can be solved by a regression method in machine learning, namely a typical linear regression method and a tree regression method. The tree regression method is an effective method by scoring the feature importance of regulatory factor and target gene pairs. In the tree regression method, a random forest algorithm belonging to the Bagging method tries to select an optimal feature subset, and has been applied to a plurality of gene regulation network modeling problems, for example, a network reasoning method based on random forests decomposes a network with p genes into p independent subproblems, and expression values of target genes are predicted by using expression data of regulatory factors. Although the random forest has a good effect, the tree regression itself has limitations, such as the depth of the tree and the acquisition of an optimal tree, and the result of the tree regression can be greatly influenced by small parameter disturbance, and in addition, the solution of the optimal tree is an NP-hard problem, the global optimality is not guaranteed, a local optimal solution can be obtained by using a heuristic search algorithm, but the calculation is relatively complex and the convergence time is relatively long.
The tree regression method can be divided into two categories, namely Boosting and Bagging methods according to the principle, wherein the Boosting method mainly focuses on reducing model deviation when feature selection is carried out, and the Bagging method reduces model variance. Because the two methods have different attention points, the results obtained by reasoning have certain difference. Various network reasoning methods can be selected, network reasoning structures obtained by adopting the same expression data set to perform feature selection are different, and the reasoning results are diversified, so that the possibility is provided for improving the accuracy of the final network structure. The motivation of the model fusion work extracts the results with correct reasoning or high credibility in various reasoning methods, and further constructs a network topology structure with high credibility. The results obtained by different methods are assumed to have certain complementary information, namely each method has part of unique information, so that the algorithm provides an E-alpha fusion rule to extract relatively accurate prediction edges from the respective methods and construct a more accurate and reliable model. The model fusion strategy can be divided into a voting method, an averaging method and a learning method. The main difference between the averaging and learning methods is in the choice of model weights, which uses a new learner and assigned labels to select the weights for each model, taking into account the variability in the confidence levels of the individual models. The difficulty is that the network reasoning or reconstruction problem lacks such labels and is generally considered to be an unsupervised problem. 
For the structural inference problem, except where the regulatory network is completely unknown, a considerable part of the network topology is already known; that is, the existence of some regulatory edges has been verified through experiments such as gene knockout. These experimentally verified directed regulatory edges can serve as prior information to guide the allocation of weights in multi-model fusion.
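One way to see how experimentally verified edges can act as prior information is to measure, for each scoring method, how many of the known edges it ranks above a threshold. The sketch below is only an illustrative stand-in written in Python: the function `known_edge_recovery`, the threshold value and the toy scores are assumptions, and the measure is not the ACC formula of the claims, which is given only as figures.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def known_edge_recovery(scores: Dict[Edge, float],
                        known_edges: Set[Edge],
                        threshold: float) -> float:
    """Fraction of experimentally verified edges whose score exceeds the threshold."""
    if not known_edges:
        raise ValueError("at least one known edge is required")
    kept = {edge for edge, score in scores.items() if score > threshold}
    return len(kept & known_edges) / len(known_edges)


# Hypothetical importance scores from two methods on the same candidate edges.
rf_scores = {("G1", "G2"): 0.9, ("G3", "G1"): 0.2, ("G2", "G3"): 0.6}
gbdt_scores = {("G1", "G2"): 0.8, ("G3", "G1"): 0.7, ("G2", "G3"): 0.1}
known = {("G1", "G2"), ("G3", "G1")}  # edges assumed verified by experiment

rf_recovery = known_edge_recovery(rf_scores, known, threshold=0.5)
gbdt_recovery = known_edge_recovery(gbdt_scores, known, threshold=0.5)
```

A method that recovers more of the known edges would then receive a larger weight in the fusion step.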
In this embodiment, microarray time series expression data provided by the DREAM Challenge platform are used for network inference. The expression data have the format n genes × m observation points. The inferred network structure is compared with a gold standard network, the performance of the algorithm is evaluated with ROC and PR curves, and the method with the larger area under the curve is considered more effective.
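The area under the ROC curve used for this comparison can be computed directly from edge scores and a gold standard. The sketch below is a minimal Python illustration that uses the standard equivalence between ROC AUC and the probability that a true edge outscores a non-edge (ties counted as one half); the function name `roc_auc` and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def roc_auc(scores: Dict[Edge, float], gold: Set[Edge]) -> float:
    """Area under the ROC curve for scored candidate edges.

    Computed as the fraction of (gold edge, non-gold edge) pairs in which
    the gold edge has the higher score; a tie counts as half a win.
    """
    positives = [s for edge, s in scores.items() if edge in gold]
    negatives = [s for edge, s in scores.items() if edge not in gold]
    if not positives or not negatives:
        raise ValueError("need at least one gold edge and one non-gold edge")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four candidate edges, two of which are in the gold standard.
scores = {
    ("G1", "G2"): 0.9,
    ("G2", "G3"): 0.8,
    ("G3", "G1"): 0.7,
    ("G3", "G2"): 0.1,
}
gold = {("G1", "G2"), ("G3", "G1")}
auc = roc_auc(scores, gold)
```

The pairwise loop is quadratic in the number of edges, which is acceptable for small networks; a rank-based computation would be preferable for networks of DREAM5 size.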
The simulated networks come from the DREAM4 and DREAM5 platforms. The DREAM4 networks come in two sizes, size10 and size100, while the DREAM5 networks involve more than 1500 genes. The DREAM4 gene networks are sub-networks extracted from the gene regulatory networks of model organisms such as Escherichia coli and yeast, and the extracted sub-networks exhibit topological properties such as sparsity and clustering. Given a known network structure, the GeneNetWeaver software generates the corresponding time series expression data through mechanistic simulation. The simulated expression data are fed to the ensemble learning method as a training set to infer the directed edges between nodes, the inferred network structure is compared with the true network structure, and the relevant performance metrics are calculated.
For Table 1, five size10 networks from the DREAM4 data set were selected for network structure inference tests. The regulatory relationships of each gene network were reconstructed from time series expression data, and the accuracy of the network topologies obtained with the gradient boosting decision tree (GBDT), the random forest (RF), mean weighting (MEAN) and the E-alpha weighting rule (E-alpha) was compared. Network inference accuracy is measured quantitatively by the area under the ROC curve (AUC).
Size10        Net1   Net2   Net3   Net4   Net5
Edges         15     16     15     13     12
GBDT-AUC      0.669  0.707  0.598  0.392  0.652
RF-AUC        0.628  0.722  0.627  0.263  0.705
MEAN-AUC      0.646  0.723  0.609  0.363  0.684
E-alpha-AUC   0.675  0.726  0.627  0.395  0.705
TABLE 1
Table 1 compares the AUC values obtained when inferring the DREAM4 size10 gene regulatory networks. The AUC results show that E-alpha matches or exceeds each single method and the mean fusion strategy on every network. The reason is that the E-alpha weighting term multiplies the score by the correction factor alpha and then applies the exponential function, and alpha is chosen during optimization as the value that gives the largest AUC. If one method performs better, the fused result leans toward that method. The ROC curves corresponding to mean weighting (MEAN) and the E-alpha weighting rule (E-alpha) are shown in FIG. 2.
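The search for the alpha that maximizes AUC can be sketched as a simple grid search. The Python sketch below rests on several assumptions that are not stated in the text: the weights take the normalized exponential form exp(alpha × score) / sum of exp(alpha × score), the fused score is the weighted sum of the two tables, and AUC is evaluated against a set of known edges. The names `pairwise_auc` and `search_alpha`, the candidate alpha grid and all numbers are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def pairwise_auc(scores: Dict[Edge, float], gold: Set[Edge]) -> float:
    """ROC AUC as the fraction of (gold, non-gold) pairs ranked correctly."""
    pos = [s for e, s in scores.items() if e in gold]
    neg = [s for e, s in scores.items() if e not in gold]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def search_alpha(table1: Dict[Edge, float], table2: Dict[Edge, float],
                 acc1: float, acc2: float, gold: Set[Edge],
                 alphas: Iterable[float]) -> Tuple[Optional[float], float]:
    """Return the candidate alpha whose fused ranking gives the largest AUC."""
    best_alpha, best_auc = None, -1.0
    for alpha in alphas:
        # Assumed weighting: normalized exponentials, shifted by the maximum
        # so that large alpha values do not overflow.
        top = max(acc1, acc2)
        e1 = math.exp(alpha * (acc1 - top))
        e2 = math.exp(alpha * (acc2 - top))
        w1, w2 = e1 / (e1 + e2), e2 / (e1 + e2)
        fused = {e: w1 * table1[e] + w2 * table2[e] for e in table1}
        auc = pairwise_auc(fused, gold)
        if auc > best_auc:  # keep the first alpha that reaches the best AUC
            best_alpha, best_auc = alpha, auc
    return best_alpha, best_auc


# Hypothetical scoring tables over the same four candidate edges.
a, b, c, d = ("G1", "G2"), ("G2", "G3"), ("G3", "G1"), ("G3", "G2")
rf = {a: 0.9, b: 0.8, c: 0.1, d: 0.2}
gbdt = {a: 0.6, b: 0.1, c: 0.7, d: 0.2}
gold = {a, c}
best_alpha, best_auc = search_alpha(rf, gbdt, 0.5, 1.0, gold, [0.01, 1.0, 10.0, 300.0])
```

In this toy case a very small alpha behaves like plain averaging and leaves one gold edge ranked too low, while larger values shift weight toward the more accurate table.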
For Table 2, five size100 networks from the DREAM4 data set were selected for repeated tests. The regulatory relationships of each gene network were reconstructed from time series expression data, and the accuracy of the network topologies obtained with the gradient boosting decision tree (GBDT), the random forest (RF), mean weighting (MEAN), linear regression (TIGRESS) and the E-alpha weighting rule (E-alpha) was compared. The ROC curves corresponding to mean weighting (MEAN) and the E-alpha weighting rule (E-alpha) are shown in FIG. 3.
Size100       Net1   Net2   Net3   Net4   Net5
Edges         176    249    195    211    193
GBDT-AUC      0.754  0.698  0.762  0.823  0.780
RF-AUC        0.759  0.733  0.767  0.798  0.795
MEAN-AUC      0.760  0.723  0.770  0.810  0.795
TIGRESS-AUC   0.750  0.700  0.760  0.770  0.750
E-alpha-AUC   0.770  0.737  0.771  0.822  0.797
TABLE 2
For Table 3, the data sets net1 and net2 of DREAM5 were selected for network inference tests. The regulatory relationships of each gene network were reconstructed from time series expression data, and the accuracy of the network topologies obtained with the gradient boosting decision tree (GBDT), the random forest (RF), mean weighting (MEAN), linear regression (TIGRESS) and the E-alpha weighting rule (E-alpha) was compared.
Figure BDA0001782025790000161
TABLE 3
Net2 of the DREAM5 data set was selected for the network inference test; the ROC curves corresponding to mean weighting (MEAN) and the E-alpha weighting rule (E-alpha) are shown in FIG. 4.
The results in Tables 1, 2 and 3 show that the accuracy of the E-alpha strategy is slightly higher than that of the mean fusion strategy. Although the E-alpha weighting rule is slightly less accurate than the gradient boosting decision tree on size100 Net4, overall it is more accurate than any single structure inference algorithm and than the mean weighting method.
Further analysis of the E-alpha weighting rule shows that the difference between GBDT-AUC and RF-AUC influences the accuracy of the final model after weighted combination. This is because the E-alpha algorithm multiplies the score by the coefficient alpha and then applies an exponential operation. The larger the value of alpha, the larger the differences between the arguments of the exponential (exp), and the larger the differences between the values after the exp operation. Taking DREAM5 net2 as an example, the alpha found by the search is 0.01, so the values of exp(alpha × score) are almost equal for the two learners, and the weight of each learner after normalization is approximately 0.5. For DREAM5 net1, the best alpha found by the optimization is 300, and after exp(alpha × score) the algorithm leans strongly toward the learner with the higher score. The E-alpha weighting rule can therefore strike a balance between the best single method and the mean, allocating the model weights reasonably so that the final network inference accuracy is highest.
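The two regimes described above, near-equal weights for a small alpha and a strong preference for the better learner for a large alpha, can be reproduced numerically. The Python sketch below assumes the weights are the normalized values of exp(alpha × score); this form is inferred from the description, since the weight formulas themselves appear only as figures. The function name `e_alpha_weights` and the two example scores are hypothetical.

```python
import math
from typing import List


def e_alpha_weights(scores: List[float], alpha: float) -> List[float]:
    """Normalized exponential weights: exp(alpha * s_i) / sum_j exp(alpha * s_j).

    Scores are shifted by their maximum before exponentiation; this leaves
    the weights unchanged and avoids overflow for large alpha.
    """
    top = max(scores)
    exps = [math.exp(alpha * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical accuracy scores of two learners.
scores = [0.75, 0.80]
near_mean = e_alpha_weights(scores, 0.01)   # small alpha: close to plain averaging
near_best = e_alpha_weights(scores, 300.0)  # large alpha: dominated by the better learner
```

With alpha = 0.01 both weights stay within a fraction of a percent of 0.5, and with alpha = 300 almost all of the weight goes to the learner scoring 0.80, which mirrors the behavior reported for DREAM5 net2 and net1.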

Claims (4)

1. A biological network inference algorithm based on ensemble learning, characterized by comprising the following steps:
S1: for a biological network inference problem consisting of n genes, taking a time series microarray expression data set of n genes × m observations as the training sample, and computing the training set corresponding to each gene;
wherein the training set corresponding to the jth gene is computed as follows:
partitioning the microarray expression data set with the jth gene as the target gene, taking the expression time series of the jth gene as the output and the expression time series of all other genes as the regulatory factor inputs, and constructing input samples
Figure FDA0003502229300000011
and output samples
Figure FDA0003502229300000012
to obtain a training set
Figure FDA0003502229300000013
wherein
Figure FDA0003502229300000014
represents the expression value of the jth gene in the kth experiment, where 1 ≤ j ≤ n;
S2: calculating the feature importance score of each regulatory factor and target gene pair with a random forest algorithm, producing a corresponding first feature importance scoring table, and sorting the feature importance scores in descending order, so that a directed regulatory edge with a larger score ranks higher in the first feature importance scoring table;
calculating the feature importance score of each regulatory factor and target gene pair with a gradient boosting decision tree algorithm, producing a corresponding second feature importance scoring table, and sorting the feature importance scores in descending order, so that a directed regulatory edge with a larger score ranks higher in the second feature importance scoring table;
S3: based on the first feature importance scoring table and the second feature importance scoring table, selecting a threshold according to the complexity of the biological network, retaining the directed regulatory edges that exceed the threshold, reconstructing the biological network, taking the experimentally verified directed regulatory edges of the biological network as the gold standard, and calculating the accuracy ACC1 of the first feature importance scoring table and the accuracy ACC2 of the second feature importance scoring table;
S4: according to the accuracy ACC1 of the first feature importance scoring table and the accuracy ACC2 of the second feature importance scoring table, calculating the weight ω1 of the first feature importance scoring table and the weight ω2 of the second feature importance scoring table with the E-alpha weighting rule, wherein the weight ω1 of the first feature importance scoring table and the weight ω2 of the second feature importance scoring table are calculated by the following formulas:
Figure FDA0003502229300000021
Figure FDA0003502229300000022
wherein alpha is a dynamic adjustment factor;
S5: according to the weight ω1 of the first feature importance scoring table and the weight ω2 of the second feature importance scoring table, performing weighted fusion of the first feature importance scoring table and the second feature importance scoring table to obtain a final total scoring table;
S6: obtaining the final biological network structure from the total scoring table, and calculating the accuracy of the inferred final biological network structure.
2. The ensemble learning-based biological network inference algorithm according to claim 1, wherein the accuracy ACC1 of the first feature importance scoring table is calculated in step S3 by the following formulas:
ACC1 = β1 × ar1
Figure FDA0003502229300000023
Figure FDA0003502229300000024
wherein β1 is the accuracy correction factor, ar1 is the accuracy of the regulatory edges in the first feature importance scoring table that contain known network information, num(inference) denotes the number of inferred directed regulatory edges, and num(exp) denotes the number of directed regulatory edges confirmed by experimental verification;
when the rank of the corresponding directed regulatory edge in the directed regulatory edge ranking set E1 is consistent with its rank in the known network ranking set S1, the event E1 == S1 holds and I(E1 == S1) = 1; otherwise I(E1 == S1) = 0;
and the accuracy ACC2 of the second feature importance scoring table is calculated in step S3 by the following formulas:
ACC2 = β2 × ar2
Figure FDA0003502229300000031
Figure FDA0003502229300000032
wherein β2 is the accuracy correction factor, ar2 is the accuracy of the regulatory edges in the second feature importance scoring table that contain known network information, num(inference) denotes the number of inferred directed regulatory edges, and num(exp) denotes the number of directed regulatory edges confirmed by experimental verification;
when the rank of the corresponding directed regulatory edge in the directed regulatory edge ranking set E2 is consistent with its rank in the known network ranking set S2, the event E2 == S2 holds and I(E2 == S2) = 1; otherwise I(E2 == S2) = 0.
3. The ensemble learning-based biological network inference algorithm according to claim 1 or 2, wherein step S5 comprises the following steps:
S51: normalizing the feature importance scores in the first feature importance scoring table, and normalizing the feature importance scores in the second feature importance scoring table;
S52: multiplying each normalized feature importance score in the first feature importance scoring table by the weight ω1 of the first feature importance scoring table to obtain a new first feature importance scoring table, and multiplying each normalized feature importance score in the second feature importance scoring table by the weight ω2 of the second feature importance scoring table to obtain a new second feature importance scoring table;
S53: combining the data in the new first feature importance scoring table and the new second feature importance scoring table to obtain a final total scoring table, and sorting the total scoring table in descending order of feature importance score.
4. The ensemble learning-based biological network inference algorithm according to claim 1 or 2, wherein step S6 comprises the following steps:
obtaining the final biological network structure from the total scoring table, drawing an ROC curve, and expressing the accuracy of the inferred final biological network structure by the area under the ROC curve (AUC);
Figure FDA0003502229300000041
Figure FDA0003502229300000042
Figure FDA0003502229300000043
wherein i denotes the index of a directed regulatory edge, L denotes the total number of directed regulatory edges contained in the gold standard biological network, FP denotes the number of false positive predictions, TP denotes the number of true positive predictions, TN denotes the number of true negatives, and FN denotes the number of false negatives; the event corresponding to a true positive (TP) is that an inferred directed regulatory edge actually exists in the gold standard biological network, and the event corresponding to a true negative (TN) is that a regulatory relationship absent from the inferred network is also absent from the gold standard biological network.
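The weighted fusion of steps S51 to S53 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the claims do not specify the normalization or how the two weighted tables are combined, so min-max normalization and summation per edge are assumed here, and the function names, weights and scores are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def min_max_normalize(table: Dict[Edge, float]) -> Dict[Edge, float]:
    """Rescale scores to [0, 1]; a constant table maps to all zeros."""
    lo, hi = min(table.values()), max(table.values())
    if hi == lo:
        return {edge: 0.0 for edge in table}
    return {edge: (score - lo) / (hi - lo) for edge, score in table.items()}


def fuse_tables(table1: Dict[Edge, float], table2: Dict[Edge, float],
                w1: float, w2: float) -> List[Tuple[Edge, float]]:
    """Normalize both tables, weight them, sum per edge, and sort descending."""
    n1, n2 = min_max_normalize(table1), min_max_normalize(table2)
    edges = set(n1) | set(n2)
    # An edge missing from one table contributes 0 from that table (assumption).
    total = {e: w1 * n1.get(e, 0.0) + w2 * n2.get(e, 0.0) for e in edges}
    return sorted(total.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw importance scores from the two methods.
rf_table = {("G1", "G2"): 10.0, ("G2", "G3"): 5.0, ("G3", "G1"): 0.0}
gbdt_table = {("G1", "G2"): 2.0, ("G2", "G3"): 4.0, ("G3", "G1"): 0.0}
ranking = fuse_tables(rf_table, gbdt_table, w1=0.4, w2=0.6)
top_edge, top_score = ranking[0]
```

The sorted list plays the role of the total scoring table: retaining its highest-ranked edges yields the inferred network, whose ranking can then be scored with the area under the ROC curve.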
CN201810998945.5A 2018-08-29 2018-08-29 Biological network reasoning algorithm based on ensemble learning Active CN109409522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810998945.5A CN109409522B (en) 2018-08-29 2018-08-29 Biological network reasoning algorithm based on ensemble learning

Publications (2)

Publication Number Publication Date
CN109409522A CN109409522A (en) 2019-03-01
CN109409522B true CN109409522B (en) 2022-04-12

Family

ID=65463732

Country Status (1)

Country Link
CN (1) CN109409522B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant