CN112951320B - Biomedical network association prediction method based on ensemble learning - Google Patents
Biomedical network association prediction method based on ensemble learning
- Publication number
- CN112951320B (application CN202110236007.3A)
- Authority
- CN
- China
- Prior art keywords
- prediction
- matrix
- biomedical
- network association
- association
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
To address the limitations of the prior art, the invention provides a biomedical network association prediction method based on ensemble learning. The method draws on the prediction results of a plurality of algorithms to extract the strong links within each of the two classes of biomedical entities and to construct corresponding low-dimensional features. A matrix factorization model then learns the relationship between the two sets of low-dimensional features to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of features. By combining the perspectives of multiple algorithms, the method can overcome the limitations of any single method and provide more accurate and more robust prediction results.
Description
Technical Field
The invention relates to the technical field of computational biology, in particular to biological data mining, and more particularly relates to a biomedical network association prediction method based on ensemble learning.
Background
The development of complex diseases such as cancer often results not from the deregulation or mutation of individual biomolecules but from dysfunction of the regulatory networks formed by interactions between biomolecules. As a disease arises and progresses, some biomolecules behave abnormally, and identifying the abnormal biomolecules that are highly correlated with the disease is very helpful for its prevention, diagnosis and treatment. In recent years, many studies have verified associations between different types of biological entities through biological experiments, such as associations between drugs and protein targets, between diseases and microRNAs, and between diseases and long non-coding RNAs (lncRNAs). However, identifying new biomedical associations by biological experiment is time-consuming and costly. In computer science, such problems can be abstracted as association prediction on a bipartite network; a conceptual diagram of the biomedical bipartite network is shown in fig. 1. Predicting potential associations computationally, and thereby providing references and suggestions for biological experiments, helps to improve the efficiency of biomedical association identification and to reduce its cost.
In the last decade, various computational methods have been applied to biomedical network association prediction. By principle, they fall roughly into three categories: network diffusion models, feature-based classification methods, and matrix factorization-based methods. Network diffusion models use graph-based methods to propagate known associations through the biomedical network and thereby predict potential associations. Feature-based classification methods represent each association by the features of the two nodes it connects and then feed these representations into a machine learning model for training. Matrix factorization-based methods learn two or more low-dimensional factor matrices from the biomedical association matrix and then multiply them to reconstruct the association matrix. However, biomedical association networks vary widely in type, and the assumptions of any single prediction method may not accurately characterize all data.
Chinese invention patent CN110993113A, published 2020.04.10, "Method and system for predicting lncRNA-disease relationships based on MF-SDAE", attempts to extract multiple features of lncRNAs and multiple features of diseases from several lncRNA databases and several disease databases in order to provide a fast and effective scheme, but that scheme still has certain limitations.
Disclosure of Invention
To address the limitations of the prior art, the invention provides a biomedical network association prediction method based on ensemble learning, which adopts the following technical scheme:
A biomedical network association prediction method based on ensemble learning comprises the following steps:
S1. Acquire the original similarity matrices and the association matrix of the two classes of biological entities to be predicted; apply a plurality of biomedical network association prediction algorithms, each of which performs association prediction on the biological entities from the original similarity matrices and the association matrix, to obtain the prediction result of each algorithm;
S2. Calculate the prediction similarity matrices of the biological entities from the prediction result of each algorithm; after sparsifying the prediction similarity matrices, combine the prediction results of all algorithms by weighted superposition to obtain the integrated similarity matrices of the biological entities;
S3. Extract low-dimensional features from the original similarity matrices by singular value decomposition, and construct an adaptive weighted ensemble matrix factorization model in combination with the integrated similarity matrices;
S4. Train and optimize the adaptive weighted ensemble matrix factorization model until it converges;
S5. Reconstruct a prediction matrix with the converged adaptive weighted ensemble matrix factorization model as the final result of the biological entity association prediction.
Compared with the prior art, the method draws on the prediction results of a plurality of algorithms to extract the strong links within each of the two classes of biomedical entities and to construct corresponding low-dimensional features. A matrix factorization model then learns the relationship between the two sets of low-dimensional features to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of features. By combining the perspectives of multiple algorithms, the method can overcome the limitations of any single method and provide more accurate and more robust prediction results.
As a preferred embodiment, in step S2 the prediction similarity matrices of the two classes of biological entities are calculated by the following formulas:
[formula image not available]
where $A = \{a_1, a_2, \dots, a_m\}$ and $B = \{b_1, b_2, \dots, b_n\}$ denote the two sets of biological entities; $Y^{(l)}(a_i)$ and $Y^{(l)}(b_i)$ denote, respectively, the $a_i$-th row vector and the $b_i$-th column vector of the prediction result $Y^{(l)}$ of the $l$-th algorithm; and two parameters, one per entity class, control the bandwidth of the function.
Further, in step S2, the prediction similarity matrices are sparsified by the following formula:
[formula image not available]
where $N(a_i)$ denotes the neighbor set of $a_i$ and $N(b_i)$ denotes the neighbor set of $b_i$.
Further, in step S2, the integrated similarity matrices $G_{AS}$ and $G_{BS}$ of the biological entities are calculated by the following formula:
[formula image not available]
Further, in step S3, the low-dimensional features $F_A$ and $F_B$ are extracted from the original similarity matrices $S_A$ and $S_B$ by the following formula:
[formula image not available]
where the dimensions of the low-dimensional features are set to $f_A$ ($f_A < m$) and $f_B$ ($f_B < n$).
As a preferred scheme, the adaptive weighted ensemble matrix factorization model is expressed by the following formula:
[formula image not available]
where $G_{AS} F_A$ constitutes the feature representation of the class A biological entities, $G_{BS} F_B$ constitutes the feature representation of the class B biological entities, $U$ and $V$ denote the embedding matrices that project the class A and class B biological entities into a shared $k$-dimensional space ($k \le \min(f_A, f_B)$), $u_i$ denotes the $i$-th row vector of $U$, and $v_j$ denotes the $j$-th row vector of $V$.
Further, in step S4, the adaptive weighted ensemble matrix factorization model is trained and optimized by solving the following objective function:
[formula image not available]
where $M$ is the number of biomedical network association prediction algorithms.
Further, in each iterative update of step S4, the optimization variables, namely $U$, $V$ and the two sets of algorithm weights, are updated alternately in turn.
Further, in step S4, when the optimization variables are updated alternately in turn, the partial derivatives of the objective function $L$ with respect to the variables $U$ and $V$ are expressed as follows:
[formula image not available]
where $\odot$ denotes the Hadamard product between matrices;
Further, in step S4, the convergence condition for training and optimizing the adaptive weighted ensemble matrix factorization model is:
[formula image not available]
where $L^{(k)}$ denotes the value of the objective function at the $k$-th iteration.
Drawings
FIG. 1 is a conceptual diagram of a biomedical bipartite network;
FIG. 2 is a flowchart of steps of a biomedical network association prediction method based on ensemble learning according to an embodiment of the present invention;
FIG. 3 is a logic diagram of a biomedical network association prediction method based on ensemble learning according to an embodiment of the present invention;
FIG. 4 is a graph of AUC on the LncRNADisease2015 dataset as a function of the parameter $k$ in an evaluation experiment in accordance with an embodiment of the present invention;
FIG. 5 is a graph of AUC on the LncRNADisease2015 dataset as a function of the parameter $\lambda$ in an evaluation experiment in accordance with an embodiment of the present invention;
FIG. 6 is a graph of AUC on the LncRNADisease2015 dataset as a function of the parameter $\lambda_w$ in an evaluation experiment in accordance with an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
It should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments that a person of ordinary skill in the art can obtain from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims. In the description of this application, it should be understood that the terms "first," "second," "third," and the like are used merely to distinguish between similar objects and are not necessarily used to describe a particular order or sequence, nor should they be construed to indicate or imply relative importance. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art as the case may be.
Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes a relationship between associated objects and covers three cases; for example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
To address the limitations of the prior art, the present embodiment provides a technical solution, which is further described below with reference to the drawings and the embodiments.
Referring to fig. 2, the biomedical network association prediction method based on ensemble learning includes the following steps:
S1. Acquire the original similarity matrices and the association matrix of the two classes of biological entities to be predicted; apply a plurality of biomedical network association prediction algorithms, each of which performs association prediction on the biological entities from the original similarity matrices and the association matrix, to obtain the prediction result of each algorithm;
S2. Calculate the prediction similarity matrices of the biological entities from the prediction result of each algorithm; after sparsifying the prediction similarity matrices, combine the prediction results of all algorithms by weighted superposition to obtain the integrated similarity matrices of the biological entities;
S3. Extract low-dimensional features from the original similarity matrices by singular value decomposition, and construct an adaptive weighted ensemble matrix factorization model in combination with the integrated similarity matrices;
S4. Train and optimize the adaptive weighted ensemble matrix factorization model until it converges;
S5. Reconstruct a prediction matrix with the converged adaptive weighted ensemble matrix factorization model as the final result of the biological entity association prediction.
Compared with the prior art, the method draws on the prediction results of a plurality of algorithms to extract the strong links within each of the two classes of biomedical entities and to construct corresponding low-dimensional features. A matrix factorization model then learns the relationship between the two sets of low-dimensional features to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of features. By combining the perspectives of multiple algorithms, the method can overcome the limitations of any single method and provide more accurate and more robust prediction results.
Specifically, similar techniques such as MF-SDAE also incorporate data from multiple sources and also employ matrix factorization, but they differ from the present method in two respects. First, MF-SDAE integrates multi-source data by simply stacking feature matrices, whereas the present method uses adaptive weighted superposition: the weights among the algorithms are adjusted automatically during model optimization, so the model adapts more flexibly to different data and its predictions are more robust. Second, in contrast to the matrix factorization used in MF-SDAE, the embodiment of the invention models the data with logistic matrix factorization, which better matches the binary (0/1) nature of biomedical association data and can describe such data more accurately.
In general, research on biomedical networks concerns the association between two classes of biological entities, such as drugs, protein targets and miRNAs. Two types of nodes, $A = \{a_1, a_2, \dots, a_m\}$ and $B = \{b_1, b_2, \dots, b_n\}$, can be used to represent the two sets of biological entities. The association matrix $Y \in \{0,1\}^{m \times n}$ represents the known biomedical associations: $Y_{ij} = 1$ indicates that there is an association between $a_i$ and $b_j$, and $Y_{ij} = 0$ indicates that the association between $a_i$ and $b_j$ is unknown. The purpose of a biomedical network association prediction algorithm is to find, among the unknown pairs, those with the highest probability of association. In biomedical association prediction tasks, the original similarity matrices of the biological entities, $S_A \in \mathbb{R}^{m \times m}$ and $S_B \in \mathbb{R}^{n \times n}$, are used as input in addition to the known association matrix. The prediction results calculated by the $M$ algorithms can then be expressed as $\{Y^{(1)}, Y^{(2)}, \dots, Y^{(M)}\}$.
In the prediction matrix reconstructed in step S5, the larger the value of an element, the greater the probability that the corresponding biomedical pair is a potential association.
As a preferred embodiment, in step S2 the prediction similarity matrices of the two classes of biological entities are calculated by the following formulas:
[formula image not available]
where $A = \{a_1, a_2, \dots, a_m\}$ and $B = \{b_1, b_2, \dots, b_n\}$ denote the two sets of biological entities; $Y^{(l)}(a_i)$ and $Y^{(l)}(b_i)$ denote, respectively, the $a_i$-th row vector and the $b_i$-th column vector of the prediction result $Y^{(l)}$ of the $l$-th algorithm; and two parameters, one per entity class, control the bandwidth of the function.
Specifically, for the prediction result $Y^{(l)}$ of the $l$-th algorithm, this step constructs one prediction similarity matrix along the row direction and one along the column direction. The construction is the same as in the Gaussian Interaction Profile method.
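The Gaussian Interaction Profile construction named above can be sketched in Python. This is a minimal sketch rather than the patent's exact formula: the function names `gip_similarity` and `prediction_similarities`, the bandwidth convention (a scale `gamma_prime` divided by the mean squared norm of the profiles, a common choice for GIP kernels) and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def gip_similarity(profiles, gamma_prime=1.0):
    """Gaussian kernel between the rows of `profiles` (one row per entity)."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    # Assumed bandwidth convention: normalise by the mean squared profile norm.
    mean_sq = sq_norms.mean()
    gamma = gamma_prime / mean_sq if mean_sq > 0 else gamma_prime
    # Pairwise squared Euclidean distances between profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

def prediction_similarities(Y_l):
    """Similarity among class A entities (rows) and class B entities (columns)."""
    Y_l = np.asarray(Y_l, dtype=float)
    return gip_similarity(Y_l), gip_similarity(Y_l.T)

# Toy prediction result of one algorithm: 3 class A entities, 3 class B entities.
Y_demo = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.2, 0.1],
                   [0.0, 0.7, 0.9]])
S_A_demo, S_B_demo = prediction_similarities(Y_demo)
```

In this toy example the first two rows of `Y_demo` have similar prediction profiles, so their similarity is higher than that between the first and third rows.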
Further, in step S2, the prediction similarity matrices are sparsified by the following formula:
[formula image not available]
where $N(a_i)$ denotes the neighbor set of $a_i$ and $N(b_i)$ denotes the neighbor set of $b_i$.
Specifically, this step sparsifies each prediction similarity matrix using K nearest neighbors, which filters out weak links that may carry noise while retaining the strong links in the network.
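A K-nearest-neighbor sparsification of this kind might look as follows. The sketch assumes the simplest rule, in which an entry is kept only when the column entity is among the `k` most similar entities of the row entity; the function name `knn_sparsify`, the exclusion of self-similarity and the fact that the result is not symmetrized are illustrative assumptions, and the patent's own formula may differ.

```python
import numpy as np

def knn_sparsify(S, k):
    """Keep, in each row of S, only the k largest off-diagonal similarities."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    k = min(k, n - 1)
    sparse = np.zeros_like(S)
    if k <= 0:
        return sparse
    work = S.copy()
    np.fill_diagonal(work, -np.inf)  # an entity is not its own neighbor
    # Column indices of the k most similar entities for every row.
    neighbours = np.argsort(-work, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    sparse[rows, neighbours] = S[rows, neighbours]
    return sparse

S_demo = np.array([[1.0, 0.9, 0.2, 0.1],
                   [0.9, 1.0, 0.3, 0.4],
                   [0.2, 0.3, 1.0, 0.8],
                   [0.1, 0.4, 0.8, 1.0]])
sparse_demo = knn_sparsify(S_demo, k=1)
```

With `k=1`, only the single strongest link of each entity survives; the weaker entries such as 0.2 or 0.1 are set to zero.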
Further, in step S2, the integrated similarity matrices $G_{AS}$ and $G_{BS}$ of the biological entities are calculated by the following formula:
[formula image not available]
Specifically, this step integrates by weighted superposition, so that both the consistent information and the complementary information among the prediction results of the algorithms can be used effectively.
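Weighted superposition of the sparsified similarity matrices can be illustrated with a short sketch. It assumes non-negative weights normalized to sum to one and treats them as fixed inputs, whereas in the described model the weights are learned during optimization; the function name `integrate_similarities` is hypothetical.

```python
import numpy as np

def integrate_similarities(sparse_sims, weights=None):
    """Weighted sum of M same-shaped similarity matrices (one per algorithm)."""
    sims = np.stack([np.asarray(S, dtype=float) for S in sparse_sims])  # (M, n, n)
    M = sims.shape[0]
    if weights is None:
        weights = np.full(M, 1.0 / M)  # uniform weights as a starting point
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (M,) or np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("need one non-negative weight per algorithm")
    weights = weights / weights.sum()
    # Contract the algorithm axis: G = sum_l w_l * S_l.
    return np.tensordot(weights, sims, axes=1)

G_demo = integrate_similarities([np.eye(2), np.ones((2, 2))], weights=[0.25, 0.75])
```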
Further, in step S3, the low-dimensional features $F_A$ and $F_B$ are extracted from the original similarity matrices $S_A$ and $S_B$ by the following formula:
[formula image not available]
where the dimensions of the low-dimensional features are set to $f_A$ ($f_A < m$) and $f_B$ ($f_B < n$).
Specifically, the low-dimensional features obtained in this step supply the matrix factorization model of the next step with the information in the original similarity matrices in a compact form.
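One common way to obtain such features from a similarity matrix is a truncated singular value decomposition. The sketch below assumes the features are the leading left singular vectors scaled by the square roots of their singular values; this scaling, like the function name `svd_features`, is an assumption and may not match the patent's formula.

```python
import numpy as np

def svd_features(S, f):
    """Return an (n, f) feature matrix from the top-f singular triplets of S."""
    S = np.asarray(S, dtype=float)
    if not 0 < f < S.shape[0]:
        raise ValueError("f must satisfy 0 < f < number of entities")
    left, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Scale each retained singular vector by the square root of its singular value.
    return left[:, :f] * np.sqrt(sigma[:f])

# Toy similarity matrix of rank 2, so two features reproduce it exactly.
B_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
S_A_toy = B_demo @ B_demo.T
F_A_toy = svd_features(S_A_toy, f=2)
```

For a symmetric positive semi-definite similarity matrix, `F @ F.T` is then the best rank-`f` approximation of the matrix, which is why the toy rank-2 example is recovered exactly.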
As a preferred embodiment, the integrated similarity matrices and the low-dimensional features may be used to construct two low-dimensional feature spaces, one for each of the two classes of biomedical entities A and B, and to find the essential links between the features in the two spaces. On the basis of logistic matrix factorization, the adaptive weighted ensemble matrix factorization model is expressed by the following formula:
[formula image not available]
where $G_{AS} F_A$ constitutes the feature representation of the class A biological entities, $G_{BS} F_B$ constitutes the feature representation of the class B biological entities, $U$ and $V$ denote the embedding matrices that project the class A and class B biological entities into a shared $k$-dimensional space ($k \le \min(f_A, f_B)$), $u_i$ denotes the $i$-th row vector of $U$, and $v_j$ denotes the $j$-th row vector of $V$.
Specifically, since the elements of the association matrix $Y$ take only the two values 0 and 1, the present embodiment fits the observed data with a Bernoulli distribution. In biomedical network association prediction, positive examples are verified biomedical association pairs, whereas negative examples are pairs whose association status is unknown, so positive examples are more reliable than negative examples. To emphasize the role of the positive examples during training, each positive example can be treated as $c$ ($c > 1$) positive examples. The model sets this parameter to $c = 5$ by default. Assuming that the training samples are independent, the conditional probability of the observed data is:
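The formula itself is not available in this text. As a hedged reconstruction, a Bernoulli likelihood in which each positive example counts $c$ times is conventionally written as below. The symbol $p_{ij}$ is an assumption introduced here: it stands for the probability that the model assigns to an association between $a_i$ and $b_j$, and how it depends on $U$ and $V$ is left to the model definition.

```latex
p(Y \mid U, V) \;=\; \prod_{i=1}^{m} \prod_{j=1}^{n}
  p_{ij}^{\,c\,Y_{ij}} \,\bigl(1 - p_{ij}\bigr)^{1 - Y_{ij}}
```

With the default $c = 5$, a verified pair ($Y_{ij} = 1$) contributes the factor $p_{ij}^{5}$, while an unknown pair ($Y_{ij} = 0$) contributes $1 - p_{ij}$.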
Assuming that $U$ and $V$ each follow a zero-mean Gaussian prior:
[formula image not available]
where $I$ is the identity matrix. By Bayesian inference, the posterior probability of the model parameters $U$ and $V$ is obtained as follows:
[formula image not available]
Subsequently, maximizing the log posterior probability is equivalent to minimizing the following objective function:
[formula image not available]
To adjust the weights among the algorithms adaptively, the present embodiment also adds the two sets of algorithm weights to the objective function as optimization variables. It further introduces an entropy regularization term into the objective function to control the distribution of the weights and prevent them from overfitting to a particular algorithm.
Therefore, after rearrangement, in step S4 the adaptive weighted ensemble matrix factorization model is trained and optimized by solving the following objective function:
[formula image not available]
where $M$ is the number of biomedical network association prediction algorithms.
Further, in each iterative update of step S4, the optimization variables, namely $U$, $V$ and the two sets of algorithm weights, are updated alternately in turn.
Specifically, when updating the optimization variables, the present embodiment fixes the other three variables while solving for each variable, and optimizes the four variables one after another in this way.
Further, in step S4, when the optimization variables are updated alternately in turn, the partial derivatives of the objective function $L$ with respect to the variables $U$ and $V$ are expressed as follows:
[formula image not available]
where $\odot$ denotes the Hadamard product between matrices. Specifically, the variables $U$ and $V$ can be updated with the adaptive gradient descent optimizer Adagrad.
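The alternating Adagrad updates of $U$ and $V$ can be sketched on a simplified objective. The code below is an assumption-laden illustration, not the patent's objective: it uses plain logistic matrix factorization with positive weight `c` and an L2 penalty `lam`, and it omits the integrated similarity features and the algorithm weights. The function names, the learning rate and the toy data are hypothetical.

```python
import numpy as np

def lmf_objective(Y, U, V, c=5.0, lam=0.1):
    """Regularized negative log-likelihood of plain logistic matrix factorization."""
    scores = U @ V.T
    weight = 1.0 + (c - 1.0) * Y  # c for known associations, 1 for unknown pairs
    # np.logaddexp(0, s) evaluates log(1 + exp(s)) without overflow.
    nll = np.sum(weight * np.logaddexp(0.0, scores) - c * Y * scores)
    return nll + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

def lmf_gradients(Y, U, V, c=5.0, lam=0.1):
    """Partial derivatives of lmf_objective with respect to U and V."""
    P = 1.0 / (1.0 + np.exp(-(U @ V.T)))  # predicted association probabilities
    weight = 1.0 + (c - 1.0) * Y
    R = weight * P - c * Y                # "*" is the element-wise (Hadamard) product
    return R @ V + lam * U, R.T @ U + lam * V

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    """In-place Adagrad update: per-coordinate step scaled by past gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)

rng = np.random.default_rng(0)
Y_demo = (rng.random((6, 5)) < 0.3).astype(float)
U = 0.1 * rng.standard_normal((6, 2))
V = 0.1 * rng.standard_normal((5, 2))
acc_U, acc_V = np.zeros_like(U), np.zeros_like(V)
losses = [lmf_objective(Y_demo, U, V)]
for _ in range(50):
    grad_U, _unused = lmf_gradients(Y_demo, U, V)   # update U with V fixed
    adagrad_step(U, grad_U, acc_U)
    _unused, grad_V = lmf_gradients(Y_demo, U, V)   # update V with U fixed
    adagrad_step(V, grad_V, acc_V)
    losses.append(lmf_objective(Y_demo, U, V))
```

Recomputing the gradient after each variable is updated mirrors the alternating scheme described above, in which one variable is solved while the others are held fixed.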
Further, in step S4, the convergence condition for training and optimizing the adaptive weighted ensemble matrix factorization model is:
[formula image not available]
where $L^{(k)}$ denotes the value of the objective function at the $k$-th iteration.
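A stopping rule based on successive objective values is often implemented as a relative-change test. The sketch below assumes that form; the exact criterion and threshold of the patent are not available here, so the function name `has_converged` and the tolerance `tol` are hypothetical.

```python
def has_converged(loss_history, tol=1e-5):
    """True when the relative change between the last two objective values is below tol."""
    if len(loss_history) < 2:
        return False
    prev, curr = loss_history[-2], loss_history[-1]
    denom = max(abs(prev), 1e-12)  # avoid division by zero
    return abs(curr - prev) / denom < tol

still_running = has_converged([10.0, 9.0])
finished = has_converged([10.0, 10.0 - 1e-7])
```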
The biomedical network association prediction method based on ensemble learning of the present embodiment is described below with reference to specific evaluation experiments:
First, this embodiment performs evaluation experiments on two different types of biomedical association datasets. The Enzyme dataset is a biomedical association dataset describing drug-target interactions; its two classes of biomedical entities are drugs and protein targets. It contains 445 drugs, 664 protein targets and 2926 drug-target associations, and provides not only the known drug-target associations but also a drug structural similarity matrix and a protein sequence similarity matrix. The association information in the Enzyme dataset is drawn from four databases: KEGG BRITE, BRENDA, SuperTarget and DrugBank. The LncRNADisease2015 dataset is a biomedical association dataset describing interactions between lncRNAs and diseases; its two classes of biomedical entities are lncRNAs and diseases. It contains 285 lncRNAs, 226 diseases and 621 lncRNA-disease associations, and provides both lncRNA similarity and disease functional similarity. It was obtained by searching the 2015 version of the LncRNADisease database and filtering out duplicate lncRNA-disease association records.
The biomedical network association prediction method based on ensemble learning requires the prediction results of a plurality of algorithms as input. The evaluation experiment of this embodiment uses seven algorithms applied to biomedical association prediction tasks: GRMF, NRLMF, KBMF, CMF, SIMCLDA, BLMNII and NetLapRLS. In the experiment, the prediction results of the seven algorithms are integrated to produce a combined prediction, and the parameters of each algorithm follow the default settings in its original paper.
Regarding the validation method and the evaluation index, the evaluation experiment of this embodiment adopts ten-fold cross-validation: the drug-target association matrix Y is evenly divided into ten mutually disjoint subsets; each subset is taken in turn as the test set and the remaining subsets as the training set; the elements of Y belonging to the test set are set to 0 while the elements belonging to the training set are kept unchanged, thereby constructing the training data for each fold. In each fold, the model takes the training data as input and produces the corresponding prediction probability matrix. To evaluate whether the predicted values of the test-set samples in the prediction probability matrix agree with their labels in the known association matrix Y, AUC is selected as the evaluation index; it is calculated as follows:
Calculation of the evaluation index AUC. For a binary classification problem, samples can be divided into positive and negative examples, with the labels "1" and "0" generally representing positive and negative samples, respectively. After classification prediction, four cases arise:
(1) If a sample is positive and is predicted to be positive, it is classified as a true positive (TP);
(2) If a sample is positive but is predicted to be negative, it is classified as a false negative (FN);
(3) If a sample is negative but is predicted to be positive, it is classified as a false positive (FP);
(4) If a sample is negative and is predicted to be negative, it is classified as a true negative (TN).
True positive rate: TPR = TP / (TP + FN);
False positive rate: FPR = FP / (FP + TN);
The ROC curve is drawn with TPR as the y axis and FPR as the x axis, and the AUC value is the area under the ROC curve, i.e., the area enclosed between the curve and the x axis. The larger the AUC value, the better the predictive performance of the classifier.
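The AUC calculation described above can be sketched as follows. This is a minimal illustration, assuming that every distinct predicted score is used as a decision threshold and that the area is obtained with the trapezoidal rule; the function name `roc_auc` is hypothetical and does not come from the original experiments.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve, built from TPR = TP/(TP+FN) and FPR = FP/(FP+TN)."""
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Start the curve at (FPR, TPR) = (0, 0), then lower the threshold step by step
    tpr, fpr = [0.0], [0.0]
    for t in np.unique(scores)[::-1]:
        pred_pos = scores >= t                       # samples predicted positive at threshold t
        tp = np.sum(pred_pos & (labels == 1))
        fp = np.sum(pred_pos & (labels == 0))
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    tpr, fpr = np.array(tpr), np.array(fpr)
    # Trapezoidal rule over consecutive ROC points
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```

In the cross-validation setting, `labels` would be the test-set entries of the known association matrix Y and `scores` the corresponding entries of the prediction probability matrix.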
To account for the effects of random initialization of the optimization variables and of the dataset partitioning, the experiments were repeated five times with different random seeds.
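The ten-fold cross-validation protocol with repeated random seeds can be sketched as below. This is a sketch under one stated assumption: all entries of the association matrix Y (known associations and unknown pairs alike) are shuffled and split into folds, which the text does not specify in detail. The name `cross_validation_folds` is hypothetical.

```python
import numpy as np

def cross_validation_folds(Y, n_folds=10, seed=0):
    """Yield (Y_train, test_idx) pairs; entries of Y in the test fold are set to 0.

    Assumption: folds are drawn over all entries of Y, addressed by flat index.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(Y.size)                 # shuffled flat indices of Y
    for test_idx in np.array_split(perm, n_folds):
        Y_train = Y.copy()                         # training-set entries are kept unchanged
        Y_train.flat[test_idx] = 0                 # test-set entries are hidden
        yield Y_train, test_idx

# Usage: five repetitions with different seeds, ten folds each
Y_demo = (np.arange(40).reshape(5, 8) % 3 == 0).astype(float)
n_runs = sum(1 for seed in range(5) for _ in cross_validation_folds(Y_demo, seed=seed))
```

Each yielded `Y_train` would be passed to the model, and the predictions at `test_idx` compared with the original entries of Y.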
Regarding parameter setting and effect evaluation, some parameters of the adaptive weighted integrated matrix decomposition model are set to empirical values: the training weight of positive samples c = 5, the number of KNN neighbors is set to 30, and the dimensions of the extracted low-dimensional features are f_A = 100 and f_B = 100. For the hyperparameters of the model, namely the matrix decomposition dimension k, the L2 regularization coefficient λ, and the entropy regularization coefficient λ_w, the evaluation experiment of this embodiment finds the optimal values by grid search, with the search ranges set to k ∈ {10, 20, 30, 40, 50}, λ ∈ {2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3}, and λ_w ∈ {2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3}. The evaluation experiment of this embodiment performs a parameter sensitivity analysis of these three hyperparameters on the LncRNADisease2015 dataset. As can be seen from Fig. 4, a lower matrix decomposition dimension enables the model to achieve better predictive performance. As can be seen from Fig. 5, the predictive performance of the model is best when the L2 regularization coefficient λ = 1. For the entropy regularization coefficient λ_w, the higher the value, the more uneven the weights among the algorithms: stronger algorithms receive larger weights, while weaker algorithms receive smaller weights. The experimental results in Fig. 6 show that appropriately increasing λ_w gives the model a higher AUC score, which also indicates that, compared with directly setting the weights to the average weight, the adaptive weight design in the model can effectively improve its predictive performance.
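The grid search over the three hyperparameters can be sketched as follows. The grids are those stated above; the `evaluate` callback, which is assumed to train the model and return a mean cross-validated AUC, and the function name `grid_search` are hypothetical placeholders rather than part of the original experiments.

```python
from itertools import product

K_GRID = [10, 20, 30, 40, 50]                      # matrix decomposition dimension k
LAMBDA_GRID = [2.0 ** e for e in range(-3, 4)]     # L2 regularization coefficient, 2^-3 .. 2^3
LAMBDA_W_GRID = [2.0 ** e for e in range(-3, 4)]   # entropy regularization coefficient, 2^-3 .. 2^3

def grid_search(evaluate):
    """Try every (k, lam, lam_w) combination and return (best_score, best_params).

    `evaluate(k, lam, lam_w)` is a hypothetical callback returning a score to maximize.
    """
    best_score, best_params = float("-inf"), None
    for k, lam, lam_w in product(K_GRID, LAMBDA_GRID, LAMBDA_W_GRID):
        score = evaluate(k, lam, lam_w)
        if score > best_score:
            best_score, best_params = score, (k, lam, lam_w)
    return best_score, best_params
```

With these grids the search visits 5 × 7 × 7 = 245 combinations, each of which would require a full cross-validation run.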
Five repetitions of ten-fold cross-validation were carried out on the Enzyme and LncRNADisease2015 datasets. The integrated matrix decomposition method under its optimal parameters (EnsembleMF) was compared with all methods participating in the integration, such as GRMF, with AUC used uniformly as the evaluation index. The experimental results are as follows:
Method | Enzyme | LncRNADisease2015 |
---|---|---|
GRMF | 0.9655±0.002587 | 0.755601±0.008869 |
NRLMF | 0.976221±0.001731 | 0.787295±0.008070 |
KBMF | 0.89816±0.002788 | 0.776094±0.012368 |
CMF | 0.92171±0.011913 | 0.719111±0.017388 |
SIMCLDA | 0.791411±0.004790 | 0.835838±0.008980 |
BLMNII | 0.965645±0.006894 | 0.710114±0.025129 |
NetLapRLS | 0.950058±0.002805 | 0.781431±0.009761 |
EnsembleMF | 0.980072±0.001393 | 0.880177±0.004781 |
It can be seen that, on both the Enzyme and LncRNADisease2015 datasets, the AUC score of the integrated method is higher than that of every method participating in the integration. The biomedical network association prediction method based on ensemble learning can therefore effectively combine the advantages of different algorithms for association prediction.
It should be understood that the above embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations or modifications in different forms can be made by those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (10)
1. A biomedical network association prediction method based on ensemble learning, characterized by comprising the following steps:
S1, acquiring the original similarity matrices and the association matrix of the two types of biological entities to be predicted; selecting M existing biomedical network association prediction algorithms, and performing association prediction on the biological entities according to the original similarity matrices and the association matrix to obtain the prediction result of each algorithm;
S2, calculating the prediction similarity matrices of the biological entities from the prediction result of each algorithm; after sparsifying the prediction similarity matrices, combining the prediction results of all algorithms and calculating the integrated similarity matrices of the biological entities by weighted superposition;
S3, extracting low-dimensional features from the original similarity matrices by singular value decomposition, and constructing an adaptive weighted integrated matrix decomposition model in combination with the integrated similarity matrices;
S4, training and optimizing the adaptive weighted integrated matrix decomposition model until the model converges;
S5, reconstructing a prediction matrix with the converged adaptive weighted integrated matrix decomposition model as the final result of the biological entity association prediction.
2. The ensemble learning-based biomedical network association prediction method according to claim 1, wherein in step S2 the prediction similarity matrices of the biological entities are calculated by the following formula:
where A = {a_1, a_2, …, a_m} and B = {b_1, b_2, …, b_n} respectively denote the sets of biological entities; Y^(l)(a_i) and Y^(l)(b_i) respectively denote the a_i-th row vector and the b_i-th column vector of the prediction result Y^(l) of the l-th algorithm; the parameters controlling the bandwidth of the function are set as and
3. The ensemble learning-based biomedical network association prediction method according to claim 2, wherein in step S2 the prediction similarity matrices are sparsified by the following formula:
where N(a_i) denotes the neighbor set of a_i, and N(b_i) denotes the neighbor set of b_i.
4. The ensemble learning-based biomedical network association prediction method according to claim 3, wherein in step S2 the integrated similarity matrices G_A^S and G_B^S of the biological entities are calculated by the following formula:
5. The ensemble learning-based biomedical network association prediction method according to claim 4, wherein in step S3 the low-dimensional features F_A and F_B are extracted from the original similarity matrices S_A and S_B by the following formula:
where the dimensions of the low-dimensional features are set to f_A and f_B, with f_A < m and f_B < n.
6. The ensemble learning-based biomedical network association prediction method according to claim 5, wherein the adaptive weighted integrated matrix decomposition model is expressed as follows:
where G_A^S F_A constitutes the feature representation of the class A biological entities and G_B^S F_B constitutes the feature representation of the class B biological entities; U and V denote the embedding matrices that project the two classes of biological entities A and B into a shared k-dimensional space, with k ≤ min(f_A, f_B); u_i denotes the i-th row vector of U, and v_j denotes the j-th row vector of V.
7. The ensemble learning-based biomedical network association prediction method according to claim 6, wherein in step S4 the training and optimization of the adaptive weighted integrated matrix decomposition model is achieved by solving the following objective function:
wherein M is the number of biomedical network association prediction algorithms.
9. The ensemble learning-based biomedical network association prediction method according to claim 8, wherein in step S4 the optimization variables U and V are updated alternately in sequence. The partial derivatives of the objective function L with respect to the variables U and V are expressed as follows:
where ⊙ denotes the Hadamard product operator between matrices;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110236007.3A CN112951320B (en) | 2021-03-03 | 2021-03-03 | Biomedical network association prediction method based on ensemble learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112951320A (en) | 2021-06-11 |
CN112951320B (en) | 2023-05-16 |
Family
ID=76247425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110236007.3A Active CN112951320B (en) | 2021-03-03 | 2021-03-03 | Biomedical network association prediction method based on ensemble learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112951320B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115295072B (en) * | 2022-10-10 | 2023-01-24 | 山东大学 | Protein interaction site prediction method and system based on graph neural network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036553A (en) * | 2018-08-01 | 2018-12-18 | 北京理工大学 | A kind of disease forecasting method based on automatic extraction Medical Technologist's knowledge |
CN109243538A (en) * | 2018-07-19 | 2019-01-18 | 长沙学院 | A kind of method and system of predictive disease and LncRNA incidence relation |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11037684B2 (en) * | 2014-11-14 | 2021-06-15 | International Business Machines Corporation | Generating drug repositioning hypotheses based on integrating multiple aspects of drug similarity and disease similarity |
CN107862179A (en) * | 2017-11-06 | 2018-03-30 | 中南大学 | A kind of miRNA disease association Relationship Prediction methods decomposed based on similitude and logic matrix |
CN109920476A (en) * | 2019-01-30 | 2019-06-21 | 中国矿业大学 | The disease associated prediction technique of miRNA- based on chaos game playing algorithm |
CN110993121A (en) * | 2019-12-06 | 2020-04-10 | 南开大学 | Drug association prediction method based on double-cooperation linear manifold |
CN111681705B (en) * | 2020-05-21 | 2024-05-24 | 中国科学院深圳先进技术研究院 | MiRNA-disease association prediction method, system, terminal and storage medium |
CN112183837A (en) * | 2020-09-22 | 2021-01-05 | 曲阜师范大学 | miRNA and disease association relation prediction method based on self-coding model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||