CN112951320B - Biomedical network association prediction method based on ensemble learning - Google Patents

Info

Publication number
CN112951320B
Authority
CN
China
Prior art keywords
prediction
matrix
biomedical
network association
association
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110236007.3A
Other languages
Chinese (zh)
Other versions
CN112951320A (en
Inventor
欧阳乐
卢帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202110236007.3A priority Critical patent/CN112951320B/en
Publication of CN112951320A publication Critical patent/CN112951320A/en
Application granted granted Critical
Publication of CN112951320B publication Critical patent/CN112951320B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

To address the limitations of the prior art, the invention provides a biomedical network association prediction method based on ensemble learning. The method draws on the prediction results of multiple algorithms to extract the strong links within each of two classes of biomedical entities and to construct corresponding low-dimensional features. A matrix decomposition model then learns the relationship between the two sets of low-dimensional features so as to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of low-dimensional features. The method overcomes the limitations of a single method, combines the perspectives of multiple algorithms, and provides more accurate and more robust prediction results.

Description

Biomedical network association prediction method based on ensemble learning
Technical Field
The invention relates to the technical field of computational biology, in particular to biological data mining, and more particularly relates to a biomedical network association prediction method based on ensemble learning.
Background
Complex diseases such as cancer often arise not from the dysregulation or mutation of individual biomolecules but from the dysfunction of regulatory networks formed by interactions among biomolecules. During the onset and progression of a disease, some biomolecules behave abnormally, and identifying the abnormal biomolecules most strongly correlated with the disease is very helpful for its prevention, diagnosis, and treatment. In recent years, many studies have verified associations between different types of biological entities through biological experiments, such as associations of drugs with protein targets, of diseases with microRNAs, and of diseases with long non-coding RNAs (lncRNAs). However, identifying new biomedical associations by biological experiments is time-consuming and costly. In computer science, such problems can be abstracted as association prediction on a bipartite network; a conceptual diagram of the biomedical bipartite (binary) network is shown in FIG. 1. Predicting potential associations by computational methods, and thereby providing references and suggestions for biological experiments, therefore helps to improve the efficiency and reduce the cost of identifying biomedical associations.
Over the last decade, a variety of computational methods have been applied to biomedical network association prediction. By principle, they fall roughly into three categories: network diffusion models, feature-based classification methods, and matrix decomposition-based methods. Network diffusion models mainly use graph-based methods to propagate the associations through the biomedical network and thereby predict potential associations. Feature-based classification methods represent each association by the features of the two nodes it connects and feed these representations into a machine learning model for training. Matrix decomposition-based methods learn two or more low-dimensional factor matrices from the biomedical association matrix and multiply them to reconstruct the association matrix. However, biomedical association networks vary widely in type, and the assumptions of any single prediction method may not accurately characterize all data.
Chinese invention patent CN110993113A, published on 2020.04.10 and entitled "Method and system for predicting lncRNA-disease relationships based on MF-SDAE", attempts to provide a rapid and effective scheme by extracting multiple features of lncRNAs and multiple features of diseases from several lncRNA databases and several disease databases, but that scheme still has certain limitations.
Disclosure of Invention
Aiming at the limitation of the prior art, the invention provides a biomedical network association prediction method based on ensemble learning, which adopts the following technical scheme:
A biomedical network association prediction method based on ensemble learning comprises the following steps:
S1, acquiring the original similarity matrices and the association matrix of the two classes of biological entities to be predicted; applying each of a plurality of biomedical network association prediction algorithms to perform association prediction on the biological entities from the original similarity matrices and the association matrix, thereby obtaining the prediction result of each algorithm;
S2, calculating prediction similarity matrices of the biological entities from the prediction result of each algorithm; after sparsifying the prediction similarity matrices, combining the results of all the algorithms by weighted superposition to obtain the integrated similarity matrices of the biological entities;
S3, extracting low-dimensional features from the original similarity matrices by singular value decomposition, and constructing an adaptive weighted integration matrix decomposition model in combination with the integrated similarity matrices;
S4, training and optimizing the adaptive weighted integration matrix decomposition model until the model converges;
S5, reconstructing a prediction matrix with the converged adaptive weighted integration matrix decomposition model as the final result of the biological entity association prediction.
Compared with the prior art, the method draws on the prediction results of multiple algorithms to extract the strong links within each of the two classes of biomedical entities and to construct corresponding low-dimensional features. A matrix decomposition model then learns the relationship between the two sets of low-dimensional features so as to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of low-dimensional features. The method overcomes the limitations of a single method, combines the perspectives of multiple algorithms, and provides more accurate and more robust prediction results.
As a preferred embodiment, in the step S2, the predicted similarity matrix of the biological entity is obtained by calculation according to the following formula
Figure GDA0004054347040000021
and
Figure GDA0004054347040000022
Figure GDA0004054347040000023
Figure GDA0004054347040000024
where $A=\{a_1,a_2,\dots,a_m\}$ and $B=\{b_1,b_2,\dots,b_n\}$ denote the two sets of biological entities; $Y^{(l)}(a_i)$ and $Y^{(l)}(b_i)$ denote, respectively, the row vector of entity $a_i$ and the column vector of entity $b_i$ in the prediction result $Y^{(l)}$ of the $l$-th algorithm; and the parameters controlling the bandwidth of the function are set as
Figure GDA0004054347040000031
and
Figure GDA0004054347040000032
Further, in the step S2, the prediction similarity matrix is sparsified by the following formulas:
Figure GDA0004054347040000033
Figure GDA0004054347040000034
where $N(a_i)$ denotes the neighbor set of $a_i$ and $N(b_i)$ denotes the neighbor set of $b_i$.
Further, in the step S2, the integrated similarity matrices $G_{AS}$ and $G_{BS}$ of the biological entities are calculated according to the following formulas:
Figure GDA0004054347040000035
Figure GDA0004054347040000036
where
Figure GDA0004054347040000037
and
Figure GDA0004054347040000038
are the adaptively learned weights.
Further, in the step S3, the low-dimensional features $F_A$ and $F_B$ are extracted from the original similarity matrices $S_A$ and $S_B$ by the following formulas:
Figure GDA0004054347040000039
Figure GDA00040543470400000310
where the dimensions of the low-dimensional features are set to $f_A$ ($f_A<m$) and $f_B$ ($f_B<n$).
As a preferred scheme, the adaptive weighted integration matrix decomposition model is expressed by the following formula:
Figure GDA0004054347040000041
where $G_{AS}F_A$ forms the feature representation of the class A biological entities and $G_{BS}F_B$ forms the feature representation of the class B biological entities;
Figure GDA0004054347040000042
and
Figure GDA0004054347040000043
are the embedding matrices that project the two classes of biological entities A and B into a shared $k$-dimensional space ($k\le\min(f_A,f_B)$); $u_i$ denotes the $i$-th row vector of $U$, and $v_j$ denotes the $j$-th row vector of $V$.
Further, in the step S4, training and optimizing the adaptive weighted integration matrix decomposition model is implemented by solving the following objective function:
Figure GDA0004054347040000044
Figure GDA0004054347040000045
Figure GDA0004054347040000046
wherein M is the number of biomedical network association prediction algorithms.
Further, in the process of each iteration update in the step S4, the optimization variables U, V are updated alternately in turn,
Figure GDA0004054347040000047
Further, in the step S4, the optimization variables U, V are updated alternately in sequence,
Figure GDA0004054347040000048
The partial derivatives of the objective function L with respect to the variables U and V are expressed as follows:
Figure GDA0004054347040000051
Figure GDA0004054347040000052
where $\odot$ denotes the Hadamard product operator between matrices; the update formulas of
Figure GDA0004054347040000053
are as follows:
Figure GDA0004054347040000054
Figure GDA0004054347040000055
further, in the step S4, the convergence condition for training and optimizing the adaptive weighted integration matrix decomposition model is:
Figure GDA0004054347040000056
where $L^{(k)}$ denotes the value of the objective function at the $k$-th iteration.
Drawings
FIG. 1 is a conceptual diagram of a biomedical binary network;
FIG. 2 is a flowchart of steps of a biomedical network association prediction method based on ensemble learning according to an embodiment of the present invention;
FIG. 3 is a logic diagram of a biomedical network association prediction method based on ensemble learning according to an embodiment of the present invention;
FIG. 4 is a graph of AUC over lncRNADisease2015 dataset as a function of parameter k in an evaluation experiment in accordance with an embodiment of the present invention;
FIG. 5 is a graph of AUC over lncRNADisease2015 dataset as a function of parameter λ in an evaluation experiment in accordance with an embodiment of the present invention;
FIG. 6 is a graph of AUC over lncRNADisease2015 dataset as a function of parameter $\lambda_w$ in an evaluation experiment in accordance with an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the embodiments of the present application, are within the scope of the embodiments of the present application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims. In the description of this application, it should be understood that the terms "first," "second," "third," and the like are used merely to distinguish between similar objects and are not necessarily used to describe a particular order or sequence, nor should they be construed to indicate or imply relative importance. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art as the case may be.
Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates that the objects before and after it are in an "or" relationship. The invention is further illustrated in the following figures and examples.
In order to solve the limitations of the prior art, the present embodiment provides a technical solution, and the technical solution of the present invention is further described below with reference to the drawings and the embodiments.
Referring to FIG. 2, the biomedical network association prediction method based on ensemble learning includes the following steps:
S1, acquiring the original similarity matrices and the association matrix of the two classes of biological entities to be predicted; applying each of a plurality of biomedical network association prediction algorithms to perform association prediction on the biological entities from the original similarity matrices and the association matrix, thereby obtaining the prediction result of each algorithm;
S2, calculating prediction similarity matrices of the biological entities from the prediction result of each algorithm; after sparsifying the prediction similarity matrices, combining the results of all the algorithms by weighted superposition to obtain the integrated similarity matrices of the biological entities;
S3, extracting low-dimensional features from the original similarity matrices by singular value decomposition, and constructing an adaptive weighted integration matrix decomposition model in combination with the integrated similarity matrices;
S4, training and optimizing the adaptive weighted integration matrix decomposition model until the model converges;
S5, reconstructing a prediction matrix with the converged adaptive weighted integration matrix decomposition model as the final result of the biological entity association prediction.
Compared with the prior art, the method draws on the prediction results of multiple algorithms to extract the strong links within each of the two classes of biomedical entities and to construct corresponding low-dimensional features. A matrix decomposition model then learns the relationship between the two sets of low-dimensional features so as to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of low-dimensional features. The method overcomes the limitations of a single method, combines the perspectives of multiple algorithms, and provides more accurate and more robust prediction results.
Specifically, similar techniques such as MF-SDAE also integrate data from multiple sources and also employ matrix decomposition, but they differ from the present method in two respects. In integrating multi-source data, MF-SDAE simply stacks the feature matrices, whereas the present method uses adaptive weighted superposition: the weights among the algorithms are adjusted automatically during model optimization, so the model adapts more flexibly to different data and the prediction is more robust. In addition, compared with the matrix decomposition used in MF-SDAE, the embodiment of the invention models the data with logistic matrix decomposition, which better matches the fact that biomedical association data takes only the binary values 0 and 1 and therefore describes such data more accurately.
In general, the study of associations between two classes of biological entities in a biomedical network (such as drugs, protein targets, and miRNAs) can use two types of nodes, $A=\{a_1,a_2,\dots,a_m\}$ and $B=\{b_1,b_2,\dots,b_n\}$, to represent the two sets of biological entities. The association matrix $Y\in\{0,1\}^{m\times n}$ represents the known biomedical associations: $Y_{ij}=1$ indicates that an association exists between $a_i$ and $b_j$, and $Y_{ij}=0$ indicates that the association between $a_i$ and $b_j$ is unknown. The purpose of a biomedical network association prediction algorithm is to identify, among the unknown pairs, those most likely to be associated. In biomedical association prediction tasks, the original similarity matrices of the biological entities,
Figure GDA0004054347040000081
and
Figure GDA0004054347040000082
are used as input in addition to the known association matrix. The prediction results calculated by $M$ algorithms can then be expressed as $\{Y^{(1)},Y^{(2)},\dots,Y^{(M)}\}$.
In the prediction matrix reconstructed in step S5, the larger the element values, the greater the probability that the corresponding biomedical association pair is a potential association.
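As a minimal illustrative sketch, and not part of the claimed method, the ranking described above can be written in Python with NumPy. The function name `top_candidate_pairs` and the toy matrices are hypothetical; the sketch only assumes that candidate pairs are those with $Y_{ij}=0$ and that a larger reconstructed value indicates a more likely association.

```python
import numpy as np

def top_candidate_pairs(pred, known, top_k=5):
    """Return the top_k (row, col, score) triples among pairs not yet known."""
    pred = np.asarray(pred, dtype=float)
    known = np.asarray(known).astype(bool)
    rows, cols = np.nonzero(~known)            # candidate pairs: Y_ij == 0
    scores = pred[rows, cols]
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(int(rows[t]), int(cols[t]), float(scores[t])) for t in order]

# Toy example: 2 entities of class A, 3 entities of class B.
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
P = np.array([[0.9, 0.2, 0.7],
              [0.4, 0.8, 0.1]])
ranked = top_candidate_pairs(P, Y, top_k=2)
print(ranked)
```

Known associations are excluded from the ranking, so only unverified pairs are proposed for experimental follow-up.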
As a preferred embodiment, in said step S2 a prediction similarity matrix for said biological entity is obtained by calculation of the following formula
Figure GDA0004054347040000083
and
Figure GDA0004054347040000084
Figure GDA0004054347040000085
Figure GDA0004054347040000086
where $A=\{a_1,a_2,\dots,a_m\}$ and $B=\{b_1,b_2,\dots,b_n\}$ denote the two sets of biological entities; $Y^{(l)}(a_i)$ and $Y^{(l)}(b_i)$ denote, respectively, the row vector of entity $a_i$ and the column vector of entity $b_i$ in the prediction result $Y^{(l)}$ of the $l$-th algorithm; and the parameters controlling the bandwidth of the function are set as
Figure GDA0004054347040000087
and
Figure GDA0004054347040000088
Specifically, the above step constructs, from the prediction result $Y^{(l)}$ of the $l$-th algorithm, prediction similarity matrices along the row and column directions, respectively:
Figure GDA0004054347040000089
and
Figure GDA00040543470400000810
The construction is the same as in the Gaussian interaction profile (GIP) method.
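The following is a minimal sketch of a Gaussian interaction profile kernel applied to one algorithm's prediction matrix. The exact formulas of this embodiment are given only as images, so the bandwidth normalization used here (the reciprocal of the mean squared profile norm, as is common for GIP kernels) is an assumption, and the function name `gip_similarity` is hypothetical.

```python
import numpy as np

def gip_similarity(profiles):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    X = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(X * X, axis=1)
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dist = np.maximum(sq_dist, 0.0)         # guard against round-off
    mean_sq_norm = sq_norms.mean()
    # ASSUMPTION: bandwidth = 1 / (mean squared profile norm).
    gamma = 1.0 / mean_sq_norm if mean_sq_norm > 0 else 1.0
    return np.exp(-gamma * sq_dist)

# Prediction result of one (hypothetical) algorithm: 3 x 3 score matrix.
Y_l = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.1],
                [0.0, 0.1, 0.9]])
S_A = gip_similarity(Y_l)      # row direction: similarity among class A entities
S_B = gip_similarity(Y_l.T)    # column direction: similarity among class B entities
```

Rows with similar predicted profiles receive a similarity close to 1, which is what allows the later steps to extract the strong links within each entity class.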
Further, in the step S2, the prediction similarity matrix is sparsified by the following formulas:
Figure GDA00040543470400000811
Figure GDA0004054347040000091
where $N(a_i)$ denotes the neighbor set of $a_i$ and $N(b_i)$ denotes the neighbor set of $b_i$.
Specifically, through the above step, each prediction similarity matrix is sparsified using K nearest neighbors, so that the strong links in the network are retained while weak links that may carry noise are filtered out.
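A minimal sketch of K-nearest-neighbor sparsification is given below. Because the sparsification formulas are available only as images, the rule used here (keep an entry when the column entity is among the K most similar neighbors of the row entity, and always keep the diagonal) is an assumption, and `knn_sparsify` is a hypothetical name.

```python
import numpy as np

def knn_sparsify(S, k):
    """Keep S[i, j] only if j is among the k most similar neighbours of i."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    masked = S.copy()
    np.fill_diagonal(masked, -np.inf)          # a node is not its own neighbour
    # Indices of the k largest off-diagonal entries in each row.
    neighbours = np.argsort(-masked, axis=1, kind="stable")[:, :k]
    keep = np.zeros((n, n), dtype=bool)
    keep[np.arange(n)[:, None], neighbours] = True
    np.fill_diagonal(keep, True)               # ASSUMPTION: self-similarity is kept
    return np.where(keep, S, 0.0)

S = np.array([[1.0, 0.8, 0.3, 0.1],
              [0.8, 1.0, 0.2, 0.4],
              [0.3, 0.2, 1.0, 0.9],
              [0.1, 0.4, 0.9, 1.0]])
S_sparse = knn_sparsify(S, k=1)
print(S_sparse)
```

Under this rule the result is not necessarily symmetric; whether and how the embodiment symmetrizes the matrix cannot be determined from the text.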
Further, in the step S2, the integrated similarity matrices $G_{AS}$ and $G_{BS}$ of the biological entities are calculated according to the following formulas:
Figure GDA0004054347040000092
Figure GDA0004054347040000093
where
Figure GDA0004054347040000094
and
Figure GDA0004054347040000095
are the adaptively learned weights.
Specifically, in the above steps, the integration is performed by adopting a weighted superposition manner, so that the consistent information and the complementary information between the prediction results of each algorithm can be effectively utilized.
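The weighted superposition can be sketched as a convex combination of the M sparsified similarity matrices. The exact formula is given only as an image; the constraint that the weights are non-negative and sum to one is an assumption suggested by the entropy regularization described later, and `integrate_similarities` is a hypothetical name.

```python
import numpy as np

def integrate_similarities(sims, weights):
    """Weighted superposition of M similarity matrices of identical shape."""
    stack = np.stack([np.asarray(S, dtype=float) for S in sims])   # (M, n, n)
    w = np.asarray(weights, dtype=float)
    if w.shape != (stack.shape[0],):
        raise ValueError("need exactly one weight per algorithm")
    # ASSUMPTION: weights lie on the probability simplex.
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return np.tensordot(w, stack, axes=1)      # sum_l w_l * S^(l)

S1 = np.eye(3)                 # sparsified similarity from algorithm 1
S2 = np.ones((3, 3))           # sparsified similarity from algorithm 2
G = integrate_similarities([S1, S2], [0.25, 0.75])
```

With fixed weights this is plain averaging; in the method described here the weights are themselves optimization variables.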
Further, in the step S3, the low-dimensional features $F_A$ and $F_B$ are extracted from the original similarity matrices $S_A$ and $S_B$ by the following formulas:
Figure GDA0004054347040000096
Figure GDA0004054347040000097
where the dimensions of the low-dimensional features are set to $f_A$ ($f_A<m$) and $f_B$ ($f_B<n$).
In particular, the low-dimensional features obtained in the above steps will provide information of the original similarity matrix for the matrix factorization model of the next step, while providing a compact representation.
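A minimal sketch of extracting low-dimensional features from a similarity matrix by truncated singular value decomposition follows. The extraction formulas are given only as images; scaling the leading left singular vectors by the square roots of the singular values is one common convention and is an assumption here, as is the name `svd_features`.

```python
import numpy as np

def svd_features(S, dim):
    """Return an (n, dim) feature matrix from the leading singular triplets of S."""
    S = np.asarray(S, dtype=float)
    if not 0 < dim <= min(S.shape):
        raise ValueError("dim must be between 1 and min(S.shape)")
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    # ASSUMPTION: features = leading left singular vectors scaled by sqrt(sigma).
    return U[:, :dim] * np.sqrt(sing[:dim])

S_A = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
F_A = svd_features(S_A, dim=2)   # f_A = 2 < m = 3
```

For a symmetric positive semidefinite similarity matrix, the product of these features with their transpose is the best rank-`dim` approximation of the matrix, which is the sense in which the representation is compact.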
As a preferred embodiment, the integrated similarity matrices and the low-dimensional features may be used to construct two low-dimensional feature spaces, one for each of the two classes of biomedical entities A and B, and to find the essential links between the features in the two spaces. On the basis of logistic matrix decomposition, the adaptive weighted integration matrix decomposition model is expressed by the following formula:
Figure GDA0004054347040000101
where $G_{AS}F_A$ forms the feature representation of the class A biological entities and $G_{BS}F_B$ forms the feature representation of the class B biological entities;
Figure GDA0004054347040000102
and
Figure GDA0004054347040000103
are the embedding matrices that project the two classes of biological entities A and B into a shared $k$-dimensional space ($k\le\min(f_A,f_B)$); $u_i$ denotes the $i$-th row vector of $U$, and $v_j$ denotes the $j$-th row vector of $V$.
Specifically, since the elements of the association matrix Y take only the two values 0 and 1, the present embodiment fits the observed data with a Bernoulli distribution. In the biomedical network association prediction problem, positive examples are verified biomedical association pairs, whereas negative examples are pairs whose association status is unknown, so positive examples are more reliable than negative examples. To emphasize the positive examples during training, each positive example is treated as c (c > 1) positive examples; the default value in the model is c = 5. Assuming that the training samples are independent, the conditional probability of the observed data is:
Figure GDA0004054347040000104
Assume that U and V each follow a zero-mean Gaussian prior:
Figure GDA0004054347040000105
where I is the identity matrix. By Bayesian inference, the posterior probability of the model parameters U and V can be obtained as follows:
Subsequently, by maximizing the log posterior probability, the following objective function can be equivalently established:
Figure GDA0004054347040000111
where
Figure GDA0004054347040000112
To adaptively adjust the weights among the algorithms, the present embodiment also adds the weights
Figure GDA0004054347040000113
and
Figure GDA0004054347040000114
to the objective function as optimization variables, and introduces an entropy regularization term into the objective function to control the distribution of the weights and prevent them from overfitting to a single algorithm.
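The effect of an entropy regularization term on the weights can be illustrated with a generic problem, which is not necessarily the exact objective of this embodiment: minimize $\sum_l w_l h_l + \lambda_w \sum_l w_l \log w_l$ subject to $\sum_l w_l = 1$ and $w_l \ge 0$, where $h_l$ is a hypothetical per-algorithm cost. This generic problem has the closed-form solution $w_l \propto \exp(-h_l/\lambda_w)$, sketched below; identifying $\lambda_w$ with the parameter studied in FIG. 6 is an assumption.

```python
import numpy as np

def entropy_regularized_weights(costs, lam_w):
    """Closed-form minimiser of sum(w*h) + lam_w * sum(w*log(w)) on the simplex."""
    costs = np.asarray(costs, dtype=float)
    if lam_w <= 0:
        raise ValueError("lam_w must be positive")
    z = -costs / lam_w
    z -= z.max()                               # numerical stability
    w = np.exp(z)
    return w / w.sum()

def regularized_cost(w, costs, lam_w):
    """Objective value for strictly positive weights w."""
    w = np.asarray(w, dtype=float)
    return float(np.dot(w, costs) + lam_w * np.sum(w * np.log(w)))

costs = np.array([1.0, 2.0, 4.0])              # hypothetical per-algorithm costs
w_sharp = entropy_regularized_weights(costs, lam_w=0.5)
w_flat = entropy_regularized_weights(costs, lam_w=50.0)
```

A small `lam_w` concentrates the weight on the lowest-cost algorithm, while a large `lam_w` pushes the weights toward uniform, which is how the regularizer keeps the ensemble from collapsing onto a single algorithm.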
Therefore, after further sorting, in the step S4, training and optimizing the adaptive weighted integration matrix decomposition model is achieved by solving the following objective function:
Figure GDA0004054347040000115
Figure GDA0004054347040000116
Figure GDA0004054347040000117
wherein M is the number of biomedical network association prediction algorithms.
Further, in the process of each iteration update in the step S4, the optimization variables U, V are updated alternately in turn,
Figure GDA0004054347040000118
Specifically, in the process of updating the optimized variables, the embodiment fixes three other variables when solving each variable, and sequentially performs optimization solving on 4 variables in this way.
Further, in the step S4, the optimization variables U, V are updated alternately in sequence,
Figure GDA0004054347040000121
The partial derivatives of the objective function L with respect to the variables U and V are expressed as follows:
Figure GDA0004054347040000122
Figure GDA0004054347040000123
where $\odot$ denotes the Hadamard product operator between matrices; specifically, the variables U and V can be updated with the adaptive gradient descent optimizer Adagrad; the update formulas of
Figure GDA0004054347040000124
are as follows:
Figure GDA0004054347040000125
Figure GDA0004054347040000126
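The alternating Adagrad updates of U and V can be sketched on a simplified logistic matrix decomposition objective. This is an illustration under stated assumptions rather than the objective of this embodiment, whose formulas are given only as images: the feature matrices `XA` and `XB` stand in for $G_{AS}F_A$ and $G_{BS}F_B$ and are held fixed, the weight updates are omitted, and the regularization coefficient `lam`, the learning rate `lr`, and all function names are hypothetical. The positive-example weight c = 5 follows the text.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))      # numerically stable logistic function

def loss_and_grads(Y, XA, XB, U, V, c=5.0, lam=0.1):
    """Weighted logistic loss with L2 penalty, and its gradients w.r.t. U and V."""
    A, B = XA @ U, XB @ V                      # embeddings, m x k and n x k
    Z = A @ B.T                                # logits for every (a_i, b_j) pair
    W = 1.0 + (c - 1.0) * Y                    # each positive counts c times
    loss = np.sum(W * np.logaddexp(0.0, Z) - c * Y * Z)
    loss += 0.5 * lam * (np.sum(U * U) + np.sum(V * V))
    R = W * sigmoid(Z) - c * Y                 # dLoss/dZ; '*' is the Hadamard product
    grad_U = XA.T @ R @ B + lam * U
    grad_V = XB.T @ R.T @ A + lam * V
    return float(loss), grad_U, grad_V

def adagrad_fit(Y, XA, XB, k=2, steps=300, lr=0.1, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((XA.shape[1], k))
    V = 0.1 * rng.standard_normal((XB.shape[1], k))
    hist_U, hist_V = np.zeros_like(U), np.zeros_like(V)   # accumulated squared grads
    history = []
    for _ in range(steps):
        loss, grad_U, _ = loss_and_grads(Y, XA, XB, U, V)
        history.append(loss)
        hist_U += grad_U * grad_U
        U -= lr * grad_U / (np.sqrt(hist_U) + eps)        # update U with V fixed
        _, _, grad_V = loss_and_grads(Y, XA, XB, U, V)
        hist_V += grad_V * grad_V
        V -= lr * grad_V / (np.sqrt(hist_V) + eps)        # update V with U fixed
    return U, V, history

# Toy association matrix with two blocks; identity features for simplicity.
Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
U, V, history = adagrad_fit(Y, np.eye(4), np.eye(4))
P = sigmoid((np.eye(4) @ U) @ (np.eye(4) @ V).T)          # reconstructed probabilities
```

Adagrad divides each coordinate's step by the root of its accumulated squared gradients, so frequently updated coordinates take smaller steps over time without a hand-tuned schedule.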
Further, in the step S4, the convergence condition for training and optimizing the adaptive weighted integration matrix decomposition model is:
Figure GDA0004054347040000127
where $L^{(k)}$ denotes the value of the objective function at the $k$-th iteration.
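A stopping test of this kind can be sketched as follows. The convergence formula itself is given only as an image, so the relative-change criterion and the tolerance `tol` used here are assumptions, and `has_converged` is a hypothetical name.

```python
def has_converged(prev_loss, curr_loss, tol=1e-5):
    """True when the relative change of the objective falls below tol."""
    # ASSUMPTION: relative change |L(k) - L(k-1)| / |L(k-1)| is the criterion.
    denom = max(abs(prev_loss), 1e-12)         # avoid division by zero
    return abs(curr_loss - prev_loss) / denom < tol

# Example with hypothetical objective values from successive iterations.
print(has_converged(100.0, 90.0))              # large change: keep iterating
print(has_converged(100.0, 99.9999999))        # negligible change: stop
```

Using a relative rather than absolute change makes the threshold independent of the scale of the objective, which varies with the size of the association matrix.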
The biomedical network association prediction method based on ensemble learning of the present embodiment is described below with reference to specific evaluation experiments:
First, the present embodiment performs evaluation experiments on two different types of biomedical association datasets. The Enzyme dataset describes drug-target interactions; its two types of biomedical entities are drugs and protein targets. It contains 445 drugs, 664 protein targets, and 2926 drug-target associations, and it provides not only the known drug-target associations but also a drug structural similarity matrix and a protein sequence similarity matrix. The association information in the Enzyme dataset is derived from four databases: KEGG BRITE, BRENDA, SuperTarget, and DrugBank. The LncRNADisease2015 dataset describes associations between lncRNAs and diseases; its two types of biomedical entities are lncRNAs and diseases. It contains 285 lncRNAs, 226 diseases, and 621 lncRNA-disease associations, and it provides both lncRNA similarity and functional similarity of diseases. The LncRNADisease2015 dataset was obtained by searching the 2015 version of the LncRNADisease database and filtering out duplicate lncRNA-disease association records.
The biomedical network association prediction method based on ensemble learning requires the prediction results of a plurality of algorithms as input. The evaluation experiment of this embodiment involves seven algorithms applied to biomedical association prediction tasks: GRMF, NRLMF, KBMF, CMF, SIMCLDA, BLMNII, and NetLapRLS. In the experiment, the prediction results of the seven algorithms are integrated for comprehensive prediction, and the parameters of each algorithm follow the defaults in its original paper.
Regarding the validation method and the evaluation index, the evaluation experiment of this embodiment adopts ten-fold cross-validation: the drug-target association matrix Y is divided evenly into ten mutually disjoint subsets, and each subset is taken in turn as the test set while the remaining subsets serve as the training set. The elements of Y belonging to the test set are set to 0 and the elements belonging to the training set are kept unchanged, which yields the training data for each fold. In each fold, the model takes the training data as input and outputs a corresponding prediction probability matrix. To evaluate whether the predicted values of the test-set samples in the prediction probability matrix agree with their labels in the known association matrix Y, AUC is selected as the evaluation index. It is calculated as follows:
Calculation of the evaluation index AUC: for a binary classification problem, samples can be divided into positive and negative examples, with the labels "1" and "0" generally denoting positive and negative samples, respectively. After classification, four cases can occur:
(1) If a sample is positive and is predicted to be positive, it is classified as a True Positive (TP);
(2) If a sample is positive but is predicted to be negative, it is classified as a False Negative (FN);
(3) If a sample is negative but is predicted to be positive, it is classified as a False Positive (FP);
(4) If a sample is negative and is predicted to be negative, it is classified as a True Negative (TN).
Figure GDA0004054347040000141
True positive rate: TPR = TP / (TP + FN);
False positive rate: FPR = FP / (FP + TN);
An ROC curve is drawn with TPR on the y-axis and FPR on the x-axis; the AUC value is the area under the ROC curve, that is, the area between the curve and the x-axis. The larger the AUC value, the better the predictive performance of the classifier.
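The AUC calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `roc_auc`, the threshold sweep over the distinct scores and the trapezoidal integration are assumptions chosen for clarity.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve (illustrative sketch, assumed implementation).

    labels: 1 for positive samples, 0 for negative samples.
    scores: predicted probabilities; higher means "more likely positive".
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    # Sweep the decision threshold from the highest score to the lowest.
    thresholds = np.unique(scores)[::-1]
    fpr_points, tpr_points = [0.0], [0.0]
    for t in thresholds:
        predicted_positive = scores >= t
        tp = np.sum(predicted_positive & (labels == 1))
        fn = np.sum(~predicted_positive & (labels == 1))
        fp = np.sum(predicted_positive & (labels == 0))
        tn = np.sum(~predicted_positive & (labels == 0))
        tpr_points.append(tp / (tp + fn))  # TPR = TP / (TP + FN)
        fpr_points.append(fp / (fp + tn))  # FPR = FP / (FP + TN)
    # Integrate TPR over FPR with the trapezoidal rule.
    area = 0.0
    for i in range(1, len(fpr_points)):
        width = fpr_points[i] - fpr_points[i - 1]
        area += width * (tpr_points[i] + tpr_points[i - 1]) / 2.0
    return float(area)
```

In a cross-validation setting, `labels` would be the entries of Y that belong to the test set and `scores` the corresponding entries of the prediction probability matrix.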
To account for the effect of random initialization of the optimization variables and of the dataset partitioning, the experiment was repeated five times with different random seeds.
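The construction of the per-fold training data, and its dependence on the random seed, can be sketched as follows. This is a minimal sketch under stated assumptions: the folds are drawn over all entries of Y, and the function name `ten_fold_training_matrices` and its parameters are hypothetical rather than taken from the embodiment.

```python
import numpy as np

def ten_fold_training_matrices(Y, n_folds=10, seed=0):
    """Yield (Y_train, test_mask) for each cross-validation fold.

    Assumption: every entry of Y is assigned to exactly one fold.
    Entries in the current test fold are set to 0 in Y_train;
    all other entries are kept unchanged.
    """
    Y = np.asarray(Y)
    rng = np.random.default_rng(seed)
    # A random permutation taken modulo n_folds gives disjoint folds
    # whose sizes differ by at most one entry.
    fold_of_entry = (rng.permutation(Y.size) % n_folds).reshape(Y.shape)
    for fold in range(n_folds):
        test_mask = fold_of_entry == fold
        Y_train = Y.copy()
        Y_train[test_mask] = 0
        yield Y_train, test_mask
```

Repeating the experiment with different seeds then amounts to calling the generator with, for example, `seed=0` through `seed=4` and averaging the resulting AUC scores.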
Regarding parameter settings and performance evaluation, some parameters of the adaptive weighted integration matrix decomposition model are set to empirical values: the training weight of positive samples is c = 5, the number of KNN neighbors is set to 30, and the dimensions of the extracted low-dimensional features are f_A = 100 and f_B = 100. For the hyperparameters of the model, namely the matrix decomposition dimension k, the L2 regularization coefficient λ and the entropy regularization coefficient λ_w, the evaluation experiment of this embodiment finds the optimal values by grid search over the ranges k ∈ {10, 20, 30, 40, 50}, λ ∈ {2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3} and λ_w ∈ {2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3}. A parameter sensitivity analysis of these three hyperparameters was performed on the LncRNADisease2015 dataset. As can be seen from FIG. 4, a lower matrix decomposition dimension allows the model to achieve better predictive performance. As can be seen from FIG. 5, the predictive performance of the model is best when the L2 regularization coefficient λ = 1. For the entropy regularization coefficient λ_w, the higher its value, the more uneven the weights among the algorithms: stronger algorithms receive larger weights and weaker algorithms receive smaller ones. The experimental results in FIG. 6 show that appropriately increasing λ_w gives the model a higher AUC score, which also indicates that, compared with directly setting the weights
Figure GDA0004054347040000151
and
Figure GDA0004054347040000152
to average weights, the adaptive weight design in the model effectively improves its predictive performance.
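The grid search over the three hyperparameters can be sketched as follows. The grids are those stated above; the callback `evaluate`, which is assumed to return the mean cross-validated AUC for one hyperparameter combination, and the function name `grid_search` are hypothetical.

```python
from itertools import product

# Hyperparameter grids as stated in the description.
K_GRID = [10, 20, 30, 40, 50]                     # matrix decomposition dimension k
LAMBDA_GRID = [2.0 ** e for e in range(-3, 4)]    # L2 regularization coefficient
LAMBDA_W_GRID = [2.0 ** e for e in range(-3, 4)]  # entropy regularization coefficient

def grid_search(evaluate):
    """Return the best (k, lambda, lambda_w) and its score.

    evaluate(k, lam, lam_w) is a hypothetical callback that trains the
    model with the given hyperparameters and returns its mean AUC.
    """
    best_score, best_params = float("-inf"), None
    for k, lam, lam_w in product(K_GRID, LAMBDA_GRID, LAMBDA_W_GRID):
        score = evaluate(k, lam, lam_w)
        if score > best_score:
            best_score, best_params = score, (k, lam, lam_w)
    return best_params, best_score
```

With these grids the search evaluates 5 × 7 × 7 = 245 combinations, each of which would require a full cross-validation run.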
Five repetitions of ten-fold cross-validation were carried out on the Enzyme and LncRNADisease2015 datasets. The integrated matrix decomposition method under its optimal parameters was compared with all of the methods participating in the integration, such as GRMF, with AUC used uniformly as the evaluation index. The experimental results are as follows:
Method       Enzyme               LncRNADisease2015
GRMF         0.9655±0.002587      0.755601±0.008869
NRLMF        0.976221±0.001731    0.787295±0.008070
KBMF         0.89816±0.002788     0.776094±0.012368
CMF          0.92171±0.011913     0.719111±0.017388
SIMCLDA      0.791411±0.004790    0.835838±0.008980
BLMNII       0.965645±0.006894    0.710114±0.025129
NetLapRLS    0.950058±0.002805    0.781431±0.009761
EnsembleMF   0.980072±0.001393    0.880177±0.004781
It can be seen that, on both the Enzyme and LncRNADisease2015 datasets, the AUC score of the integrated method is higher than that of every method participating in the integration. The ensemble learning-based biomedical network association prediction method can therefore effectively combine the advantages of different algorithms for association prediction.
It is to be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.

Claims (10)

1. The biomedical network association prediction method based on the ensemble learning is characterized by comprising the following steps of:
S1. acquiring the original similarity matrices and the association matrix of the two types of biological entities to be predicted; selecting M existing biomedical network association prediction algorithms, and performing association prediction on the biological entities according to the original similarity matrices and the association matrix to obtain the prediction result of each algorithm;
S2. calculating a prediction similarity matrix of the biological entities from the prediction result of each algorithm; after sparsifying the prediction similarity matrices, combining the prediction results of all algorithms and calculating an integrated similarity matrix of the biological entities by weighted superposition;
S3. extracting low-dimensional features from the original similarity matrices by singular value decomposition, and constructing an adaptive weighted integration matrix decomposition model in combination with the integrated similarity matrices;
S4. training and optimizing the adaptive weighted integration matrix decomposition model until the model converges;
S5. reconstructing a prediction matrix with the converged adaptive weighted integration matrix decomposition model as the final result of the biological entity association prediction.
2. The ensemble learning-based biomedical network association prediction method according to claim 1, wherein in said step S2 the prediction similarity matrices
Figure FDA0004054347030000011
and
Figure FDA0004054347030000012
of the biological entities are calculated by the following formulas:
Figure FDA0004054347030000013
Figure FDA0004054347030000014
where A = {a_1, a_2, …, a_m} and B = {b_1, b_2, …, b_n} denote the two sets of biological entities; Y^(l)(a_i) and Y^(l)(b_i) denote the a_i-th row vector and the b_i-th column vector of the prediction result Y^(l) of the l-th algorithm, respectively; and the parameters controlling the bandwidth of the function are set as
Figure FDA0004054347030000015
and
Figure FDA0004054347030000016
3. The biomedical network association prediction method based on ensemble learning according to claim 2, wherein in said step S2, the prediction similarity matrix is sparsified by the following formulas:
Figure FDA0004054347030000017
Figure FDA0004054347030000021
where N(a_i) denotes the neighbor set of a_i, and N(b_i) denotes the neighbor set of b_i.
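For illustration only, a neighbor-based sparsification of a similarity matrix can be sketched as follows. The exact formula of the claim is given in the figures and is not reproduced here; the rule below (keep S[i, j] only when j is among the most similar neighbors of i, excluding i itself) is an assumption, and the function name `knn_sparsify` is hypothetical.

```python
import numpy as np

def knn_sparsify(S, n_neighbors):
    """Zero every entry of S that does not link an entity to one of its
    nearest neighbors (assumed rule, not the formula of the claim)."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    n_neighbors = min(n_neighbors, n - 1)  # an entity is not its own neighbor
    sparse = np.zeros_like(S)
    for i in range(n):
        similarity = S[i].copy()
        similarity[i] = -np.inf  # exclude the entity itself
        # N(i): indices of the n_neighbors most similar entities.
        neighbors = np.argsort(-similarity, kind="stable")[:n_neighbors]
        sparse[i, neighbors] = S[i, neighbors]
    return sparse
```

Under this assumed rule the result is generally not symmetric, because j may be a neighbor of i without i being a neighbor of j.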
4. The ensemble learning-based biomedical network association prediction method according to claim 3, wherein in said step S2, the integrated similarity matrices G_AS and G_BS of the biological entities are calculated by the following formulas:
Figure FDA0004054347030000022
Figure FDA0004054347030000023
where
Figure FDA0004054347030000024
and
Figure FDA0004054347030000025
are the adaptively learned weights.
5. The ensemble learning-based biomedical network association prediction method according to claim 4, wherein in said step S3, the low-dimensional features F_A and F_B are extracted from the original similarity matrices S_A and S_B by the following formulas:
Figure FDA0004054347030000026
Figure FDA0004054347030000027
where the dimensions of the low-dimensional features are set to f_A and f_B, with f_A < m and f_B < n.
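For illustration only, extracting low-dimensional features from a similarity matrix by singular value decomposition can be sketched as follows. The exact formulas of the claim are given in the figures; the scaling used here (leading left singular vectors multiplied by the square roots of the singular values) is an assumption, and the function name `svd_features` is hypothetical.

```python
import numpy as np

def svd_features(S, n_features):
    """Return an (n x n_features) feature matrix from a similarity matrix S.

    Assumed scaling: F = U_f * sqrt(Sigma_f), so that F @ F.T equals the
    best rank-n_features approximation of a symmetric positive
    semi-definite S.
    """
    S = np.asarray(S, dtype=float)
    if not 0 < n_features < S.shape[0]:
        raise ValueError("n_features must satisfy 0 < n_features < n")
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Singular values are returned in descending order; keep the leading ones.
    return U[:, :n_features] * np.sqrt(sigma[:n_features])
```

For the settings in the description, `n_features` would be 100 for both entity types, which satisfies f_A < m and f_B < n on both datasets.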
6. The ensemble learning-based biomedical network association prediction method as claimed in claim 5, wherein the adaptive weighted integration matrix decomposition model is expressed as follows:
Figure FDA0004054347030000028
where G_AS F_A constitutes the feature representation of the class-A biological entities and G_BS F_B constitutes the feature representation of the class-B biological entities;
Figure FDA0004054347030000029
and
Figure FDA00040543470300000210
denote the embedding matrices that project the two classes of biological entities A and B into a shared k-dimensional space, with k ≤ min(f_A, f_B); u_i denotes the i-th row vector of U, and v_j denotes the j-th row vector of V.
7. The ensemble learning-based biomedical network association prediction method as claimed in claim 6, wherein in step S4, training optimization of the adaptive weighted integration matrix decomposition model is achieved by solving the following objective function:
Figure FDA0004054347030000031
wherein M is the number of biomedical network association prediction algorithms.
8. The biomedical network association prediction method based on integrated learning according to claim 7, wherein in the iterative updating process of each round of step S4, the optimization variables U, V are updated alternately in turn,
Figure FDA0004054347030000032
9. The biomedical network association prediction method based on ensemble learning according to claim 8, wherein, when the optimization variables U, V,
Figure FDA0004054347030000033
are updated alternately in turn in step S4, the partial derivatives of the objective function L with respect to the variables U and V are expressed as follows:
Figure FDA0004054347030000034
Figure FDA0004054347030000035
where ⊙ denotes the Hadamard product operator between matrices; the update formulas of
Figure FDA0004054347030000041
are as follows:
Figure FDA0004054347030000042
Figure FDA0004054347030000043
10. The method according to claim 9, wherein the convergence condition for training and optimizing the adaptive weighted integration matrix decomposition model in step S4 is:
Figure FDA0004054347030000044
where L^(k) denotes the value of the objective function at the k-th iteration.
CN202110236007.3A 2021-03-03 2021-03-03 Biomedical network association prediction method based on ensemble learning Active CN112951320B (en)

Publications (2)

Publication Number Publication Date
CN112951320A CN112951320A (en) 2021-06-11
CN112951320B true CN112951320B (en) 2023-05-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant