CN112951320B

CN112951320B - Biomedical network association prediction method based on ensemble learning

Info

Publication number: CN112951320B
Application number: CN202110236007.3A
Authority: CN
Inventors: 欧阳乐; 卢帆
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2023-05-16
Anticipated expiration: 2041-03-03
Also published as: CN112951320A

Abstract

To address the limitations of the prior art, the invention provides a biomedical network association prediction method based on ensemble learning. The method introduces the prediction results of multiple algorithms, extracts from them the strong connections within each of the two types of biomedical entities, and constructs the corresponding low-dimensional features. A matrix decomposition model then learns the relationship between the two sets of low-dimensional features to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of features. The method thereby overcomes the limitations of any single method, combines the perspectives of multiple algorithms, and provides more accurate and more robust prediction results.

Description

Biomedical network association prediction method based on ensemble learning

Technical Field

The invention relates to the technical field of computational biology, in particular to biological data mining, and more particularly relates to a biomedical network association prediction method based on ensemble learning.

Background

The development of complex diseases such as cancer often results not from the deregulation or mutation of individual biomolecules but from dysfunction of the regulatory networks formed by interactions between biomolecules. During the occurrence and development of a disease, some biomolecules behave abnormally, and identifying the abnormal biomolecules that are highly correlated with the disease is very helpful for its prevention, diagnosis and treatment. In recent years, many studies have verified associations between different types of biological entities through biological experiments, such as associations between drugs and protein targets, between diseases and microRNAs, and between diseases and long non-coding RNAs (lncRNAs). However, identifying new biomedical associations by biological experiments is time-consuming and costly. In computer science, such problems can be abstracted as association prediction problems on a bipartite network; a conceptual diagram of the biomedical bipartite network is shown in FIG. 1. Predicting potential associations by computational methods, and thereby providing references and suggestions for biological experiments, therefore helps to improve the efficiency of biomedical association identification and to reduce its cost.

Over the last decade, various computational methods have been applied to biomedical network association prediction tasks. According to their principles, they can be roughly divided into three categories: network diffusion models, feature-based classification methods, and matrix decomposition-based methods. Network diffusion models mainly use graph-based methods to propagate the associations in the biomedical network and thereby predict potential associations. Feature-based classification methods represent each association by the features of the two nodes it connects and then feed these representations into a machine learning model for training. Matrix decomposition-based methods attempt to learn two or more low-dimensional factor matrices from the biomedical association matrix and then multiply them to reconstruct the association matrix. However, biomedical association networks vary widely in type, and the assumptions of any single prediction method may not accurately characterize all data.

Chinese invention patent CN110993113A, published on 2020-04-10 and entitled "Method and system for predicting lncRNA-disease relationships based on MF-SDAE", attempts to extract multiple features of lncRNAs and of diseases from a plurality of lncRNA databases and disease databases so as to provide a fast and effective scheme; however, that scheme still has certain limitations.

Disclosure of Invention

Aiming at the limitation of the prior art, the invention provides a biomedical network association prediction method based on ensemble learning, which adopts the following technical scheme:

A biomedical network association prediction method based on ensemble learning comprises the following steps:

S1, acquiring the original similarity matrices and the association matrix of the two types of biological entities to be predicted; applying a plurality of biomedical network association prediction algorithms respectively, each performing association prediction on the biological entities according to the original similarity matrices and the association matrix, to obtain the prediction result of each algorithm;

S2, calculating the prediction similarity matrices of the biological entities from the prediction result of each algorithm; after sparsifying the prediction similarity matrices, obtaining the integrated similarity matrices of the biological entities by weighted superposition over the prediction results of all the algorithms;

S3, extracting low-dimensional features from the original similarity matrices by singular value decomposition, and constructing an adaptive weighted integration matrix decomposition model in combination with the integrated similarity matrices;

S4, training and optimizing the adaptive weighted integration matrix decomposition model until the model converges;

S5, reconstructing a prediction matrix with the converged adaptive weighted integration matrix decomposition model as the final result of the biological entity association prediction.

Compared with the prior art, the method introduces the prediction results of multiple algorithms, extracts from them the strong connections within each of the two types of biomedical entities, and constructs the corresponding low-dimensional features. A matrix decomposition model then learns the relationship between the two sets of low-dimensional features to explain the observed biomedical associations, and the final model reconstructs an integrated prediction result from the two sets of features. The method thereby overcomes the limitations of any single method, combines the perspectives of multiple algorithms, and provides more accurate and more robust prediction results.

As a preferred embodiment, in the step S2, the predicted similarity matrix of the biological entity is obtained by calculation according to the following formula

and

where $A=\{a_1,a_2,\dots,a_m\}$ and $B=\{b_1,b_2,\dots,b_n\}$ respectively denote the two sets of biological entities; $Y^{(l)}(a_i)$ and $Y^{(l)}(b_i)$ respectively denote the $a_i$-th row vector and the $b_i$-th column vector of the prediction result $Y^{(l)}$ of the $l$-th algorithm; the parameters controlling the bandwidth of the function are set as

and

Further, in step S2, the prediction similarity matrices are sparsified by the following formula:

where $N(a_i)$ denotes the neighbor set of $a_i$, and $N(b_i)$ denotes the neighbor set of $b_i$.

Further, in step S2, the integrated similarity matrices $G_{AS}$ and $G_{BS}$ of the biological entities are obtained by calculation according to the following formula:

wherein ,

and

Is the weight of adaptive learning.

Further, in step S3, the low-dimensional features $F_A$ and $F_B$ are extracted from the original similarity matrices $S_A$ and $S_B$ by the following formula:

where the dimensions of the low-dimensional features are set to $f_A$ ($f_A<m$) and $f_B$ ($f_B<n$).

As a preferred scheme, the adaptive weighted integration matrix decomposition model is expressed by the following formula:

where $G_{AS}F_A$ constitutes the feature representation of the class-A biological entities, and $G_{BS}F_B$ constitutes the feature representation of the class-B biological entities;

and

represent the embedding matrices that project the two classes of biological entities A and B into a shared $k$-dimensional space, with $k\le\min(f_A,f_B)$; $u_i$ denotes the $i$-th row vector of $U$, and $v_j$ denotes the $j$-th row vector of $V$.

Further, in the step S4, training and optimizing the adaptive weighted integration matrix decomposition model is implemented by solving the following objective function:

wherein M is the number of biomedical network association prediction algorithms.

Further, in the process of each iteration update in the step S4, the optimization variables U, V are updated alternately in turn,

Further, in the step S4, the optimization variables U, V are updated alternately in sequence,

The partial derivatives of the objective function L with respect to the variables U and V are expressed as follows:

where "⊙" denotes the Hadamard product operator between matrices;

The update formulas of the weights are as follows:

Further, in step S4, the convergence condition for training and optimizing the adaptive weighted integration matrix decomposition model is:

where $L^{(k)}$ denotes the value of the objective function at the $k$-th iteration.

Drawings

FIG. 1 is a conceptual diagram of a biomedical bipartite network;

FIG. 2 is a flowchart of steps of a biomedical network association prediction method based on ensemble learning according to an embodiment of the present invention;

FIG. 3 is a logic diagram of a biomedical network association prediction method based on ensemble learning according to an embodiment of the present invention;

FIG. 4 is a graph of AUC over lncRNADisease2015 dataset as a function of parameter k in an evaluation experiment in accordance with an embodiment of the present invention;

FIG. 5 is a graph of AUC over lncRNADisease2015 dataset as a function of parameter λ in an evaluation experiment in accordance with an embodiment of the present invention;

FIG. 6 is a graph of AUC over lncRNADisease2015 dataset as a function of parameter $\lambda_w$ in an evaluation experiment in accordance with an embodiment of the present invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;

it should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the embodiments of the present application, are within the scope of the embodiments of the present application.

The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims. In the description of this application, it should be understood that the terms "first," "second," "third," and the like are used merely to distinguish between similar objects and are not necessarily used to describe a particular order or sequence, nor should they be construed to indicate or imply relative importance. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art as the case may be.

Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it. The invention is further illustrated in the following figures and examples.

In order to solve the limitations of the prior art, the present embodiment provides a technical solution, and the technical solution of the present invention is further described below with reference to the drawings and the embodiments.

Referring to fig. 2, the biomedical network association prediction method based on ensemble learning includes the following steps:

Specifically, similar techniques such as MF-SDAE also incorporate data from multiple sources and employ a matrix decomposition method, but differ in how the multi-source data are integrated: MF-SDAE integrates multi-source data by simply stacking feature matrices, whereas the present method adopts adaptive weighted superposition, which automatically adjusts the weights among the algorithms during model optimization and adapts more flexibly to different data, making the prediction more robust. Meanwhile, compared with the matrix decomposition method of MF-SDAE, the embodiment of the invention uses logistic matrix decomposition for modeling, which better matches the fact that biomedical association data take only the binary values 0 and 1 and therefore describes such data more accurately.

In general, the study of the associations between two classes of biological entities in a biomedical network, such as drugs, protein targets and miRNAs, can use two types of nodes, $A=\{a_1,a_2,\dots,a_m\}$ and $B=\{b_1,b_2,\dots,b_n\}$, to represent the two sets of biological entities respectively. The association matrix $Y\in\{0,1\}^{m\times n}$ represents the known biomedical associations: $Y_{ij}=1$ indicates that an association exists between $a_i$ and $b_j$, and $Y_{ij}=0$ indicates that the association between $a_i$ and $b_j$ is unknown. The purpose of a biomedical network association prediction algorithm is to identify, among the unknown pairs, those with the highest probability of being associated. In biomedical association prediction tasks, the original similarity matrices of the biological entities, $S_A\in\mathbb{R}^{m\times m}$ and $S_B\in\mathbb{R}^{n\times n}$, are used as input in addition to the known association matrix. The prediction results computed by the $M$ algorithms can then be expressed as $\{Y^{(1)},Y^{(2)},\dots,Y^{(M)}\}$.

In the prediction matrix reconstructed in step S5, the larger the element values, the greater the probability that the corresponding biomedical association pair is a potential association.
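By way of non-limiting illustration, the reading of the reconstructed prediction matrix described above could be sketched as follows. The function name `top_candidates` and the toy matrices are hypothetical and are not part of the disclosed method; the sketch only shows how pairs with unknown status can be ranked by their predicted values.

```python
import numpy as np

def top_candidates(P, Y, top_k=3):
    """Rank pairs with unknown status (Y == 0) by predicted score, highest first."""
    scores = np.where(Y == 0, P, -np.inf)          # exclude already-known associations
    order = np.argsort(scores, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(order, P.shape)
    return [(int(i), int(j), float(P[i, j]))
            for i, j in zip(rows, cols) if Y[i, j] == 0]

# Toy data (illustrative only): 2 entities of class A, 3 entities of class B.
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
P = np.array([[0.9, 0.2, 0.7],
              [0.4, 0.8, 0.1]])
candidates = top_candidates(P, Y, top_k=2)
print(candidates)
```

Known associations are masked out before ranking so that only candidate pairs are returned.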

As a preferred embodiment, in said step S2 a prediction similarity matrix for said biological entity is obtained by calculation of the following formula

and

and

Specifically, in the above step, prediction similarity matrices are constructed from the prediction result $Y^{(l)}$ of the $l$-th algorithm along the row direction and the column direction respectively

and

The construction method is the same as the Gaussian Interaction Profile method.
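As a minimal sketch of a Gaussian interaction profile kernel applied to one algorithm's prediction matrix, the following may be considered. The bandwidth normalization by the mean squared norm of the profiles is the usual convention for this kernel and is assumed here; the exact formula of the present embodiment is not reproduced in this text, and the names `gip_kernel`, `Y_l`, `K_A` and `K_B` are hypothetical.

```python
import numpy as np

def gip_kernel(profiles):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    sq_norms = np.sum(profiles ** 2, axis=1)
    # Assumed bandwidth: reciprocal of the mean squared norm of the profiles.
    gamma = 1.0 / np.mean(sq_norms)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off

rng = np.random.default_rng(0)
Y_l = rng.random((4, 6))   # stand-in for the m x n prediction matrix of one algorithm
K_A = gip_kernel(Y_l)      # row direction: similarity among the m entities of A
K_B = gip_kernel(Y_l.T)    # column direction: similarity among the n entities of B
```

Calling the same function on the matrix and on its transpose yields the two similarity matrices for the row and column directions.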

Specifically, through the steps, K neighbors are used for sparsifying each prediction similarity matrix, so that weak links possibly with noise can be filtered while strong links in a network are maintained.
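The K-neighbor sparsification described above could be sketched as follows. The rule used here, keeping an entry when either entity is among the other's K nearest neighbors, is an assumption made for illustration; the sparsification formula of the present embodiment may combine the neighbor sets differently. The name `knn_sparsify` is hypothetical.

```python
import numpy as np

def knn_sparsify(K, n_neighbors):
    """Keep K[i, j] only if j is among i's nearest neighbours or vice versa."""
    n = K.shape[0]
    sim = K.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)                    # an entity is not its own neighbour
    idx = np.argsort(-sim, axis=1)[:, :n_neighbors]   # strongest links of each row
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), idx.shape[1]), idx.ravel()] = True
    mask |= mask.T                                    # assumed rule: symmetrise the relation
    np.fill_diagonal(mask, True)                      # keep self-similarity
    return np.where(mask, K, 0.0)

K = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
K_sparse = knn_sparsify(K, n_neighbors=1)
print(K_sparse)
```

In this toy example only the two strong links (0.9 and 0.8) survive, while the weak links are set to zero.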

wherein ,

and

Is the weight of adaptive learning.

Specifically, in the above steps, the integration is performed by adopting a weighted superposition manner, so that the consistent information and the complementary information between the prediction results of each algorithm can be effectively utilized.
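A weighted superposition of several similarity matrices could be sketched as below. In the present embodiment the weights are learned adaptively; here they are simply given, and their normalization to sum to one is an assumption of this sketch. The name `integrate` is hypothetical.

```python
import numpy as np

def integrate(similarities, weights):
    """Weighted sum of M similarity matrices; weights are normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # assumed normalisation
    return np.tensordot(w, np.stack(similarities), axes=1)

S1 = np.eye(2)
S2 = np.ones((2, 2))
G = integrate([S1, S2], weights=[1.0, 3.0])          # 0.25 * S1 + 0.75 * S2
print(G)
```

Entries on which the algorithms agree are reinforced, while an entry supported by only one algorithm is scaled by that algorithm's weight.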

In particular, the low-dimensional features obtained in the above steps will provide information of the original similarity matrix for the matrix factorization model of the next step, while providing a compact representation.
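The extraction of low-dimensional features by singular value decomposition could be sketched as follows. Scaling the leading left singular vectors by the square roots of the singular values is one common choice and is assumed here; the exact formula of the present embodiment is not reproduced in this text. The names `svd_features`, `S_A` and `F_A` in the code are illustrative stand-ins.

```python
import numpy as np

def svd_features(S, dim):
    """Return an (n x dim) feature matrix from the top `dim` singular triplets of S."""
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return U[:, :dim] * np.sqrt(s[:dim])              # assumed scaling by sqrt(singular values)

rng = np.random.default_rng(0)
X = rng.random((5, 5))
S_A = X @ X.T                  # toy symmetric similarity matrix for 5 entities
F_A = svd_features(S_A, dim=2) # compact 2-dimensional representation of each entity
```

With this scaling, the product of the feature matrix with its own transpose approximates a symmetric positive semi-definite similarity matrix, and the approximation becomes exact when all dimensions are kept.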

As a preferred embodiment, the integrated similarity matrices and the low-dimensional features can be used to construct two low-dimensional feature spaces for the two classes of biomedical entities A and B respectively, and to find the essential links between the features in the two feature spaces. On the basis of logistic matrix decomposition, the adaptive weighted integration matrix decomposition model is expressed by the following formula:

and

Specifically, since the elements of the association matrix Y take only the two values 0 and 1, the present embodiment fits the observed data with a Bernoulli distribution. In biomedical network association prediction problems, the positive examples are verified biomedical association pairs and the negative examples are pairs whose association status is unknown, so the positive examples are more reliable than the negative examples. To emphasize the role of the positive examples during training, each positive example can be treated as c (c > 1) positive examples. This parameter is set to the default value c=5 in the model. Assuming that the training samples are independent, the conditional probability of the observed data is:

Assuming that U and V both follow zero-mean Gaussian prior distributions:

where I is the identity matrix. By Bayesian inference, the posterior probability of the model parameters U and V is obtained as follows:

Subsequently, maximizing the log posterior probability is equivalent to establishing the following objective function:

wherein

To adaptively adjust the weights between the algorithms, the present embodiment also adds the weights to the objective function as optimization variables; it further introduces an entropy regularization term into the objective function to control the distribution of the weights and prevent them from over-fitting to a single algorithm.
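To illustrate why an entropy term can control the distribution of the weights, the following sketch compares the Shannon entropy of a uniform and a concentrated weight vector over seven algorithms. How the term enters the objective function of the present embodiment (its sign and exact form) is not recoverable from this text, so the sketch only shows the quantity itself; the name `weight_entropy` and the weight vectors are hypothetical.

```python
import numpy as np

def weight_entropy(w, eps=1e-12):
    """Shannon entropy of a weight vector that sums to 1."""
    w = np.asarray(w, dtype=float)
    return float(-np.sum(w * np.log(w + eps)))        # eps avoids log(0)

M = 7                                                 # number of integrated algorithms
uniform = np.full(M, 1.0 / M)                         # every algorithm weighted equally
peaked = np.array([0.94] + [0.01] * (M - 1))          # almost all weight on one algorithm

print(weight_entropy(uniform), weight_entropy(peaked))
```

The entropy is largest for uniform weights and approaches zero as the weight concentrates on a single algorithm, so a regularizer built on it can penalize or reward concentration depending on its coefficient.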

Therefore, after further sorting, in the step S4, training and optimizing the adaptive weighted integration matrix decomposition model is achieved by solving the following objective function:

Specifically, when updating the optimization variables, the present embodiment solves for each variable with the other three variables fixed, and optimizes the four variables in turn in this way.

where "⊙" denotes the Hadamard product operator between matrices. Specifically, the variables U and V can be updated with the adaptive gradient descent optimizer Adagrad.
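A minimal sketch of alternating Adagrad updates of U and V for a weighted logistic matrix decomposition is given below. It assumes the standard form of the negative log posterior (positive examples weighted by c, zero-mean Gaussian priors giving an L2 penalty) and assumes that the entity embeddings are obtained as `XA @ U` and `XB @ V`, where `XA` and `XB` stand in for the feature representations of the two classes; the adaptive weights and the entropy term are omitted. All names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(U, V, XA, XB, Y, c=5.0, lam=1.0):
    """Assumed negative log posterior of a weighted logistic matrix decomposition."""
    A, B = XA @ U, XB @ V                 # embeddings in the shared k-dimensional space
    Z = A @ B.T                           # logits for every (a_i, b_j) pair
    W = 1.0 + (c - 1.0) * Y               # each positive counts c times, unknowns once
    loss = np.sum(W * np.logaddexp(0.0, Z) - c * Y * Z)
    loss += 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
    R = W * sigmoid(Z) - c * Y            # d(loss)/dZ; '*' is the Hadamard product
    grad_U = XA.T @ R @ B + lam * U
    grad_V = XB.T @ R.T @ A + lam * V
    return loss, grad_U, grad_V

def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    """In-place Adagrad update: per-coordinate step scaled by accumulated gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)

rng = np.random.default_rng(0)
m, n, fA, fB, k = 6, 5, 4, 3, 2
XA, XB = rng.normal(size=(m, fA)), rng.normal(size=(n, fB))
Y = (rng.random((m, n)) < 0.3).astype(float)
U, V = 0.1 * rng.normal(size=(fA, k)), 0.1 * rng.normal(size=(fB, k))
hU, hV = np.zeros_like(U), np.zeros_like(V)
history = []
for _ in range(50):
    loss, gU, _ = loss_and_grads(U, V, XA, XB, Y)
    history.append(loss)
    adagrad_step(U, gU, hU)               # update U with V fixed
    _, _, gV = loss_and_grads(U, V, XA, XB, Y)
    adagrad_step(V, gV, hV)               # update V with U fixed
P = sigmoid((XA @ U) @ (XB @ V).T)        # reconstructed prediction matrix
```

The list `history` records the objective value at each iteration so that a convergence criterion can be evaluated on it.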

The update formulas of the weights are as follows:

The biomedical network association prediction method based on the integrated learning of the present embodiment will be described below with reference to specific evaluation experiments:

First, the present embodiment performs evaluation experiments on two different types of biomedical association datasets. The Enzyme dataset is a biomedical association dataset describing drug-target interactions; its two types of biomedical entities are drugs and protein targets. The Enzyme dataset contains 445 drugs, 664 protein targets and 2926 drug-target associations; in addition to the known drug-target associations, it provides a drug structural similarity matrix and a protein sequence similarity matrix. The association information in the Enzyme dataset is derived from four databases: KEGG BRITE, BRENDA, SuperTarget and DrugBank. The LncRNADisease2015 dataset is a biomedical association dataset describing associations between lncRNAs and diseases; its two types of biomedical entities are lncRNAs and diseases. The LncRNADisease2015 dataset contains 285 lncRNAs, 226 diseases and 621 lncRNA-disease associations, and provides both lncRNA similarity and disease functional similarity. It was obtained by searching the 2015 version of the LncRNADisease database and filtering out duplicate lncRNA-disease association records.

The biomedical network association prediction method based on ensemble learning requires the prediction results of multiple algorithms as input. The evaluation experiment of this embodiment involves seven algorithms applied to biomedical association prediction tasks: GRMF, NRLMF, KBMF, CMF, SIMCLDA, BLMNII and NetLapRLS. In the experiment, the prediction results of these seven algorithms are integrated to perform the ensemble prediction, and the parameters of each algorithm are set to the default values given in its original paper.

Regarding the validation method and the evaluation index, the evaluation experiment of this embodiment adopts ten-fold cross-validation. The drug-target association matrix Y is evenly divided into ten mutually disjoint subsets; each subset is taken in turn as the test set and the remaining subsets as the training set. The elements of Y belonging to the test set are set to 0 and the elements belonging to the training set are kept unchanged, which constructs the training data for each fold. In each fold, the model takes the training data as input and produces a corresponding prediction probability matrix. To evaluate whether the predicted values of the test-set samples in the prediction probability matrix agree with their labels in the known association matrix Y, AUC is selected as the evaluation index, calculated as follows:

Calculation method of the evaluation index AUC. For a binary classification problem, samples can be divided into positive and negative examples, with the labels "1" and "0" generally representing positive and negative samples, respectively. After classification prediction, four cases occur:

(1) If a sample is positive and is predicted to be positive, it is classified as a true positive (TP);

(2) If a sample is positive but is predicted to be negative, it is classified as a false negative (FN);

(3) If a sample is negative but is predicted to be positive, it is classified as a false positive (FP);

(4) If a sample is negative and is predicted to be negative, it is classified as a true negative (TN).

True positive rate: TPR = TP / (TP + FN);

False positive rate: FPR = FP / (FP + TN).

An ROC curve is drawn with TPR as the y-axis and FPR as the x-axis by varying the decision threshold, and the AUC value is the area under the ROC curve, that is, the area enclosed by the curve and the x-axis. The larger the AUC value, the better the prediction performance of the classifier.
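The AUC calculation described above could be sketched as follows: the samples are sorted by decreasing score, TPR and FPR are accumulated at each distinct threshold, and the area under the resulting ROC curve is obtained by the trapezoidal rule. The function name `roc_auc` and the toy data are hypothetical; the sketch assumes that both positive and negative samples are present.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve for 0/1 labels and real-valued scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")        # decreasing score
    labels, scores = labels[order], scores[order]
    tp = np.cumsum(labels == 1)                       # true positives at each cut-off
    fp = np.cumsum(labels == 0)                       # false positives at each cut-off
    # Use only the last index of each distinct score so tied samples move together.
    last = np.r_[np.nonzero(np.diff(scores))[0], len(scores) - 1]
    tpr = np.r_[0.0, tp[last] / tp[-1]]               # TPR = TP / (TP + FN)
    fpr = np.r_[0.0, fp[last] / fp[-1]]               # FPR = FP / (FP + TN)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)                         # 8 of 9 positive-negative pairs ordered correctly
print(auc)
```

In the toy example, eight of the nine positive-negative pairs are ranked correctly, which corresponds to an AUC of 8/9.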

To account for the effects of random initialization of the optimization variables and of the dataset partitioning, the experiment was repeated five times with different random seeds.
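The ten-fold protocol with repeated random seeds could be sketched as follows. The sketch assumes that the folds partition all entries of the association matrix uniformly at random; the function name `kfold_masks` and the toy matrix are hypothetical, and model fitting is left as a comment.

```python
import numpy as np

def kfold_masks(shape, n_folds=10, seed=0):
    """Yield boolean test masks that partition all entries of an m x n matrix."""
    rng = np.random.default_rng(seed)
    n_entries = shape[0] * shape[1]
    perm = rng.permutation(n_entries)
    for fold in np.array_split(perm, n_folds):
        mask = np.zeros(n_entries, dtype=bool)
        mask[fold] = True
        yield mask.reshape(shape)

Y = (np.random.default_rng(1).random((8, 5)) < 0.3).astype(int)  # toy association matrix
n_runs = 0
for seed in range(5):                                 # five repetitions, different seeds
    for test_mask in kfold_masks(Y.shape, n_folds=10, seed=seed):
        Y_train = np.where(test_mask, 0, Y)           # hide test entries by setting them to 0
        # ... fit the model on Y_train, then compare predictions with Y[test_mask] ...
        n_runs += 1
```

Each test mask marks the entries hidden in one fold, and the ten masks of one repetition together cover every entry exactly once.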

Regarding parameter setting and effect evaluation, some parameters of the adaptive weighted integration matrix decomposition model are set to empirical values: the training weight of positive samples $c=5$, the number of KNN neighbors $K=30$, and the dimensions of the extracted low-dimensional features $f_A=100$ and $f_B=100$. For the hyperparameters of the model, namely the matrix decomposition dimension $k$, the L2 regularization coefficient $\lambda$ and the entropy regularization coefficient $\lambda_w$, the evaluation experiment of this embodiment finds the optimal values by grid search over the ranges $k\in\{10,20,30,40,50\}$, $\lambda\in\{2^{-3},2^{-2},2^{-1},2^{0},2^{1},2^{2},2^{3}\}$ and $\lambda_w\in\{2^{-3},2^{-2},2^{-1},2^{0},2^{1},2^{2},2^{3}\}$. A parameter sensitivity analysis of these three hyperparameters was performed on the LncRNADisease2015 dataset. As can be seen from FIG. 4, a lower matrix decomposition dimension enables the model to achieve better prediction performance. As can be seen from FIG. 5, the prediction performance of the model is best at an L2 regularization coefficient of $\lambda=1$. For the entropy regularization coefficient $\lambda_w$, the higher the value, the more uneven the weights between the algorithms: the stronger algorithms receive larger weights, whereas the weaker algorithms receive smaller weights. The experimental results in FIG. 6 show that appropriately increasing the entropy regularization coefficient $\lambda_w$ gives the model a higher AUC score, which also shows that, compared with directly setting the weights of the algorithms to the average weight, the adaptive weight design in the model can effectively improve its prediction performance.
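The grid search over the three hyperparameters could be sketched as follows. The grids are those stated above; the function `toy_evaluate` is a placeholder introduced only so that the sketch runs, and its values carry no experimental meaning. In a real run it would be replaced by the cross-validated AUC of the model trained with the given hyperparameters.

```python
import itertools

k_values = [10, 20, 30, 40, 50]
lam_values = [2.0 ** e for e in range(-3, 4)]         # 2^-3 ... 2^3
lam_w_values = [2.0 ** e for e in range(-3, 4)]       # 2^-3 ... 2^3

def grid_search(evaluate):
    """Return (best_score, best_params) over the full hyperparameter grid."""
    best = None
    for k, lam, lam_w in itertools.product(k_values, lam_values, lam_w_values):
        score = evaluate(k=k, lam=lam, lam_w=lam_w)
        if best is None or score > best[0]:
            best = (score, {"k": k, "lam": lam, "lam_w": lam_w})
    return best

def toy_evaluate(k, lam, lam_w):
    # Placeholder score for demonstration only, not an experimental result.
    return -abs(k - 20) - abs(lam - 1.0) - abs(lam_w - 2.0)

best_score, best_params = grid_search(toy_evaluate)
print(best_params)
```

The full grid contains 5 × 7 × 7 = 245 combinations, each of which requires one complete cross-validation run.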

Five repetitions of ten-fold cross-validation were carried out on the Enzyme and LncRNADisease2015 datasets. The integrated matrix decomposition method under its optimal parameters was compared with all the methods participating in the integration, such as GRMF, using AUC uniformly as the evaluation index. The experimental results are as follows:

Method	Enzyme	lncRNADisease2015
GRMF	0.9655±0.002587	0.755601±0.008869
NRLMF	0.976221±0.001731	0.787295±0.008070
KBMF	0.89816±0.002788	0.776094±0.012368
CMF	0.92171±0.011913	0.719111±0.017388
SIMCLDA	0.791411±0.004790	0.835838±0.008980
BLMNII	0.965645±0.006894	0.710114±0.025129
NetLapRLS	0.950058±0.002805	0.781431±0.009761
EnsembleMF	0.980072±0.001393	0.880177±0.004781

It can be seen that, on both the Enzyme and LncRNADisease2015 datasets, the AUC score of the integrated method is higher than that of every method participating in the integration. The ensemble learning-based biomedical network association prediction method can thus effectively combine the advantages of different algorithms for association prediction.

It is to be understood that the above embodiments of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications in different forms will be apparent to those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.

Claims

1. The biomedical network association prediction method based on the ensemble learning is characterized by comprising the following steps of:

S1, acquiring the original similarity matrices and the association matrix of the two types of biological entities to be predicted; selecting M existing biomedical network association prediction algorithms, each performing association prediction on the biological entities according to the original similarity matrices and the association matrix, to obtain the prediction result of each algorithm;

2. The ensemble learning-based biomedical network association prediction method according to claim 1, wherein the prediction similarity matrix of the biological entity is obtained by calculation in the step S2 by the following formula

and

and

3. The biomedical network association prediction method based on ensemble learning according to claim 2, wherein in said step S2, the prediction similarity matrices are sparsified by the following formula:

4. The ensemble learning-based biomedical network association prediction method according to claim 3, wherein in said step S2, the integrated similarity matrices $G_{AS}$ and $G_{BS}$ of the biological entities are obtained by calculation of the following formula:

wherein ,

and

Is the weight of adaptive learning.

5. The ensemble learning-based biomedical network association prediction method according to claim 4, wherein in said step S3, the low-dimensional features $F_A$ and $F_B$ are extracted from the original similarity matrices $S_A$ and $S_B$ by the following formula:

where the dimensions of the low-dimensional features are set to $f_A$ and $f_B$, with $f_A<m$ and $f_B<n$.

6. The ensemble learning-based biomedical network association prediction method as claimed in claim 5, wherein the adaptive weighted integration matrix decomposition model is expressed as follows:

and

represent the embedding matrices that project the two classes of biological entities A and B into a shared $k$-dimensional space, with $k\le\min(f_A,f_B)$; $u_i$ denotes the $i$-th row vector of $U$, and $v_j$ denotes the $j$-th row vector of $V$.

7. The ensemble learning-based biomedical network association prediction method as claimed in claim 6, wherein in step S4, training optimization of the adaptive weighted integration matrix decomposition model is achieved by solving the following objective function:

8. The biomedical network association prediction method based on integrated learning according to claim 7, wherein in the iterative updating process of each round of step S4, the optimization variables U, V are updated alternately in turn,

9. The biomedical network association prediction method based on ensemble learning according to claim 8, wherein in step S4 the optimization variables U, V are updated alternately in sequence,

The partial derivatives of the objective function L with respect to the variables U and V are expressed as follows:

The update formulas of the weights are as follows:

10. The method according to claim 9, wherein in step S4 the convergence condition for training and optimizing the adaptive weighted integration matrix decomposition model is: