
CN113160880A - lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm

Info

Publication number: CN113160880A
Application number: CN202110295353.9A
Authority: CN
Inventors: 林志毅; 朱印廷; 顾国生; 孙宇平; 谢国波
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2021-07-23
Anticipated expiration: 2041-03-19
Also published as: CN113160880B

Abstract

The invention provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, comprising the following steps: S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS; S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the lncRNA-disease association relations; S3: constructing a heterogeneous disease-lncRNA association matrix, which integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS; S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method. The method introduces high-order proximity to reconstruct the similarity matrices of lncRNAs and diseases, establishing a better measure for accurately describing the similarity relations among lncRNAs and among diseases, and constructs a heterogeneous matrix so that the similarity information of lncRNAs and diseases assists the prediction, thereby achieving more accurate lncRNA-disease association prediction.

Description

lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm

Technical Field

The invention relates to the field combining machine learning with biological gene research, and in particular to an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm.

Background

Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs longer than 200 nucleotides. Numerous studies have shown that lncRNAs play a key role in many important biological processes, including translation, splicing, differentiation, epigenetic regulation, and immune responses. In recent years, researchers have found that lncRNA overexpression or dysregulation is closely related to complex diseases such as liver cancer (hepatocellular carcinoma, HCC), gastric cancer, breast cancer, bladder cancer, and Parkinson's disease (PD). Developing computational methods to infer potential disease-lncRNA relations can therefore not only accelerate the diagnosis and treatment of disease but also help explain disease mechanisms at the molecular level. Such methods also reduce time cost and point biological research toward effective experimental directions. Identifying potential disease-associated lncRNAs is thus of great importance for discovering disease biomarkers and for the treatment, diagnosis and prevention of complex human diseases. Given the time and labor that traditional experiments consume, computational models can serve as an effective auxiliary tool for identifying lncRNA-disease associations.

Over the years, a number of computational methods have been developed to infer potential lncRNA-disease associations. Existing methods can be roughly divided into three categories: (1) machine learning-based methods; (2) network-based methods; and (3) matrix completion-based methods.

Machine learning-based methods generally assume that functionally similar lncRNAs are associated with the same or closely related diseases. Unfortunately, most machine learning methods rely heavily on known labeled samples. Negative samples are difficult to obtain in practice, because usually only positive (disease-associated) lncRNAs are reported, so these algorithms have difficulty handling negative samples. Treating a large number of unknown samples as negative may mislabel potential lncRNA-disease associations as negative, which reduces the prediction accuracy of the method.

To reduce the burden of negative samples, network-based lncRNA-disease association prediction has recently become a popular research direction. For example, Chinese patent publication No. CN109243538A, published on January 18, 2019, discloses a method and system for predicting associations between diseases and lncRNAs, comprising: acquiring lncRNA-miRNA associations and miRNA-disease associations from known databases, and constructing an lncRNA-miRNA-disease interaction network from them; constructing a disease super-expression profile and an lncRNA super-expression profile based on the lncRNA-miRNA-disease interaction network; training a prediction model of disease-lncRNA association from these profiles, using lncRNA similarity calculation and disease similarity calculation based on an RBF neural network; and predicting lncRNA-disease association pairs for candidate samples using the prediction model. However, since experimentally verified lncRNA-disease associations are still insufficient, some lncRNA nodes and some disease nodes may have no connecting paths, which limits the prediction accuracy of such methods. In addition, when a new disease or lncRNA node is introduced, network-based methods face a cold-start problem, i.e., the node cannot be predicted, so they often need to consider prediction for single nodes or integrate additional biological information. Meanwhile, although integrating other biological information, such as gene-disease and miRNA-disease associations, can improve prediction performance, some of these interactions may contain noise that interferes with the prediction result.

The third category uses matrix completion to mine lncRNA-disease associations. The main idea is to update the lncRNA-disease adjacency matrix and recover its missing entries, under the assumption that the elements of the final iterate stay as close as possible to the elements of the original adjacency matrix. Compared with the other two categories, matrix completion can capture the global pattern of lncRNA-disease associations, reduce the false positive rate, and does not need negative samples. However, although existing matrix completion methods fuse disease and lncRNA similarity information to assist association prediction, they all focus on explicit similarity information, such as lncRNA functional similarity and disease semantic similarity, and ignore higher-order implicit similarities among lncRNAs and among diseases.

Discovering potential lncRNA-disease associations greatly aids the study of disease pathogenesis and the development of treatments for human diseases. Because traditional biological experiments are time-consuming and labor-intensive, an efficient and reliable computational prediction method is urgently needed. Developing a computational method to reveal unknown lncRNA-disease associations therefore helps both in understanding the main functions of lncRNAs in the pathology and molecular changes of human diseases and in the prognosis, treatment and prevention of complex diseases.

However, although the methods applied to lncRNA-disease association prediction so far use disease and lncRNA similarity information to assist prediction, they all focus on the original linear similarity information of lncRNAs and diseases. Meanwhile, since negative lncRNA-disease association samples are difficult to obtain in practice, the prediction accuracy of many computational methods that require negative samples suffers. Furthermore, in matrix completion algorithms, a 1 in the lncRNA-disease association matrix represents a known lncRNA-disease association and a 0 represents an unknown one. Reasonable predicted values should lie in the range [0,1], indicating the likelihood of association. However, most current matrix completion methods cannot prevent predicted values from falling outside [0,1], which makes biological interpretation difficult.

Disclosure of Invention

The invention provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, which can better predict lncRNA-disease association.

In order to solve the technical problems, the technical scheme of the invention is as follows:

An lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm comprises the following steps:

S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS;

S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the lncRNA-disease association relations;

S3: constructing a heterogeneous disease-lncRNA association matrix, which integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS;

S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method.

Preferably, the lncRNA similarity matrix LS in step S1 specifically includes:

lncRNA expression profiles generated by RNA-Seq technology are downloaded from ArrayExpress; following previous research, the expression similarity of lncRNAs is obtained by calculating the Spearman correlation coefficient between the expression profiles of each lncRNA pair, and the expression similarity of lncRNA li and lncRNA lj is described by the matrix entry LS(li, lj), whose value lies between 0 and 1; the more similar the expression of lncRNA li and lncRNA lj, the higher the score.
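The expression-similarity step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the toy profiles are hypothetical, and taking the absolute value of the Spearman coefficient to obtain scores in [0, 1] is an assumption of this sketch, since the text only states that the similarity lies between 0 and 1.

```python
import numpy as np

def average_ranks(values):
    """Rank a 1-D array (1 = smallest); tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def expression_similarity(profiles):
    """Return LS, where LS[i, j] is |Spearman rho| between rows i and j.

    `profiles` holds one row per lncRNA and one column per sample.
    The absolute value (to map into [0, 1]) is an assumption of this sketch.
    """
    ranked = np.vstack([average_ranks(row) for row in profiles])
    rho = np.corrcoef(ranked)  # Pearson correlation of ranks equals Spearman
    return np.abs(rho)

# Hypothetical toy profiles: 3 lncRNAs measured in 4 samples.
toy_profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                         [1.0, 3.0, 2.0, 4.0],
                         [4.0, 3.0, 2.0, 1.0]])
LS = expression_similarity(toy_profiles)
```

Rows with constant expression have no defined rank correlation and would need separate handling.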

Preferably, the disease similarity matrix DS in step S1 specifically includes:

MeSH descriptors are downloaded from the National Library of Medicine, and a model based on the directed acyclic graph (DAG) is introduced to describe the semantic similarity between diseases. A disease d can be described by a DAG, i.e., DAG(d) = (T(d), E(d)), where T(d) is the node set and E(d) is the edge set. For a given disease d, the contribution value of its ancestor node q in DAG(d) is defined as follows:

Combining the contributions of its ancestor nodes in DAG(d), the semantic value of disease d can be described as:

If the DAGs of two diseases share more nodes, the semantic similarity between the two diseases can be considered higher. The semantic similarity matrix DS(di, dj) represents the semantic similarity between disease di and disease dj, defined as:
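The three definitions above can be illustrated in code. The sketch below assumes the widely used DAG-based formulation in which the disease itself contributes 1, each step toward an ancestor multiplies the contribution by a decay factor (the maximum over all paths is kept), the semantic value is the sum of all contributions, and the similarity is the shared contribution divided by the sum of the two semantic values. The decay value 0.5, the function names and the toy terms are assumptions for illustration, not values taken from this document.

```python
def contributions(disease, parents, decay=0.5):
    """Contribution of `disease` and each of its ancestors to its semantics.

    `parents` maps each term to the list of its direct parent terms.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = decay * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, decay=0.5):
    """Shared contribution divided by the sum of both semantic values."""
    c1 = contributions(d1, parents, decay)
    c2 = contributions(d2, parents, decay)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Hypothetical toy hierarchy: two sibling terms under one root.
toy_parents = {"A": ["root"], "B": ["root"], "root": []}
sim_ab = semantic_similarity("A", "B", toy_parents)
sim_aa = semantic_similarity("A", "A", toy_parents)
```

In this toy hierarchy the siblings share only the root, so their similarity is lower than the self-similarity of either term.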

Preferably, step S1 calculates the high-order proximity matrix of the disease similarity matrix DS, specifically:

a q-order proximity matrix HD is constructed on the basis of the disease similarity matrix DS so as to retain the proximity information of different orders of the disease semantic similarity matrix, as follows:

where DS^n is the n-order proximity of DS, and y is a weight parameter with y ≥ 0;

singular value decomposition techniques are used to improve data quality:

HD = UΣV^T

where U ∈ R^{nd×nd} is the left singular vector matrix, Σ ∈ R^{nd×nd} is the diagonal matrix of singular values in descending order, and V ∈ R^{nd×nd} is the right singular vector matrix;

the high-order proximity matrix HD is then reconstructed by keeping the k largest singular values:

HD = U_k Σ_k V_k^T

where Σ_k is the diagonal matrix of the k largest singular values, and U_k and V_k are the left and right singular vector matrices corresponding to the top-k singular values.
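A minimal sketch of this step, under stated assumptions: the q-order matrix is taken here to be a weighted sum of the first q matrix powers of the similarity matrix, with weight y^n on the n-th power. That weighting, the function names and the toy matrix are assumptions of this illustration; only the truncated SVD reconstruction follows directly from the text.

```python
import numpy as np

def high_order_proximity(S, q, weight):
    """Weighted sum of the first q powers of S (weighting scheme assumed)."""
    S = np.asarray(S, dtype=float)
    H = np.zeros_like(S)
    power = np.eye(S.shape[0])
    for n in range(1, q + 1):
        power = power @ S          # S^n, the n-order proximity
        H += (weight ** n) * power
    return H

def truncated_reconstruction(H, k):
    """Rebuild H from its k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)  # s is in descending order
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical 3x3 similarity matrix.
toy_DS = np.array([[1.0, 0.5, 0.0],
                   [0.5, 1.0, 0.5],
                   [0.0, 0.5, 1.0]])
HD = high_order_proximity(toy_DS, q=2, weight=0.5)
HD_denoised = truncated_reconstruction(HD, k=2)
```

Note that entry (0, 2) of the toy matrix is zero, yet the second-order term makes the corresponding entry of HD positive, because the two items share a neighbor.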

Preferably, the calculation method of the high-order proximity matrix HL of the lncRNA similarity matrix LS is the same as the calculation method of the high-order proximity matrix HD of the disease similarity matrix DS.

Preferably, the disease-lncRNA adjacency matrix DL obtained in step S2 is specifically:

an lncRNA-disease association data set is downloaded from the LncRNADisease database, and duplicate lncRNAs, duplicate diseases and non-human data are deleted from it; the associations are described by a disease-lncRNA adjacency matrix DL ∈ R^{nd×nl}, where nd and nl are the number of diseases and the number of lncRNAs, respectively, and the disease-lncRNA adjacency matrix DL is defined as follows:

DL(i, j) = 1 if disease di is associated with lncRNA lj, and DL(i, j) = 0 otherwise.

Preferably, in step S3, a heterogeneous disease-lncRNA association matrix T is constructed, which is specifically defined as:

Preferably, the matrix completion method in step S4 is used to fill in the entries of the disease-lncRNA association matrix T that correspond to entries of DL with value 0.
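The construction of T and of the set of entries to be completed can be sketched as follows. The block layout [[HD, DL], [DL^T, HL]] is an assumption of this illustration (the text states only that T integrates DL, HL and HD), and the function name and toy inputs are hypothetical.

```python
import numpy as np

def build_heterogeneous_matrix(HD, HL, DL):
    """Return T and a boolean mask of its observed entries.

    Assumed layout: [[HD, DL], [DL^T, HL]], with HD of size nd x nd,
    HL of size nl x nl and DL of size nd x nl.
    """
    HD, HL, DL = (np.asarray(M, dtype=float) for M in (HD, HL, DL))
    T = np.block([[HD, DL], [DL.T, HL]])
    nd = HD.shape[0]
    # Similarity blocks are fully observed; in the association blocks only
    # known associations (DL == 1) are observed, zeros are to be predicted.
    observed = np.ones(T.shape, dtype=bool)
    observed[:nd, nd:] = DL != 0
    observed[nd:, :nd] = DL.T != 0
    return T, observed

# Hypothetical toy inputs: 2 diseases, 3 lncRNAs.
toy_HD = np.eye(2)
toy_HL = np.eye(3)
toy_DL = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
T, observed = build_heterogeneous_matrix(toy_HD, toy_HL, toy_DL)
```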

Preferably, in step S4, a matrix completion method is used to predict lncRNA-disease association, specifically:

let Ω be the index set of the observed entries of X ∈ R^{m×n}, and let P_Ω: R^{m×n} → R^{m×n} be the linear projection operator:

(P_Ω(X))_{ij} = X_{ij} if (i, j) ∈ Ω, and 0 otherwise

a low-rank matrix completion algorithm is adopted to infer the missing values under the assumption that the matrix X is low-rank, and the model is described as follows:

s.t. 0 ≤ X ≤ 1

where ω and α are non-negative parameters that balance the trace norm and the nuclear norm, and the constraint 0 ≤ X ≤ 1 ensures that the recovered matrix elements take values between 0 and 1.
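The two ingredients named here, the projection onto the observed index set and the bound constraint, are simple element-wise operations. A minimal sketch, with hypothetical function names and toy values:

```python
import numpy as np

def project_observed(X, observed):
    """P_Omega: keep entries whose index is in the observed set, zero the rest."""
    return np.where(observed, X, 0.0)

def clip_to_unit_interval(X):
    """Enforce the constraint 0 <= X <= 1 element-wise."""
    return np.clip(X, 0.0, 1.0)

# Hypothetical toy matrix and mask.
toy_X = np.array([[0.2, 0.7],
                  [1.5, -0.3]])
toy_mask = np.array([[True, False],
                     [False, True]])
projected = project_observed(toy_X, toy_mask)
bounded = clip_to_unit_interval(toy_X)
```

Applying the projection twice gives the same result as applying it once, which is what makes it a projection.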

Preferably, in step S4, the alternating direction method of multipliers (ADMM) is used to solve the model; an auxiliary variable matrix Y is introduced, and the model is rewritten as the following optimization problem:

s.t. X = Y, 0 ≤ Y ≤ 1

accordingly, the augmented Lagrangian function corresponding to this problem is:

where Z is the Lagrange multiplier matrix, the inner product is the standard trace inner product, β > 0 is an adaptive penalty parameter, and in the k-th iteration Y_{k+1}, X_{k+1} and Z_{k+1} are updated alternately;

Computing Y_{k+1}: X_k and Z_k are fixed, and the augmented Lagrangian is minimized with respect to Y;

using the adjoint operator of P_Ω, Y_{k+1} is updated as follows:

Computing X_{k+1}: Y_k and Z_k are fixed to compute X_{k+1};

based on the singular value thresholding algorithm, X_{k+1} is expressed as follows:

where S_τ(·) is defined as

S_τ(R) = Σ_d max(σ_d - τ, 0) u_d v_d^T

where τ is the shrinkage threshold, σ_d is the d-th singular value of the matrix R, and u_d and v_d are the corresponding left and right singular vectors;
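The singular value thresholding operator described here can be sketched in a few lines. The function name and the toy matrix are hypothetical; the operation itself (shrink every singular value by τ, floor at zero, and rebuild the matrix) follows the description above.

```python
import numpy as np

def singular_value_threshold(R, tau):
    """Shrink each singular value of R by tau (not below zero) and rebuild."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    shrunk = np.maximum(s - tau, 0.0)
    return (U * shrunk) @ Vt  # scales column d of U by the d-th shrunk value

# Hypothetical toy matrix with singular values 3 and 1.
toy_R = np.diag([3.0, 1.0])
thresholded = singular_value_threshold(toy_R, tau=2.0)
```

With τ = 2 the smaller singular value is removed entirely, which is how the operator lowers the rank of the iterate.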

Computing Z_{k+1}: finally, Z_{k+1} is computed as follows:

Z_{k+1} = Z_k + γβ(X_{k+1} - Y_{k+1})

where γ is a non-negative learning rate.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The method introduces high-order proximity to reconstruct the similarity matrices of lncRNAs and diseases, establishing a better measure for accurately describing the similarity relations among lncRNAs and among diseases; it constructs a heterogeneous matrix so that the similarity information of lncRNAs and diseases assists the prediction; and it designs a matrix completion algorithm with bounded predicted values to predict the likelihood of lncRNA-disease association, thereby achieving more accurate lncRNA-disease association prediction.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 shows the AUCs achieved by HOPMC, GMCLDA, DSCMF, SIMCLDA, BRWLDA and RWRlncD under leave-one-out cross-validation in the example.

FIG. 3 shows the AUCs achieved by HOPMC, GMCLDA, DSCMF, SIMCLDA, BRWLDA and RWRlncD under 5-fold cross-validation in the example.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;

it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

This embodiment provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, as shown in FIG. 1, including the following steps:

The lncRNA similarity matrix LS in step S1 specifically includes:

In step S1, the disease similarity matrix DS specifically includes:

In step S1, the high-order proximity matrix of the disease similarity matrix DS is calculated, specifically:

According to biological experimental observations, one of the basic hypotheses of lncRNA-disease prediction is that lncRNAs with similar functions are often associated with phenotypically similar diseases, and vice versa. The similarity measures for lncRNAs and diseases are therefore key to predicting lncRNA-disease associations. In contrast to explicit pairwise similarity, higher-order similarity can describe indirect similarity information between matrix elements. For example, in a network, if vi and vj have many common neighbors and rich path information, then vi has a high probability of reaching node vj through a two-step random walk, which means that the second-order proximity of the two nodes is high. Inferring the similarity measures of lncRNAs and diseases by considering high-order proximity therefore helps express their similarity information more effectively. A q-order proximity matrix HD is constructed on the basis of the disease similarity matrix DS so as to retain the proximity information of different orders of the disease semantic similarity matrix, as follows:

However, due to the high dimensionality of the matrix, noise may be present in the matrix HD, so singular value decomposition (SVD) is employed to improve data quality. The details of the SVD are as follows:

HD = UΣV^T

where U ∈ R^{nd×nd} is the left singular vector matrix, Σ ∈ R^{nd×nd} is the diagonal matrix of singular values in descending order, V ∈ R^{nd×nd} is the right singular vector matrix, and nd is the number of diseases;

The calculation method of the high-order proximity matrix HL of the lncRNA similarity matrix LS is the same as that of the high-order proximity matrix HD of the disease similarity matrix DS.

In step S2, acquiring the disease-lncRNA adjacency matrix DL specifically as follows:

An lncRNA-disease association data set comprising 687 experimentally validated lncRNA-disease associations is downloaded from the LncRNADisease database; duplicate lncRNAs, duplicate diseases and non-human data are deleted, finally yielding 540 unique experimentally validated lncRNA-disease associations between 115 unique lncRNAs and 178 unique diseases. These associations are described by a disease-lncRNA adjacency matrix DL ∈ R^{nd×nl}, where nd and nl are the numbers of diseases and lncRNAs, respectively: if disease di is associated with lncRNA lj, then DL(i, j) = 1, otherwise DL(i, j) = 0.
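Building DL from a list of association records can be sketched as below. The function name and the placeholder identifiers ("d1", "l1", and so on) are hypothetical; the real data set would come from the LncRNADisease database after the filtering described above.

```python
import numpy as np

def build_adjacency(pairs, diseases, lncrnas):
    """Return DL with DL[i, j] = 1 when disease i is associated with lncRNA j."""
    disease_index = {d: i for i, d in enumerate(diseases)}
    lncrna_index = {l: j for j, l in enumerate(lncrnas)}
    DL = np.zeros((len(diseases), len(lncrnas)))
    for disease, lncrna in set(pairs):  # set() drops duplicate records
        DL[disease_index[disease], lncrna_index[lncrna]] = 1.0
    return DL

# Hypothetical placeholder records, including one duplicate.
toy_pairs = [("d1", "l2"), ("d1", "l2"), ("d2", "l1")]
toy_DL = build_adjacency(toy_pairs, diseases=["d1", "d2"], lncrnas=["l1", "l2"])
```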

In step S3, a heterogeneous disease-lncRNA association matrix T is constructed, which is specifically defined as:

The matrix completion method in step S4 is used to fill in the entries of the disease-lncRNA association matrix T that correspond to entries of DL with value 0.

In step S4, a matrix completion method is used to predict lncRNA-disease association, specifically:

Based on the hypothesis underlying lncRNA-disease prediction that functionally similar lncRNAs tend to be involved in similar diseases, the underlying factors that determine the likelihood of an lncRNA being associated with a disease tend to be highly correlated, which leads to correlations in the corresponding data matrix. Therefore, in the heterogeneous disease-lncRNA association matrix T, only a limited number of independent factors govern the interaction between lncRNAs and diseases, which gives T a low-rank structure. Matrix completion is therefore used to predict potential disease-lncRNA associations.

s.t. 0 ≤ X ≤ 1

where ω and α are non-negative parameters that balance the trace norm and the nuclear norm, and the constraint 0 ≤ X ≤ 1 ensures that the recovered matrix elements take values between 0 and 1, which makes the results biologically easier to interpret.

In step S4, the alternating direction method of multipliers (ADMM) is used to solve the model; an auxiliary variable matrix Y is introduced, and the model is rewritten as the following optimization problem:

s.t. X = Y, 0 ≤ Y ≤ 1

Computing Y_{k+1}: X_k and Z_k are fixed, and the augmented Lagrangian is minimized with respect to Y;

using the adjoint operator of P_Ω, Y_{k+1} is updated as follows:

Computing X_{k+1}: Y_k and Z_k are fixed to compute X_{k+1};

based on the singular value thresholding algorithm, X_{k+1} is expressed as follows:

where S_τ(·) is defined as

Computing Z_{k+1}: finally, Z_{k+1} is computed as follows:

Z_{k+1} = Z_k + γβ(X_{k+1} - Y_{k+1})

where γ is a non-negative learning rate, which may be set to 1 in this embodiment.
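The alternating updates can be put together as in the sketch below. It is an illustration under explicit assumptions, not the patented algorithm: the objective is assumed to be the nuclear norm of X plus (alpha/2) times the squared Frobenius error of Y on the observed entries, subject to X = Y and 0 ≤ Y ≤ 1. Under that assumption the Y-step has an element-wise closed form followed by clipping, the X-step is singular value thresholding with threshold 1/beta, and the Z-step is the dual update given above. The function names, the fixed (non-adaptive) beta, the default parameter values and the fixed iteration count are all choices made for this sketch.

```python
import numpy as np

def svt(R, tau):
    """Singular value thresholding: shrink singular values of R by tau."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def bounded_completion(T, observed, alpha=1.0, beta=1.0, gamma=1.0, n_iter=200):
    """ADMM sketch for bounded low-rank completion (assumed objective, see text)."""
    T = np.asarray(T, dtype=float)
    X = np.where(observed, T, 0.0)
    Y = X.copy()
    Z = np.zeros_like(T)
    for _ in range(n_iter):
        # Y-step: element-wise minimizer, then projection onto [0, 1].
        Y_free = X + Z / beta                                  # unobserved entries
        Y_fit = (alpha * T + Z + beta * X) / (alpha + beta)    # observed entries
        Y = np.clip(np.where(observed, Y_fit, Y_free), 0.0, 1.0)
        # X-step: nuclear-norm proximal step via singular value thresholding.
        X = svt(Y - Z / beta, 1.0 / beta)
        # Z-step: dual update with learning rate gamma.
        Z = Z + gamma * beta * (X - Y)
    return Y  # Y satisfies the bound constraint exactly

# Hypothetical toy problem: a rank-1 matrix with one hidden entry.
toy_T = np.outer([1.0, 0.5, 1.0], [1.0, 0.5, 1.0])
toy_observed = np.ones(toy_T.shape, dtype=bool)
toy_observed[0, 2] = False
scores = bounded_completion(toy_T, toy_observed)
```

Returning Y rather than X is a design choice of the sketch: Y is clipped in every iteration, so every returned score is guaranteed to lie in [0, 1].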

To examine the prediction accuracy of the method of this embodiment (HOPMC), HOPMC was compared with five state-of-the-art methods: GMCLDA, SIMCLDA, DSCMF, BRWLDA and RWRlncD. As can be seen from FIG. 2, under the leave-one-out cross-validation framework the area under the curve (AUC) of HOPMC is 0.8757, which is larger than that of the other methods (GMCLDA 0.8501, SIMCLDA 0.8237, DSCMF 0.8176, BRWLDA 0.7969, RWRlncD 0.6540), indicating that HOPMC outperforms them. To further validate the predictive performance of HOPMC, a 5-fold cross-validation framework was also used. As can be seen from FIG. 3, HOPMC achieves an AUC of 0.8353 ± 0.0045, exceeding the AUC values of the other methods (0.7894 ± 0.0040, 0.7839 ± 0.0045, 0.7734 ± 0.0045, 0.7659 ± 0.0045 and 0.6179 ± 0.0045). HOPMC is thus also more effective than the other methods under 5-fold cross-validation. These results indicate that HOPMC is superior to the compared methods and better suited to predicting lncRNA-disease associations.
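For readers unfamiliar with the metric, the AUC can be computed directly from prediction scores as the probability that a held-out known association is scored above a candidate with no known association. The sketch below shows only this arithmetic; the function name and the toy scores are hypothetical, and the details of how candidates are selected in each cross-validation round are not specified here.

```python
def auc_from_scores(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical toy scores: 3 of the 4 pairs are ordered correctly.
toy_auc = auc_from_scores([0.9, 0.4], [0.5, 0.1])
```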

HOPMC was also evaluated in case studies to assess its practical usefulness for predicting lncRNA-disease associations. To predict new lncRNA-disease associations, the known lncRNA-disease associations are used as the training data set for HOPMC, and the prediction scores of all unknown lncRNA-disease pairs are then calculated and ranked. Osteosarcoma, gastric cancer and hepatocellular carcinoma were chosen as case studies. The top 10 predicted lncRNAs for each cancer were validated against third-party databases (Lnc2Cancer and MNDR). The results are shown in Table I, Table II and Table III, which indicate that 100%, 90% and 90% of the predicted lncRNAs, respectively, are associated with the corresponding cancer.
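The ranking step of the case study can be sketched as follows: for one disease, the lncRNAs without a known association are sorted by predicted score and the highest-scoring ones are kept. The function name, the placeholder identifiers and the scores are hypothetical.

```python
def top_candidates(scores, known, lncrna_names, top_n=10):
    """Rank lncRNAs with no known association to the disease by predicted score."""
    candidates = [
        (name, score)
        for name, score, is_known in zip(lncrna_names, scores, known)
        if not is_known
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:top_n]

# Hypothetical scores for one disease; "l1" is already a known association.
ranked = top_candidates(
    scores=[0.9, 0.2, 0.7, 0.4],
    known=[1, 0, 0, 0],
    lncrna_names=["l1", "l2", "l3", "l4"],
    top_n=2,
)
```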

In addition, HOPMC predicted some lncRNA-disease associations that have not yet been confirmed, including MINA with osteosarcoma and PCA3 with hepatocellular carcinoma. These predicted associations have not been reported in the current literature, but they are likely candidates that medical researchers could study and validate.

Table I. Top 10 potential lncRNAs associated with gastric cancer predicted by HOPMC

Table II. Top 10 potential lncRNAs associated with osteosarcoma predicted by HOPMC

Table III. Top 10 potential lncRNAs associated with hepatocellular carcinoma predicted by HOPMC

The same or similar reference numerals correspond to the same or similar parts;

the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. An lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, characterized by comprising the following steps:

2. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 1, wherein the lncRNA similarity matrix LS in step S1 is specifically:

3. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 1, wherein the disease similarity matrix DS in step S1 is specifically:

4. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 3, wherein the high-order proximity matrix of the disease similarity matrix DS is calculated in step S1, specifically:

singular value decomposition techniques are used to improve data quality:

HD = UΣV^T

5. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 4, wherein the calculation method of the high-order proximity matrix HL of the lncRNA similarity matrix LS is the same as the calculation method of the high-order proximity matrix HD of the disease similarity matrix DS.

6. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 5, wherein the disease-lncRNA adjacency matrix DL obtained in step S2 is specifically:

7. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 6, wherein a heterogeneous disease-lncRNA association matrix T is constructed in step S3, which is specifically defined as:

8. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 7, wherein the matrix completion method in step S4 is used to fill in the entries of the disease-lncRNA association matrix T that correspond to entries of DL with value 0.

9. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 8, wherein the matrix completion method is adopted in step S4 to predict lncRNA-disease associations, specifically:

let Ω be the index set of the observed entries of X ∈ R^{m×n}, and let P_Ω: R^{m×n} → R^{m×n} be the linear projection operator:

s.t. 0 ≤ X ≤ 1

10. The lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm according to claim 9, wherein in step S4 the alternating direction method of multipliers (ADMM) is used to solve the model, an auxiliary variable matrix Y is introduced, and the model is rewritten as the following optimization problem: