CN113160880A - lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm - Google Patents
- Publication number
- CN113160880A CN202110295353.9A
- Authority
- CN
- China
- Prior art keywords
- matrix
- lncrna
- disease
- similarity
- association
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Chemical & Material Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Molecular Biology (AREA)
- Medicinal Chemistry (AREA)
- Pharmacology & Pharmacy (AREA)
- Crystallography & Structural Chemistry (AREA)
- Computing Systems (AREA)
- Evolutionary Biology (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Analytical Chemistry (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, comprising the following steps. S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS; S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the known lncRNA-disease associations; S3: constructing a heterogeneous disease-lncRNA association matrix that integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS; S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method. The method introduces high-order proximity to reconstruct the similarity matrices of lncRNAs and diseases, establishing a better measure of the similarity between lncRNAs and between diseases, and constructs a heterogeneous matrix so that the similarity information of lncRNAs and diseases assists the prediction, thereby achieving more accurate lncRNA-disease association prediction.
Description
Technical Field
The invention relates to the combined field of machine learning and genetics, and in particular to an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm.
Background
LncRNAs are a class of non-coding RNAs longer than 200 nucleotides. Numerous studies have shown that lncRNAs play a key role in many important biological processes, including translation, splicing, differentiation, epigenetic regulation, and immune responses. In recent years, scientists have found that lncRNA overexpression or dysregulation is closely related to complex diseases, including various cancers (such as hepatocellular carcinoma (HCC), gastric cancer, breast cancer and bladder cancer) and Parkinson's disease (PD). Developing computational methods to infer potential disease-lncRNA relationships can therefore not only accelerate the diagnosis and treatment of disease, but also help explain disease mechanisms at the molecular level. Such methods also reduce time cost and point biological research toward promising experiments. Identifying potential disease-associated lncRNAs is thus of great importance for the discovery of disease biomarkers and for the treatment, diagnosis and prevention of complex human diseases. Given the time and labor that traditional experiments consume, computational models can serve as an effective auxiliary tool for identifying lncRNA-disease associations.
Over the years, a number of computational methods have been developed to infer potential lncRNA-disease associations. Existing methods can be roughly divided into three categories: (1) machine learning-based methods, (2) network-based methods, and (3) matrix completion-based methods.
Machine learning-based methods generally assume that functionally similar lncRNAs are associated with the same or closely related diseases. Unfortunately, most of them rely heavily on labeled samples. Negative samples are difficult to obtain in practice, because usually only positive lncRNA-disease associations are reported, so these algorithms struggle to classify negatives. Treating the large number of unknown samples as negatives may mislabel potential lncRNA-disease associations as negative, which reduces prediction accuracy.
To reduce the dependence on negative samples, network-based lncRNA-disease association prediction has also become a popular research direction. For example, Chinese patent publication No. CN109243538A, published on January 18, 2019, discloses a method and system for predicting associations between diseases and lncRNAs, comprising: acquiring lncRNA-miRNA associations and miRNA-disease associations from known databases and constructing an lncRNA-miRNA-disease interaction network from them; constructing a disease super-expression profile and an lncRNA super-expression profile based on that network; training a prediction model of disease-lncRNA association from the two profiles, using lncRNA similarity calculation and disease similarity calculation based on an RBF neural network; and predicting lncRNA-disease association pairs for candidate samples with the prediction model. However, because experimentally verified lncRNA-disease associations are still scarce, some lncRNA nodes and some disease nodes may have no connecting paths in the network, which limits the prediction accuracy of such methods. In addition, when a new disease or lncRNA node is introduced, network-based methods face a cold-start problem, i.e. the node cannot be predicted, so they often need to consider the prediction of individual nodes separately or integrate additional biological information.
Meanwhile, although integrating other biological information, such as gene-disease and miRNA-disease associations, can improve prediction performance, some of these interactions may contain noise that interferes with the prediction result.
The third category uses matrix completion to mine lncRNA-disease associations. The main idea is to update the lncRNA-disease adjacency matrix and recover its missing entries, under the assumption that the entries of the final iterate stay as close as possible to the entries of the original adjacency matrix. Compared with the other two categories, matrix completion can capture the global pattern of lncRNA-disease associations, reduces the false positive rate, and does not need negative samples. However, although existing matrix completion methods fuse disease and lncRNA similarity information to assist association prediction, they all rely on explicit similarity information, such as lncRNA functional similarity and disease semantic similarity, and ignore the higher-order implicit similarities among lncRNAs and among diseases.
The discovery of potential lncRNA-disease associations greatly aids the study of disease pathogenesis and the development of treatments for human diseases. Because traditional biological experiments are time-consuming and labor-intensive, an efficient and reliable computational prediction method is urgently needed. Developing such a method to reveal unknown lncRNA-disease associations therefore helps not only to understand the main functions of lncRNAs in the pathology and molecular changes of human diseases, but also to improve the prognosis, treatment and prevention of complex diseases.
However, although the methods applied to lncRNA-disease association prediction so far use disease and lncRNA similarity information to assist prediction, they all focus on the original linear similarity information. Meanwhile, since negative lncRNA-disease association samples are difficult to obtain in practice, the accuracy of the many methods that require negative samples suffers. Furthermore, in matrix completion algorithms, a 1 in the lncRNA-disease association matrix represents a known lncRNA-disease association and a 0 represents an unknown one, so a reasonable predicted value should lie in the range [0, 1] and indicate the likelihood of association. Most current matrix completion methods, however, cannot prevent predicted values from falling outside [0, 1], which makes biological interpretation difficult.
Disclosure of Invention
The invention provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, which can better predict lncRNA-disease association.
In order to solve the technical problems, the technical scheme of the invention is as follows:
An lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm comprises the following steps:
S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS;
S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the known lncRNA-disease associations;
S3: constructing a heterogeneous disease-lncRNA association matrix that integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS;
S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method.
Preferably, the lncRNA similarity matrix LS in step S1 is obtained as follows:
lncRNA expression profiles generated by RNA-Seq technology are downloaded from ArrayExpress; following previous research, the expression similarity of lncRNAs is computed as the Spearman correlation coefficient between the expression profiles of each lncRNA pair, and the expression similarity of lncRNA $l_i$ and lncRNA $l_j$ is stored in the matrix entry $LS(l_i, l_j)$, which lies between 0 and 1; the more similar the expression of $l_i$ and $l_j$, the higher the score.
Preferably, the disease similarity matrix DS in step S1 is obtained as follows:
after the MeSH descriptors are downloaded from the National Library of Medicine, a model based on directed acyclic graphs (DAGs) is introduced to describe the semantic similarity between diseases. A disease d can be described by a DAG, i.e. $DAG(d) = (d, T(d), E(d))$, where $T(d)$ is the node set and $E(d)$ is the edge set. For a given disease d, the contribution value of its ancestor node q in $DAG(d)$ is defined as follows:
combining the contributions of its ancestor nodes in $DAG(d)$, the semantic value of disease d can be described as:
the semantic similarity between two diseases is considered higher if their DAGs share more nodes; the semantic similarity matrix entry $DS(d_i, d_j)$ represents the semantic similarity between disease $d_i$ and disease $d_j$, defined as:
Preferably, the high-order proximity matrix of the disease similarity matrix DS in step S1 is calculated as follows:
a q-order proximity matrix HD is constructed on the basis of the disease similarity matrix DS so as to retain the proximity information of different orders of the disease semantic similarity matrix, as follows:
where $DS^{n}$ is the n-order proximity of DS, and y is a weight parameter with $y \ge 0$;
singular value decomposition (SVD) is used to improve data quality:
$HD = U \Sigma V^{T}$
where $U \in \mathbb{R}^{nd \times nd}$ is the left singular vector matrix, $\Sigma \in \mathbb{R}^{nd \times nd}$ is the diagonal matrix of singular values in descending order, and $V \in \mathbb{R}^{nd \times nd}$ is the right singular vector matrix;
the high-order proximity matrix HD is then reconstructed by keeping the k largest singular values:
where $\Sigma_k$ is the diagonal matrix of the k largest singular values, and $U_k$ and $V_k$ are the left and right singular vector matrices corresponding to the top-k singular values.
Preferably, the high-order proximity matrix HL of the lncRNA similarity matrix LS is calculated in the same way as the high-order proximity matrix HD of the disease similarity matrix DS.
Preferably, the disease-lncRNA adjacency matrix DL in step S2 is obtained as follows:
an lncRNA-disease association data set is downloaded from the LncRNADisease database, and duplicate lncRNA and disease entries and non-human data are deleted from it; the associations are described by a disease-lncRNA adjacency matrix $DL \in \mathbb{R}^{nd \times nl}$, where nd and nl are the number of diseases and the number of lncRNAs, respectively, and DL is defined as follows:
Preferably, in step S3, a heterogeneous disease-lncRNA association matrix T is constructed, defined as:
Preferably, the matrix completion method in step S4 is used to predict the entries of the disease-lncRNA association matrix T that correspond to elements of DL with value 0.
Preferably, in step S4, a matrix completion method is used to predict lncRNA-disease associations, specifically:
let $\Omega$ be the index set of the observed entries of $X \in \mathbb{R}^{m \times n}$, and let $P_{\Omega}: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ be the linear projection operator:
$(P_{\Omega}(X))_{ij} = X_{ij}$ if $(i, j) \in \Omega$, and $0$ otherwise;
a low-rank matrix completion algorithm is adopted, which infers the missing values under the assumption that the matrix X is low-rank; the model is described as follows:
s.t. $0 \le X \le 1$
where $\omega$ and $\alpha$ are non-negative parameters that balance the trace norm and the nuclear norm, and the constraint $0 \le X \le 1$ ensures that the recovered matrix elements take values between 0 and 1.
Preferably, in step S4, the alternating direction method of multipliers (ADMM) is used to solve the model; an auxiliary variable matrix Y is introduced, and the model is rewritten as follows:
s.t. $X = Y$, $0 \le Y \le 1$
accordingly, the augmented Lagrangian function corresponding to this problem is:
where Z is the Lagrange multiplier, the inner product is the standard trace inner product, and $\beta > 0$ is an adaptive penalty parameter; the k-th iteration alternately updates $Y_{k+1}$, $X_{k+1}$ and $Z_{k+1}$;
computing $Y_{k+1}$: $X_k$ and $Z_k$ are fixed and the augmented Lagrangian is minimized with respect to Y; using the adjoint operator of $P_{\Omega}$, $Y_{k+1}$ is updated as follows:
computing $X_{k+1}$: $Y_k$ and $Z_k$ are fixed to calculate $X_{k+1}$;
based on the singular value thresholding algorithm, $X_{k+1}$ is expressed as follows:
where $S_{\tau}(\cdot)$ is the singular value shrinkage operator, defined as $S_{\tau}(R) = \sum_{d} \max(\sigma_d - \tau, 0)\, u_d v_d^{T}$, in which $\tau$ is the shrinkage threshold, $\sigma_d$ is the d-th singular value of the matrix R, and $u_d$ and $v_d$ are the corresponding left and right singular vectors;
computing $Z_{k+1}$: finally, $Z_{k+1}$ is calculated as follows:
$Z_{k+1} = Z_k + \gamma\beta\,(X_{k+1} - Y_{k+1})$
where $\gamma$ is a non-negative learning rate.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the method introduces high-order proximity to reconstruct the similarity matrices of lncRNAs and diseases, establishing a better measure of the similarity between lncRNAs and between diseases; it constructs a heterogeneous matrix so that the similarity information of lncRNAs and diseases assists the prediction; and it designs a matrix completion algorithm with bounded predicted values to estimate the likelihood of lncRNA-disease association, thereby achieving more accurate lncRNA-disease association prediction.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 shows the AUCs achieved by HOPMC, GMCLDA, DSCMF, SIMCLDA, BRWLDA and RWRlncD under leave-one-out cross-validation in the example.
FIG. 3 shows the AUCs achieved by HOPMC, GMCLDA, DSCMF, SIMCLDA, BRWLDA and RWRlncD under 5-fold cross-validation in the example.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm which, as shown in FIG. 1, includes the following steps:
S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS;
S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the known lncRNA-disease associations;
S3: constructing a heterogeneous disease-lncRNA association matrix that integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS;
S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method.
The lncRNA similarity matrix LS in step S1 is obtained as follows:
lncRNA expression profiles generated by RNA-Seq technology are downloaded from ArrayExpress; following previous research, the expression similarity of lncRNAs is computed as the Spearman correlation coefficient between the expression profiles of each lncRNA pair, and the expression similarity of lncRNA $l_i$ and lncRNA $l_j$ is stored in the matrix entry $LS(l_i, l_j)$, which lies between 0 and 1; the more similar the expression of $l_i$ and $l_j$, the higher the score.
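The expression-similarity step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the input is assumed to be an lncRNA-by-sample expression matrix, and because the text does not say how Spearman coefficients (which lie in [-1, 1]) are mapped onto [0, 1], the sketch takes the absolute value as an explicit assumption.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks of a 1-D array; tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    start = 0
    while start < len(values):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(values) and sorted_vals[end + 1] == sorted_vals[start]:
            end += 1
        ranks[order[start:end + 1]] = 0.5 * (start + end) + 1.0
        start = end + 1
    return ranks

def expression_similarity(profiles):
    """Map an (n_lncRNA, n_samples) expression matrix to an (n_lncRNA, n_lncRNA) LS."""
    ranked = np.vstack([average_ranks(row) for row in np.asarray(profiles, dtype=float)])
    rho = np.corrcoef(ranked)  # Pearson correlation of ranks equals Spearman correlation
    # ASSUMPTION: the absolute value is used to place scores in [0, 1].
    return np.clip(np.abs(rho), 0.0, 1.0)
```

Rows with constant expression have undefined correlation and would need to be filtered beforehand.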
In step S1, the disease similarity matrix DS is obtained as follows:
after the MeSH descriptors are downloaded from the National Library of Medicine, a model based on directed acyclic graphs (DAGs) is introduced to describe the semantic similarity between diseases. A disease d can be described by a DAG, i.e. $DAG(d) = (d, T(d), E(d))$, where $T(d)$ is the node set and $E(d)$ is the edge set. For a given disease d, the contribution value of its ancestor node q in $DAG(d)$ is defined as follows:
combining the contributions of its ancestor nodes in $DAG(d)$, the semantic value of disease d can be described as:
the semantic similarity between two diseases is considered higher if their DAGs share more nodes; the semantic similarity matrix entry $DS(d_i, d_j)$ represents the semantic similarity between disease $d_i$ and disease $d_j$, defined as:
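The three defining equations did not survive in this text. The sketch below assumes the widely used DAG-based semantic similarity model, in which a disease contributes 1 to itself, each step up to an ancestor multiplies the contribution by a decay factor, the semantic value is the sum of contributions, and similarity is the shared contribution divided by the sum of the two semantic values. The decay factor of 0.5, the toy DAG and all function names are assumptions for illustration.

```python
def contributions(disease, parents, delta=0.5):
    """Contribution of every node in DAG(disease) to the semantics of `disease`.

    `parents` maps a term to its direct parent terms. ASSUMPTION: a term
    contributes 1 to itself, and an ancestor receives delta times the largest
    contribution among its children inside the DAG.
    """
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            candidate = delta * contrib[node]
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                stack.append(parent)  # revisit so the improvement propagates upward
    return contrib

def semantic_value(disease, parents, delta=0.5):
    """Sum of the contributions of all nodes in DAG(disease)."""
    return sum(contributions(disease, parents, delta).values())

def semantic_similarity(di, dj, parents, delta=0.5):
    """Shared contributions divided by the sum of the two semantic values."""
    ci = contributions(di, parents, delta)
    cj = contributions(dj, parents, delta)
    shared = set(ci) & set(cj)
    numerator = sum(ci[t] + cj[t] for t in shared)
    return numerator / (sum(ci.values()) + sum(cj.values()))

# Hypothetical toy hierarchy: D has parents B and C, which both have parent A.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
```

With this toy hierarchy, `semantic_similarity("B", "D", toy_parents)` shares the nodes B and A, so more shared ancestry yields a higher score, as the text describes.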
In step S1, the high-order proximity matrix of the disease similarity matrix DS is calculated as follows:
according to observations from biological experiments, one of the basic hypotheses of lncRNA-disease prediction is that lncRNAs with similar functions are often associated with phenotypically similar diseases, and vice versa. The similarity measures for lncRNAs and for diseases are therefore key to predicting lncRNA-disease associations. In contrast to explicit pairwise similarity, higher-order similarity describes indirect similarity information between matrix elements. For example, in a network, if nodes $v_i$ and $v_j$ have many common neighbors and rich path information, then $v_i$ has a high probability of reaching $v_j$ in a two-step random walk, which means that the second-order proximity of the two nodes is high. Taking higher-order proximity into account therefore helps express the similarity information of lncRNAs and diseases more effectively. A q-order proximity matrix HD is constructed on the basis of the disease similarity matrix DS so as to retain the proximity information of different orders of the disease semantic similarity matrix, as follows:
where $DS^{n}$ is the n-order proximity of DS, and y is a weight parameter with $y \ge 0$;
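The formula for HD is missing from this text, so the exact weighting is unknown. As a hedged sketch, the code below assumes that the n-order proximity is the n-th matrix power of the similarity matrix and that HD is a non-negatively weighted sum of the first q powers. The function name and the per-order weight list are illustrative assumptions, not the patent's definition.

```python
import numpy as np

def high_order_proximity(S, weights):
    """Weighted sum of the first q matrix powers of S, with q = len(weights).

    ASSUMPTION: the n-order proximity is S raised to the n-th power and
    weights[n - 1] >= 0 is the weight of order n.
    """
    S = np.asarray(S, dtype=float)
    H = np.zeros_like(S)
    power = np.eye(S.shape[0])
    for w in weights:
        power = power @ S  # S^1, S^2, ..., S^q
        H += w * power
    return H
```

For example, `high_order_proximity(DS, [1.0, 0.5])` would combine the original similarity with half of its second-order proximity.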
however, because of the high dimensionality of the matrix, noise may be present in HD, so singular value decomposition (SVD) is employed to improve data quality. The details of the SVD are as follows:
$HD = U \Sigma V^{T}$
where $U \in \mathbb{R}^{nd \times nd}$ is the left singular vector matrix, $\Sigma \in \mathbb{R}^{nd \times nd}$ is the diagonal matrix of singular values in descending order, $V \in \mathbb{R}^{nd \times nd}$ is the right singular vector matrix, and nd is the number of diseases;
the high-order proximity matrix HD is then reconstructed by keeping the k largest singular values:
where $\Sigma_k$ is the diagonal matrix of the k largest singular values, and $U_k$ and $V_k$ are the left and right singular vector matrices corresponding to the top-k singular values.
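The denoising step can be illustrated with a rank-k truncated SVD. This sketch assumes the reconstruction is the product of the top-k left singular vectors, singular values and right singular vectors, which is what the symbol definitions above suggest; the function name is hypothetical.

```python
import numpy as np

def truncated_svd_reconstruct(H, k):
    """Rebuild H from its k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(np.asarray(H, dtype=float), full_matrices=False)
    # numpy returns singular values in descending order, so the first k are the largest.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```

Choosing k equal to the matrix dimension returns the input unchanged up to rounding, while smaller k discards the directions with the smallest singular values.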
The calculation method of the high-order proximity matrix HL of the lncRNA similarity matrix LS is the same as that of the high-order proximity matrix HD of the disease similarity matrix DS.
In step S2, the disease-lncRNA adjacency matrix DL is obtained as follows:
an lncRNA-disease association data set comprising 687 experimentally validated lncRNA-disease associations is downloaded from the LncRNADisease database; after duplicate lncRNA and disease entries and non-human data are deleted, 540 unique experimentally validated associations between 115 unique lncRNAs and 178 unique diseases remain. These associations are described by a disease-lncRNA adjacency matrix $DL \in \mathbb{R}^{nd \times nl}$, where nd and nl are the number of diseases and the number of lncRNAs, respectively: $DL(i, j) = 1$ if disease $d_i$ is associated with lncRNA $l_j$, and $DL(i, j) = 0$ otherwise.
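Building DL from a list of association pairs can be sketched as below. The pair format, the placeholder names and the function name are assumptions for illustration; real input would come from the curated LncRNADisease records.

```python
import numpy as np

def build_adjacency(pairs):
    """Build the disease-by-lncRNA 0/1 matrix from (disease, lncRNA) pairs.

    Duplicate pairs are collapsed, and rows and columns follow sorted name order.
    """
    diseases = sorted({d for d, _ in pairs})
    lncrnas = sorted({l for _, l in pairs})
    d_index = {d: i for i, d in enumerate(diseases)}
    l_index = {l: j for j, l in enumerate(lncrnas)}
    DL = np.zeros((len(diseases), len(lncrnas)))
    for disease, lncrna in pairs:
        DL[d_index[disease], l_index[lncrna]] = 1.0  # known association
    return DL, diseases, lncrnas

# Hypothetical placeholder records, including one duplicate.
example_pairs = [
    ("disease_x", "lnc_a"),
    ("disease_x", "lnc_a"),
    ("disease_y", "lnc_b"),
    ("disease_x", "lnc_b"),
]
```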
In step S3, a heterogeneous disease-lncRNA association matrix T is constructed, defined as:
the matrix completion method in step S4 is used to predict the entries of the disease-lncRNA association matrix T that correspond to elements of DL with value 0.
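The definition of T is missing from this text. A common layout for such heterogeneous matrices, assumed here rather than taken from the patent, places the two high-order proximity matrices on the diagonal blocks and the adjacency matrix and its transpose on the off-diagonal blocks. The block order and function name are illustrative.

```python
import numpy as np

def heterogeneous_matrix(HL, HD, DL):
    """Assemble T = [[HL, DL^T], [DL, HD]] of size (nl + nd) x (nl + nd).

    ASSUMPTION: HL is (nl, nl), HD is (nd, nd) and DL is (nd, nl); the actual
    block arrangement used in the patent is not recoverable from the text.
    """
    HL, HD, DL = (np.asarray(m, dtype=float) for m in (HL, HD, DL))
    return np.block([[HL, DL.T], [DL, HD]])
```

Under this layout the lower-left block of the completed matrix holds the predicted disease-lncRNA scores.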
In step S4, a matrix completion method is used to predict lncRNA-disease associations, specifically:
under the basic hypothesis of lncRNA-disease prediction that functionally similar lncRNAs tend to be involved in similar diseases, the latent factors that determine the likelihood of an lncRNA being associated with a disease tend to be highly correlated, which induces correlations in the corresponding data matrix. Consequently, in the heterogeneous disease-lncRNA association matrix T, only a limited number of independent factors govern how lncRNAs interact with diseases, so T has a low-rank structure. Matrix completion is therefore used to predict potential disease-lncRNA associations.
Let $\Omega$ be the index set of the observed entries of $X \in \mathbb{R}^{m \times n}$, and let $P_{\Omega}: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ be the linear projection operator:
$(P_{\Omega}(X))_{ij} = X_{ij}$ if $(i, j) \in \Omega$, and $0$ otherwise;
a low-rank matrix completion algorithm is adopted, which infers the missing values under the assumption that the matrix X is low-rank; the model is described as follows:
s.t. $0 \le X \le 1$
where $\omega$ and $\alpha$ are non-negative parameters that balance the trace norm and the nuclear norm, and the constraint $0 \le X \le 1$ ensures that the recovered matrix elements take values between 0 and 1, which makes the results easier to interpret biologically.
In step S4, the alternating direction method of multipliers (ADMM) is used to solve the model; an auxiliary variable matrix Y is introduced, and the model may be rewritten as follows:
s.t. $X = Y$, $0 \le Y \le 1$
accordingly, the augmented Lagrangian function corresponding to this problem is:
where Z is the Lagrange multiplier, the inner product is the standard trace inner product, and $\beta > 0$ is an adaptive penalty parameter; the k-th iteration alternately updates $Y_{k+1}$, $X_{k+1}$ and $Z_{k+1}$;
computing $Y_{k+1}$: $X_k$ and $Z_k$ are fixed and the augmented Lagrangian is minimized with respect to Y; using the adjoint operator of $P_{\Omega}$, $Y_{k+1}$ is updated as follows:
computing $X_{k+1}$: $Y_k$ and $Z_k$ are fixed to calculate $X_{k+1}$;
based on the singular value thresholding algorithm, $X_{k+1}$ is expressed as follows:
where $S_{\tau}(\cdot)$ is the singular value shrinkage operator, defined as $S_{\tau}(R) = \sum_{d} \max(\sigma_d - \tau, 0)\, u_d v_d^{T}$, in which $\tau$ is the shrinkage threshold, $\sigma_d$ is the d-th singular value of the matrix R, and $u_d$ and $v_d$ are the corresponding left and right singular vectors;
computing $Z_{k+1}$: finally, $Z_{k+1}$ is calculated as follows:
$Z_{k+1} = Z_k + \gamma\beta\,(X_{k+1} - Y_{k+1})$
where $\gamma$ is a non-negative learning rate, which may be set to 1 in this embodiment.
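Because the objective and the update formulas are missing from this text, the following is only a sketch of a bounded nuclear-norm completion solved by ADMM, not the patented model. It assumes the objective is the nuclear norm of X plus (alpha/2) times the squared Frobenius error on the observed entries of Y, subject to X = Y and 0 <= Y <= 1; the parameter omega mentioned above is not modelled. The names `svt` and `bounded_completion`, the default parameter values and the stopping rule are all assumptions.

```python
import numpy as np

def svt(R, tau):
    """Singular value shrinkage: subtract tau from each singular value, floor at 0."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def bounded_completion(T, mask, alpha=1.0, beta=1.0, gamma=1.0, n_iter=200, tol=1e-5):
    """ADMM sketch for: min ||X||_* + alpha/2 ||P_mask(Y - T)||_F^2, X = Y, 0 <= Y <= 1.

    ASSUMPTION: this objective is a stand-in for the patent's missing formula.
    `mask` is a boolean array marking the observed entries of T.
    """
    T = np.asarray(T, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    X = np.where(mask, T, 0.0)
    Y = X.copy()
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        # Y-step: entrywise quadratic minimiser, then projection onto [0, 1].
        free = X + Z / beta
        Y = np.where(mask, (alpha * T + beta * free) / (alpha + beta), free)
        Y = np.clip(Y, 0.0, 1.0)
        # X-step: shrinkage with threshold 1/beta, using the freshly updated Y
        # (standard ADMM ordering, assumed here).
        X_new = svt(Y - Z / beta, 1.0 / beta)
        # Multiplier step with learning rate gamma, as in the Z update above.
        Z = Z + gamma * beta * (X_new - Y)
        converged = np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X))
        X = X_new
        if converged:
            break
    return Y  # Y satisfies the box constraint exactly
```

Returning Y rather than X guarantees that every predicted score lies in [0, 1], which is the property the bounded model is meant to provide.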
To examine the prediction accuracy of the method of this embodiment (HOPMC), HOPMC was compared with five state-of-the-art methods: GMCLDA, SIMCLDA, DSCMF, BRWLDA and RWRlncD. As can be seen from FIG. 2, under leave-one-out cross-validation the area under the curve (AUC) of HOPMC is 0.8757, which is larger than that of the other methods (GMCLDA 0.8501, SIMCLDA 0.8237, DSCMF 0.8176, BRWLDA 0.7969, RWRlncD 0.6540), indicating that HOPMC performs better. To further validate the prediction performance of HOPMC, a 5-fold cross-validation framework was also used. As can be seen from FIG. 3, HOPMC achieves an AUC of 0.8353 ± 0.0045, exceeding the AUC values of the other methods (0.7894 ± 0.0040, 0.7839 ± 0.0045, 0.7734 ± 0.0045, 0.7659 ± 0.0045 and 0.6179 ± 0.0045). HOPMC is therefore also more effective than the other methods under 5-fold cross-validation. These results indicate that HOPMC outperforms the compared methods in predicting lncRNA-disease associations.
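For readers who want to reproduce this kind of comparison, the AUC of a set of prediction scores can be computed from pairwise comparisons between held-out positives and candidate negatives. This is a generic sketch with a hypothetical function name; it does not reproduce the evaluation protocol or the figures reported above.

```python
import numpy as np

def auc_score(scores, labels):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win. `labels` holds 1 for positives and 0 for negatives.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))
```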
HOPMC was also evaluated for its practical utility in predicting new lncRNA-disease associations. All known lncRNA-disease associations are used as the training data set for HOPMC, and the prediction scores of every unknown lncRNA-disease pair are then calculated and ranked. Osteosarcoma, gastric cancer and hepatocellular carcinoma were chosen as case studies. The top 10 predicted lncRNAs for each cancer were checked against third-party databases (Lnc2Cancer and MNDR). The results are shown in Table I, Table II and Table III, which indicate that 100%, 90% and 90% of the predicted lncRNAs, respectively, are associated with the cancer in question.
In addition, HOPMC predicted some lncRNA-disease associations that have not yet been confirmed, including MINA with osteosarcoma and PCA3 with hepatocellular carcinoma. These predicted associations have not been reported in the current literature, but they are promising candidates for medical researchers to study and validate.
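The ranking step used in the case studies can be sketched as follows: for one disease, keep only the lncRNAs with no known association and sort them by predicted score. The function name and argument layout are assumptions, with rows indexing diseases and columns indexing lncRNAs as in DL.

```python
import numpy as np

def top_candidates(scores, DL, disease_idx, lncrna_names, k=10):
    """Return the k highest-scoring lncRNAs not already linked to the disease."""
    row = np.asarray(scores, dtype=float)[disease_idx]
    unknown = np.flatnonzero(np.asarray(DL)[disease_idx] == 0)  # skip known associations
    order = unknown[np.argsort(-row[unknown], kind="stable")]
    return [(lncrna_names[j], float(row[j])) for j in order[:k]]
```

The returned names are the candidates that would then be checked against independent databases.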
Table I: Top 10 potential lncRNAs predicted by HOPMC to be associated with gastric cancer
Table II: Top 10 potential lncRNAs predicted by HOPMC to be associated with osteosarcoma
Table III: Top 10 potential lncRNAs predicted by HOPMC to be associated with hepatocellular carcinoma
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. An lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, characterized by comprising the following steps:
S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS;
S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the known lncRNA-disease associations;
S3: constructing a heterogeneous disease-lncRNA association matrix that integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS;
S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method.
2. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 1, wherein the lncRNA similarity matrix LS in step S1 is specifically:
lncRNA expression profiles generated by RNA-Seq technology are downloaded from ArrayExpress; following previous research, the expression similarity of lncRNAs is obtained by calculating the Spearman correlation coefficient between the expression profiles of each pair of lncRNAs, and the expression similarity of lncRNA li and lncRNA lj is described by the matrix entry LS(li, lj), whose value lies between 0 and 1; the more similar the expression of lncRNA li and lncRNA lj, the higher the score.
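By way of non-limiting illustration, the expression-similarity computation described in this claim can be sketched in Python as follows. The function names and the toy expression values are hypothetical, and taking the absolute value to map coefficients into the range 0 to 1 is an assumption of this sketch, since the claim does not state how negative coefficients are handled.

```python
import numpy as np

def rank_average(values):
    """Return 1-based ranks; tied values receive their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_similarity(expr):
    """expr: (n_lncRNA, n_samples) expression matrix -> (n_lncRNA, n_lncRNA) matrix."""
    ranked = np.vstack([rank_average(row) for row in expr])
    rho = np.corrcoef(ranked)   # Pearson correlation of ranks = Spearman coefficient
    return np.abs(rho)          # assumption: absolute value maps [-1, 1] into [0, 1]

# Hypothetical toy profiles: three lncRNAs measured in four samples.
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 9.0],
                 [1.0, 3.0, 2.0, 4.0]])
LS = spearman_similarity(expr)
```

A profile that is constant across all samples has no defined correlation, so such rows would need to be filtered out before calling the function.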
3. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 1, wherein the disease similarity matrix DS in step S1 is specifically:
after the MeSH descriptors are downloaded from the National Library of Medicine, a model based on a directed acyclic graph (DAG) is introduced to describe the semantic similarity between diseases; a disease d can be described by a DAG, i.e. DAG(d) = (d, T(d), E(d)), where T(d) is the node set and E(d) is the edge set, and for a given disease d, the contribution value of its ancestor node q in DAG(d) is defined as follows:
combining the contributions of its ancestor nodes in DAG(d), the semantic value of disease d can be described as:
the semantic similarity between two diseases is considered higher if their DAGs share more nodes; the semantic similarity matrix DS(di, dj) represents the semantic similarity between disease di and disease dj and is defined as:
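By way of non-limiting illustration, a DAG-based semantic similarity of the kind described in this claim can be sketched in Python. The sketch assumes a widely used formulation: a disease contributes 1 to itself, each step up to a parent multiplies the contribution by a decay factor `delta` (0.5 here, a hypothetical value), the semantic value is the sum of all contributions, and the similarity is the sum of the shared nodes' contributions divided by the sum of the two semantic values. The function names, the decay factor, and the toy hierarchy are assumptions and are not taken from the claim.

```python
def contributions(disease, parents, delta=0.5):
    """Contribution of `disease` and of each ancestor to the semantics of `disease`."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-ancestor contributions divided by the sum of both semantic values."""
    c1 = contributions(d1, parents, delta)
    c2 = contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Hypothetical hierarchy: dA and dB are both children of dC, whose parent is root.
parents = {"dA": ["dC"], "dB": ["dC"], "dC": ["root"]}
similarity = semantic_similarity("dA", "dB", parents)
```

In this toy hierarchy the two diseases share the nodes dC and root, so their similarity is (0.5 + 0.5 + 0.25 + 0.25) / (1.75 + 1.75).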
4. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 3, wherein the higher-order proximity matrix of the disease similarity matrix DS is calculated in step S1 as follows:
a q-order proximity matrix HD is constructed on the basis of the disease similarity matrix DS so as to preserve the proximity information of different orders of the disease semantic similarity matrix, as follows:
where $DS^{n}$ is the n-th order proximity of DS, and y is the weight parameter with $y \ge 0$;
singular value decomposition is then used to improve data quality:
$HD = U \Sigma V^{T}$
where $U \in \mathbb{R}^{n_d \times n_d}$ is the left singular vector matrix, $\Sigma \in \mathbb{R}^{n_d \times n_d}$ is the diagonal matrix of singular values in descending order, and $V \in \mathbb{R}^{n_d \times n_d}$ is the right singular vector matrix;
the high-order proximity matrix HD is then reconstructed by keeping the k largest singular values:
where $\Sigma_k$ is the diagonal matrix of the k largest singular values, and $U_k$ and $V_k$ are the left and right singular vector matrices corresponding to the top-k singular values.
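By way of non-limiting illustration, the two operations of this claim can be sketched with NumPy. The sketch assumes that the q-order proximity matrix is a weighted sum of the first q matrix powers of the similarity matrix, with one weight per order, and that the reconstruction is the product of the truncated factors. The function names, the weights, and the toy matrix are hypothetical.

```python
import numpy as np

def high_order_proximity(S, weights):
    """Weighted sum of matrix powers: weights[0]*S + weights[1]*S^2 + ..."""
    S = np.asarray(S, dtype=float)
    H = np.zeros_like(S)
    power = np.eye(S.shape[0])
    for w in weights:
        power = power @ S      # next order of proximity
        H += w * power
    return H

def truncate_svd(H, k):
    """Rebuild H from its k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(H)   # singular values are returned in descending order
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical 2 x 2 similarity matrix with second-order proximity (q = 2).
DS = np.array([[1.0, 0.5],
               [0.5, 1.0]])
HD = high_order_proximity(DS, weights=[1.0, 0.5])
HD_k = truncate_svd(HD, k=1)
```

Keeping only the largest singular values discards the low-energy components of the matrix, which is the sense in which the decomposition is used to improve data quality.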
5. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 4, wherein the calculation method of the higher-order proximity matrix HL of the lncRNA similarity matrix LS is the same as the calculation method of the higher-order proximity matrix HD of the disease similarity matrix DS.
6. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 5, wherein the disease-lncRNA adjacency matrix DL obtained in step S2 specifically comprises:
an lncRNA-disease association data set is downloaded from the LncRNADisease database, and duplicated lncRNAs and diseases as well as non-human data are deleted from it; the associations are represented by the disease-lncRNA adjacency matrix $DL \in \mathbb{R}^{n_d \times n_l}$, where $n_d$ and $n_l$ are the number of diseases and the number of lncRNAs, respectively, and the disease-lncRNA adjacency matrix DL is defined as follows:
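By way of non-limiting illustration, building such an adjacency matrix from a list of association records can be sketched in Python. The sketch assumes a binary definition in which an entry is 1 when the disease and the lncRNA are associated in the data set and 0 otherwise. The function name and the placeholder identifiers (`lnc1`, `dis1`, and so on) are hypothetical and do not correspond to database records.

```python
import numpy as np

def build_adjacency(pairs):
    """pairs: iterable of (lncRNA, disease) records -> (DL, diseases, lncrnas)."""
    unique_pairs = set(pairs)                        # drop repeated records
    diseases = sorted({d for _, d in unique_pairs})
    lncrnas = sorted({l for l, _ in unique_pairs})
    d_index = {d: i for i, d in enumerate(diseases)}
    l_index = {l: j for j, l in enumerate(lncrnas)}
    DL = np.zeros((len(diseases), len(lncrnas)))     # rows: diseases, columns: lncRNAs
    for l, d in unique_pairs:
        DL[d_index[d], l_index[l]] = 1.0
    return DL, diseases, lncrnas

# Placeholder records, including one repeated entry.
records = [("lnc1", "dis1"), ("lnc1", "dis1"), ("lnc2", "dis2"), ("lnc1", "dis2")]
DL, diseases, lncrnas = build_adjacency(records)
```

A zero entry in this matrix means only that no association is recorded, which is why the later completion step treats zero entries as unknown rather than as confirmed negatives.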
8. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm of claim 7, wherein the matrix completion method in step S4 is used to determine the values of the zero elements of DL in the disease-lncRNA association matrix T.
9. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm of claim 8, wherein the matrix completion method is adopted in step S4 to predict lncRNA-disease association, specifically:
let $\Omega$ be the index set of the observed entries of $X \in \mathbb{R}^{m \times n}$, and let $P_\Omega: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ be the linear projection operator:
a low-rank matrix completion algorithm is adopted to infer the missing values under the assumption that the matrix X is low-rank, and the model is described as follows:
s.t. $0 \le X \le 1$
where ω and α are non-negative parameters that balance the trace norm and the nuclear norm, and the constraint $0 \le X \le 1$ ensures that the elements of the recovered matrix take values between 0 and 1.
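By way of non-limiting illustration, the projection operator can be sketched with NumPy, assuming the usual definition in which entries whose indices lie in the observed set are kept and all other entries are set to zero. The function name and the toy values are hypothetical.

```python
import numpy as np

def project(X, observed):
    """Keep the entries of X at observed positions and zero out all others."""
    return np.where(observed, X, 0.0)

# Hypothetical 2 x 2 matrix in which only the diagonal entries are observed.
X = np.array([[0.2, 0.7],
              [0.9, 0.4]])
observed = np.array([[True, False],
                     [False, True]])
P = project(X, observed)
```

Under this definition the operator is idempotent, and the inner product of a projected matrix with a second matrix equals the inner product of the first matrix with the projection of the second.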
10. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm as claimed in claim 9, wherein in step S4 the alternating direction method of multipliers (ADMM) is used to solve the optimization problem of the model; an auxiliary matrix variable Y is introduced, and the model can be rewritten as follows:
s.t. $X = Y,\ 0 \le Y \le 1$
accordingly, the augmented Lagrangian function corresponding to this problem is:
where Z is the Lagrange multiplier, which enters through the standard trace inner product, and $\beta > 0$ is an adaptive penalty parameter; the k-th iteration alternately updates $Y_{k+1}$, $X_{k+1}$ and $Z_{k+1}$;
calculating $Y_{k+1}$: $X_k$ and $Z_k$ are fixed and the augmented Lagrangian function is minimized with respect to Y; using the adjoint operator of $P_\Omega$, $Y_{k+1}$ is updated as follows:
calculating $X_{k+1}$: $Y_k$ and $Z_k$ are fixed to calculate $X_{k+1}$;
based on the singular value thresholding algorithm, $X_{k+1}$ is expressed as follows:
where, in the definition of $S_\tau(\cdot)$, $\tau$ is the shrinkage threshold, $\sigma_d$ is the d-th singular value of the matrix R, and $u_d$ and $v_d$ are the corresponding left and right singular vectors;
calculating $Z_{k+1}$: finally, $Z_{k+1}$ is calculated as follows:
$Z_{k+1} = Z_k + \gamma \beta (X_{k+1} - Y_{k+1})$
where $\gamma$ is a non-negative learning rate.
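By way of non-limiting illustration, an alternating update loop of this kind can be sketched with NumPy. The sketch assumes a commonly used bounded nuclear-norm objective (the nuclear norm of X plus alpha/2 times the squared error on the observed entries of Y), which is not necessarily the exact objective of the claim. It also assumes the standard shrinkage operator that lowers each singular value by the threshold and floors it at zero, a fixed rather than adaptive penalty `beta`, and an X step that uses the freshly updated Y. All function names, parameter values, and the toy matrix are hypothetical.

```python
import numpy as np

def svt(R, tau):
    """Singular value thresholding: shrink each singular value by tau, floored at 0."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete(T, observed, alpha=10.0, beta=1.0, gamma=1.0, n_iter=200):
    """Fill the unobserved entries of T; returns a matrix with values in [0, 1]."""
    T0 = np.where(observed, T, 0.0)
    X = T0.copy()
    Y = np.clip(T0, 0.0, 1.0)
    Z = np.zeros_like(T0)
    for _ in range(n_iter):
        # Y step: closed-form minimizer per entry, then clipping to the box [0, 1].
        Y_obs = (alpha * T0 + Z + beta * X) / (alpha + beta)
        Y_free = X + Z / beta
        Y = np.clip(np.where(observed, Y_obs, Y_free), 0.0, 1.0)
        # X step: singular value thresholding with threshold 1 / beta.
        X = svt(Y - Z / beta, 1.0 / beta)
        # Z step: multiplier update with learning rate gamma.
        Z = Z + gamma * beta * (X - Y)
    return Y

# Hypothetical rank-1 matrix with one hidden entry.
T = np.outer([1.0, 0.5, 0.8], [1.0, 0.6, 0.4])
observed = np.ones_like(T, dtype=bool)
observed[2, 1] = False
recovered = complete(T, observed)
```

The entries of the returned matrix at unobserved positions serve as association scores; ranking them for a given disease yields a candidate list of the kind shown in Tables 1 to 3.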
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110295353.9A (CN113160880B) | 2021-03-19 | 2021-03-19 | lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113160880A | 2021-07-23 |
CN113160880B | 2023-06-06 |
Family
ID=76887938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110295353.9A (CN113160880B, Active) | lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm | 2021-03-19 | 2021-03-19 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160880B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106096331A (en) * | 2016-06-12 | 2016-11-09 | 中南大学 | A kind of method inferring lncRNA and disease contact |
CN109243538A (en) * | 2018-07-19 | 2019-01-18 | 长沙学院 | A kind of method and system of predictive disease and LncRNA incidence relation |
CN109935332A (en) * | 2019-03-01 | 2019-06-25 | 桂林电子科技大学 | A kind of miRNA- disease association prediction technique based on double random walk models |
CN110782945A (en) * | 2019-10-22 | 2020-02-11 | 长沙学院 | Method for identifying correlation between lncRNA and disease by using indirect and direct characteristic information |
US20200208153A1 (en) * | 2018-12-28 | 2020-07-02 | The Florida International University Board Of Trustees | Long noncoding rnas in pulmonary airway inflammation |
CN112289373A (en) * | 2020-10-27 | 2021-01-29 | 齐齐哈尔大学 | lncRNA-miRNA-disease association method fusing similarity |
CN112420127A (en) * | 2020-10-26 | 2021-02-26 | 大连民族大学 | Non-coding RNA and protein interaction prediction method based on secondary structure and multi-model fusion |
Non-Patent Citations (2)
Title |
---|
Yang Jindou et al.: "Research on prediction methods for associations between long non-coding RNAs and diseases", Intelligent Computer and Applications, 31 August 2020 (2020-08-31), pages 135 - 139 * |
Also Published As
Publication number | Publication date |
---|---|
CN113160880B (en) | 2023-06-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||