CN113160880A - lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm - Google Patents


Info

Publication number
CN113160880A
Authority
CN
China
Prior art keywords
matrix
lncrna
disease
similarity
association
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110295353.9A
Other languages
Chinese (zh)
Other versions
CN113160880B (en)
Inventor
林志毅
朱印廷
顾国生
孙宇平
谢国波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202110295353.9A
Publication of CN113160880A
Application granted
Publication of CN113160880B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Analytical Chemistry (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, comprising the following steps: S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS; S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the known lncRNA-disease associations; S3: constructing a heterogeneous disease-lncRNA association matrix that integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS; S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method. The method introduces high-order proximity to reconstruct the similarity matrices of lncRNAs and diseases, establishing a better metric for describing the similarity among lncRNAs and among diseases, and constructs a heterogeneous matrix so that the similarity information of lncRNAs and diseases assists prediction, thereby achieving more accurate lncRNA-disease association prediction.

Description

lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm
Technical Field
The invention relates to the interdisciplinary field of machine learning and genetics, in particular to an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm.
Background
LncRNAs (long non-coding RNAs) are a class of non-coding RNAs longer than 200 nucleotides. Numerous studies have shown that lncRNAs play key roles in many important biological processes, including translation, splicing, differentiation, epigenetic regulation, and immune responses. In recent years, researchers have found that lncRNA overexpression or dysregulation is closely related to complex diseases, including various cancers such as liver cancer (HCC), gastric cancer, breast cancer, and bladder cancer, as well as Parkinson's disease (PD). Developing computational methods to infer potential disease-lncRNA associations can therefore not only accelerate the diagnosis and treatment of diseases but also help explain disease mechanisms at the molecular level. In addition, computational methods reduce time cost and point biological research toward effective experimental directions. The identification of potential disease-associated lncRNAs is thus of great importance for discovering disease biomarkers and for the treatment, diagnosis, and prevention of complex human diseases. Because traditional experiments are time-consuming and labor-intensive, computational models can serve as effective auxiliary tools for identifying lncRNA-disease associations.
Over the years, a number of computational methods have been developed to infer potential lncRNA-disease associations. Existing methods can be roughly divided into three categories: (1) machine learning-based methods, (2) network-based methods, and (3) matrix completion-based methods.
Machine learning-based methods generally assume that functionally similar lncRNAs are associated with the same or closely related diseases. Unfortunately, most machine learning methods rely heavily on known labeled samples. Negative samples are difficult to obtain in practice, because usually only positive lncRNA-disease associations are reported, so these algorithms have difficulty with the negative class. Treating a large number of unknown samples as negative samples may wrongly label potential lncRNA-disease associations as negative, which degrades the prediction accuracy of the method.
To reduce the dependence on negative samples, network-based lncRNA-disease association prediction has recently become a popular research direction. For example, Chinese patent publication No. CN109243538A, published on January 18, 2019, discloses a method and system for predicting associations between diseases and lncRNAs, comprising: acquiring lncRNA-miRNA associations and miRNA-disease associations from known databases, and constructing an lncRNA-miRNA-disease interaction network from them; constructing a disease super-expression profile and an lncRNA super-expression profile based on the lncRNA-miRNA-disease interaction network; training a prediction model of disease-lncRNA association from these profiles, using lncRNA similarity and disease similarity calculations based on an RBF neural network; and predicting lncRNA-disease association pairs for candidate samples with the prediction model. However, because experimentally verified lncRNA-disease associations are still scarce, some lncRNA nodes and some disease nodes may have no connecting paths, which limits the prediction accuracy of such methods. In addition, when a new disease or lncRNA node is introduced, network-based methods face a cold-start problem, i.e., the node cannot be predicted, so they often need to handle isolated nodes separately or integrate additional biological information.
Meanwhile, although integrating other biological information, such as gene-disease associations and miRNA-disease associations, can improve prediction performance, some of these interactions may contain noise that interferes with the prediction result.
The third category uses matrix completion to mine lncRNA-disease associations. The main idea is to update the lncRNA-disease adjacency matrix and recover its missing entries, under the assumption that the elements of the final iterate are as close as possible to the known elements of the original adjacency matrix. Compared with the other two categories, matrix completion can capture the overall pattern of lncRNA-disease associations, reduces the false positive rate, and does not need negative samples. However, although existing matrix completion methods all fuse disease and lncRNA similarity information to assist association prediction, they focus on explicit similarity information, such as lncRNA functional similarity and disease semantic similarity, and ignore the higher-order implicit similarities among lncRNAs and among diseases.
The discovery of potential lncRNA-disease associations greatly aids the understanding of disease pathogenesis and the development of treatments for human diseases. Because traditional biological experiments are time-consuming and labor-intensive, an efficient and reliable computational prediction method is urgently needed. Developing a computational method to reveal unknown lncRNA-disease associations is therefore beneficial not only for understanding the main functions of lncRNAs in the pathology and molecular changes of human diseases, but also for the prognosis, treatment, and prevention of complex diseases.
However, although the methods applied to lncRNA-disease association prediction so far use disease and lncRNA similarity information to assist prediction, they all rely on the original, linear similarity information. Meanwhile, since negative lncRNA-disease association samples are difficult to obtain in practice, the prediction accuracy of many methods that require negative samples is affected. In addition, in matrix completion algorithms, a 1 in the lncRNA-disease association matrix represents a known lncRNA-disease association and a 0 represents an unknown one. Reasonable predictions should lie in the range [0,1], indicating the likelihood of an association. However, most current matrix completion methods cannot prevent predicted values from falling outside [0,1], which makes biological interpretation difficult.
Disclosure of Invention
The invention provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, which can better predict lncRNA-disease association.
In order to solve the technical problems, the technical scheme of the invention is as follows:
An lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm comprises the following steps:
S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS;
S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the known lncRNA-disease associations;
S3: constructing a heterogeneous disease-lncRNA association matrix that integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS;
S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method.
Preferably, the lncRNA similarity matrix LS in step S1 specifically includes:
lncRNA expression profiles generated by RNA-Seq technology are downloaded from ArrayExpress. Following previous research, the expression similarity of lncRNAs is obtained by calculating the Spearman correlation coefficient between the expression profiles of each lncRNA pair. The expression similarity of lncRNA li and lncRNA lj is stored in the matrix entry LS(li, lj), which lies between 0 and 1; the more similar the expression of li and lj, the higher the score.
Preferably, the disease similarity matrix DS in step S1 specifically includes:
After the MeSH (Medical Subject Headings) descriptors are downloaded from the U.S. National Library of Medicine, a model based on directed acyclic graphs (DAGs) is introduced to describe the semantic similarity between diseases. A disease d can be described by a DAG, i.e. DAG(d) = (d, T(d), E(d)), where T(d) is the node set and E(d) is the edge set. For a given disease d, the contribution value of its ancestor node q in DAG(d) is defined as follows:
Figure BDA0002984130360000031
Combining the contributions of its ancestor nodes in DAG(d), the semantic value of disease d can be described as:
Figure BDA0002984130360000041
The more nodes the DAGs of two diseases share, the higher their semantic similarity. The semantic similarity matrix entry DS(di, dj) represents the semantic similarity between disease di and disease dj, defined as:
Figure BDA0002984130360000042
Preferably, step S1 calculates the high-order proximity matrix of the disease similarity matrix DS, specifically:
a q-order proximity matrix HD is constructed on the basis of the disease similarity matrix DS so as to retain the proximity information of different orders of the disease semantic similarity matrix, as follows:
Figure BDA0002984130360000043
where DS^n is the n-th order proximity of DS, and y ≥ 0 is the weight parameter;
singular value decomposition (SVD) is used to improve data quality:
HD = UΣV^T
where U ∈ R^{nd×nd} is the left singular vector matrix, Σ ∈ R^{nd×nd} is the diagonal matrix of singular values in descending order, and V ∈ R^{nd×nd} is the right singular vector matrix;
the high-order proximity matrix HD is then reconstructed by keeping the k largest singular values:
Figure BDA0002984130360000044
where Σ_k is the diagonal matrix of the k largest singular values, and U_k and V_k are the left and right singular vector matrices corresponding to the top-k singular values.
Preferably, the high-order proximity matrix HL of the lncRNA similarity matrix LS is calculated in the same way as the high-order proximity matrix HD of the disease similarity matrix DS.
Preferably, the disease-lncRNA adjacency matrix DL obtained in step S2 is specifically:
An lncRNA-disease association dataset is downloaded from the LncRNADisease database, and duplicate lncRNAs, duplicate diseases, and non-human data are deleted from it. The remaining associations are described by the disease-lncRNA adjacency matrix DL ∈ R^{nd×nl}, where nd and nl are the number of diseases and the number of lncRNAs, respectively. The disease-lncRNA adjacency matrix DL is defined as follows:
Figure BDA0002984130360000045
Preferably, in step S3, a heterogeneous disease-lncRNA association matrix T is constructed, which is specifically defined as:
Figure BDA0002984130360000051
Preferably, the matrix completion method in step S4 is used to predict the entries of the disease-lncRNA association matrix T whose DL value is 0.
Preferably, in step S4, a matrix completion method is used to predict lncRNA-disease association, specifically:
Let Ω be the index set of the observed entries of X ∈ R^{m×n}, and let P_Ω: R^{m×n} → R^{m×n} be the linear projection operator:
Figure BDA0002984130360000052
A low-rank matrix completion algorithm is used to infer the missing values under the assumption that the matrix X is low-rank. The model is described as follows:
Figure BDA0002984130360000053
s.t. 0 ≤ X ≤ 1
where ω and α are non-negative parameters that balance the trace norm and the nuclear norm, and the constraint 0 ≤ X ≤ 1 ensures that the recovered matrix elements have values between 0 and 1.
Preferably, in step S4, the alternating direction method of multipliers (ADMM) is used to solve the model. An auxiliary variable matrix Y is introduced, and the model is rewritten as:
Figure BDA0002984130360000054
s.t. X = Y, 0 ≤ Y ≤ 1
Accordingly, the augmented Lagrangian function of this problem is:
Figure BDA0002984130360000055
where Z is the Lagrange multiplier, ⟨·,·⟩ is the standard trace inner product, and β > 0 is an adaptive penalty parameter. The k-th iteration alternately updates Y_{k+1}, X_{k+1}, and Z_{k+1}.
Calculating Y_{k+1}: fix X_k and Z_k and minimize the augmented Lagrangian with respect to Y:
Figure BDA0002984130360000056
Denote
Figure BDA0002984130360000057
as the adjoint operator of P_Ω,
Figure BDA0002984130360000058
Y_{k+1} is then updated as follows:
Figure BDA0002984130360000059
Figure BDA0002984130360000061
Calculating X_{k+1}: fix Y_k and Z_k to calculate X_{k+1}.
Based on the singular value thresholding algorithm, X_{k+1} is expressed as follows:
Figure BDA0002984130360000062
where S_τ(·) is defined as
Figure BDA0002984130360000063
where τ is the shrinkage threshold, σ_d is the d-th singular value of the matrix R, and u_d and v_d are the corresponding left and right singular vectors;
Calculating Z_{k+1}: finally, Z_{k+1} is calculated as follows:
Z_{k+1} = Z_k + γβ(X_{k+1} − Y_{k+1})
where γ is a non-negative learning rate.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
The method introduces high-order proximity to reconstruct the similarity matrices of lncRNAs and diseases, establishing a better metric for describing the similarity among lncRNAs and among diseases; it constructs a heterogeneous matrix so that the similarity information of lncRNAs and diseases assists prediction; and it designs a matrix completion algorithm with bounded predicted values to predict the likelihood of lncRNA-disease associations, thereby achieving more accurate lncRNA-disease association prediction.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 shows the AUCs achieved by HOPMC, GMCLDA, DSCMF, SIMCLDA, BRWLDA and RWRlncD under leave-one-out cross-validation in the example.
FIG. 3 shows the AUCs achieved by HOPMC, GMCLDA, DSCMF, SIMCLDA, BRWLDA and RWRlncD under 5-fold cross-validation in the example.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides an lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, as shown in FIG. 1, including the following steps:
S1: calculating high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS;
S2: obtaining a disease-lncRNA adjacency matrix DL, which describes the known lncRNA-disease associations;
S3: constructing a heterogeneous disease-lncRNA association matrix that integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS;
S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method.
The lncRNA similarity matrix LS in step S1 specifically includes:
lncRNA expression profiles generated by RNA-Seq technology are downloaded from ArrayExpress. Following previous research, the expression similarity of lncRNAs is obtained by calculating the Spearman correlation coefficient between the expression profiles of each lncRNA pair. The expression similarity of lncRNA li and lncRNA lj is stored in the matrix entry LS(li, lj), which lies between 0 and 1; the more similar the expression of li and lj, the higher the score.
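The expression-similarity step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and because the text does not say how a Spearman coefficient in [-1, 1] is mapped to the stated range of 0 to 1, the sketch assumes the absolute value is taken.

```python
import numpy as np

def average_ranks(values):
    """Rank a 1-D profile (1 = smallest); tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_similarity(profiles):
    """profiles: (n_lncRNA, n_samples) expression matrix, one row per lncRNA.

    Spearman correlation is the Pearson correlation of the ranked profiles.
    ASSUMPTION: negative correlations are mapped into [0, 1] by abs().
    Constant profiles have undefined correlation and are not handled here.
    """
    ranked = np.vstack([average_ranks(row) for row in profiles])
    rho = np.corrcoef(ranked)
    return np.abs(rho)

# Toy expression profiles (placeholders, not real data).
profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.0, 4.0, 6.0, 9.0],
                     [4.0, 3.0, 2.0, 1.0]])
LS = spearman_similarity(profiles)
```

Clipping negative coefficients to 0 instead of taking the absolute value would be an equally plausible reading of the text.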
In step S1, the disease similarity matrix DS specifically includes:
After the MeSH (Medical Subject Headings) descriptors are downloaded from the U.S. National Library of Medicine, a model based on directed acyclic graphs (DAGs) is introduced to describe the semantic similarity between diseases. A disease d can be described by a DAG, i.e. DAG(d) = (d, T(d), E(d)), where T(d) is the node set and E(d) is the edge set. For a given disease d, the contribution value of its ancestor node q in DAG(d) is defined as follows:
Figure BDA0002984130360000071
Combining the contributions of its ancestor nodes in DAG(d), the semantic value of disease d can be described as:
Figure BDA0002984130360000072
The more nodes the DAGs of two diseases share, the higher their semantic similarity. The semantic similarity matrix entry DS(di, dj) represents the semantic similarity between disease di and disease dj, defined as:
Figure BDA0002984130360000081
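The three formulas above survive only as image placeholders in this text, so the following is a hedged sketch rather than the patent's exact definition. It assumes the widely used DAG-based formulation in which a disease contributes 1 to itself, each ancestor receives the largest child contribution multiplied by a decay factor (0.5 is assumed here), the semantic value is the sum of contributions, and the similarity is the shared contribution divided by the sum of the two semantic values. The `parents` dictionary and all function names are hypothetical.

```python
def contributions(disease, parents, delta=0.5):
    """Contribution of every node in DAG(disease) to the disease itself.

    parents maps a term to the list of its direct parent terms.
    ASSUMPTION: decay factor delta = 0.5; the best (largest) path wins.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared contribution of the two DAGs over the sum of semantic values."""
    c1 = contributions(d1, parents, delta)
    c2 = contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Toy hierarchy: diseases "A" and "B" share the parent term "C".
toy_parents = {"A": ["C"], "B": ["C"], "C": []}
sim_ab = semantic_similarity("A", "B", toy_parents)
```

In the toy hierarchy each disease has semantic value 1.5 and the only shared node is C, so the similarity is (0.5 + 0.5) / 3.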
In step S1, calculating the high-order proximity matrix of the disease similarity matrix DS, specifically:
According to biological experimental observations, one of the basic hypotheses of lncRNA-disease prediction is that lncRNAs with similar functions are often associated with phenotypically similar diseases, and vice versa. The similarity measures of lncRNAs and of diseases are therefore key to predicting lncRNA-disease associations. In contrast to explicit pairwise similarities, higher-order similarities describe indirect similarity information between matrix elements. For example, in a network, if nodes vi and vj have many common neighbors and rich path information, then vi has a high probability of reaching vj through a two-step random walk, which means that the second-order proximity of the two nodes is high. Inferring the similarity measures of lncRNAs and diseases by considering higher-order proximity therefore expresses their similarity information more effectively. A q-order proximity matrix HD is constructed on the basis of the disease similarity matrix DS so as to retain the proximity information of different orders of the disease semantic similarity matrix, as follows:
Figure BDA0002984130360000082
where DS^n is the n-th order proximity of DS, and y ≥ 0 is the weight parameter;
However, due to the high dimensionality of the matrix, noise may be present in HD, so singular value decomposition (SVD) is employed to improve data quality. The details of the SVD are as follows:
HD = UΣV^T
where U ∈ R^{nd×nd} is the left singular vector matrix, Σ ∈ R^{nd×nd} is the diagonal matrix of singular values in descending order, V ∈ R^{nd×nd} is the right singular vector matrix, and nd is the number of diseases;
the high-order proximity matrix HD is then reconstructed by keeping the k largest singular values:
Figure BDA0002984130360000083
where Σ_k is the diagonal matrix of the k largest singular values, and U_k and V_k are the left and right singular vector matrices corresponding to the top-k singular values.
The calculation method of the high-order proximity matrix HL of the lncRNA similarity matrix LS is the same as that of the high-order proximity matrix HD of the disease similarity matrix DS.
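A minimal sketch of the high-order proximity construction and the rank-k reconstruction is given below. The exact weighting formula is an image placeholder in this text, so the sketch assumes a weighted sum of matrix powers, HD = w_1·DS + w_2·DS^2 + ... + w_q·DS^q, with one non-negative weight per order; the function names and the `weights` argument are hypothetical.

```python
import numpy as np

def high_order_proximity(S, weights):
    """Weighted sum of the first q powers of a similarity matrix S.

    ASSUMPTION: weights[n-1] is the non-negative weight of the n-th order
    proximity S^n; the patent's exact weighting scheme is not recoverable.
    """
    S = np.asarray(S, dtype=float)
    H = np.zeros_like(S)
    power = np.eye(S.shape[0])
    for w in weights:
        power = power @ S          # S^1, S^2, ..., S^q
        H += w * power
    return H

def truncate_svd(H, k):
    """Reconstruct H from its k largest singular values (H_k = U_k Σ_k V_k^T)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Toy 2x2 similarity matrix (placeholder values).
DS = np.array([[1.0, 0.5],
               [0.5, 1.0]])
HD = high_order_proximity(DS, weights=[1.0, 0.5])   # q = 2
HD_denoised = truncate_svd(HD, k=1)
```

The same two functions would apply unchanged to the lncRNA similarity matrix LS to obtain HL.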
In step S2, acquiring the disease-lncRNA adjacency matrix DL specifically as follows:
An lncRNA-disease association dataset comprising 687 experimentally validated lncRNA-disease associations is downloaded from the LncRNADisease database. After duplicate lncRNAs, duplicate diseases, and non-human data are deleted, 540 unique experimentally validated lncRNA-disease associations between 115 unique lncRNAs and 178 unique diseases are obtained. These associations are described by the disease-lncRNA adjacency matrix DL ∈ R^{nd×nl}, where nd and nl are the number of diseases and the number of lncRNAs, respectively. If disease di is associated with lncRNA lj, then DL(i, j) = 1, otherwise DL(i, j) = 0. The disease-lncRNA adjacency matrix DL is defined as follows:
Figure BDA0002984130360000091
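Building DL from a list of association pairs can be sketched as below. The pair format, the function name, and the placeholder identifiers are assumptions; the actual file layout of the LncRNADisease download is not described in the text.

```python
import numpy as np

def build_adjacency(pairs):
    """Build the disease-lncRNA adjacency matrix DL from (disease, lncRNA) pairs.

    Duplicate pairs collapse to a single 1. Rows index diseases and columns
    index lncRNAs, both in sorted order (an arbitrary but reproducible choice).
    """
    diseases = sorted({d for d, _ in pairs})
    lncrnas = sorted({l for _, l in pairs})
    d_index = {d: i for i, d in enumerate(diseases)}
    l_index = {l: j for j, l in enumerate(lncrnas)}
    DL = np.zeros((len(diseases), len(lncrnas)))
    for d, l in pairs:
        DL[d_index[d], l_index[l]] = 1.0   # DL(i, j) = 1 for a known association
    return DL, diseases, lncrnas

# Placeholder identifiers, including one duplicated pair.
toy_pairs = [("d1", "l1"), ("d1", "l1"), ("d2", "l2"), ("d2", "l1")]
DL, disease_names, lncrna_names = build_adjacency(toy_pairs)
```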
In step S3, a heterogeneous disease-lncRNA association matrix T is constructed, which is specifically defined as:
Figure BDA0002984130360000092
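The definition of T is only an image placeholder here, so the block layout in the sketch below is an assumption: HL in the top-left block, HD in the bottom-right block, and DL with its transpose in the off-diagonal blocks, which is a common arrangement for heterogeneous association matrices. The sketch also builds an observation mask under the reading that only the zero entries of DL are treated as unknown; both function names are hypothetical.

```python
import numpy as np

def build_heterogeneous(HL, HD, DL):
    """ASSUMED layout: T = [[HL, DL^T], [DL, HD]], size (nl + nd) x (nl + nd)."""
    return np.block([[HL, DL.T],
                     [DL, HD]])

def observed_mask(DL):
    """Boolean mask of entries treated as observed.

    ASSUMPTION: similarity blocks are fully observed; within the DL blocks
    only the known associations (value 1) are observed.
    """
    nd, nl = DL.shape
    mask = np.ones((nl + nd, nl + nd), dtype=bool)
    mask[nl:, :nl] = DL != 0
    mask[:nl, nl:] = DL.T != 0
    return mask

# Toy inputs: 3 lncRNAs, 2 diseases (placeholder values).
HL = np.eye(3)
HD = np.eye(2)
DL = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])
T = build_heterogeneous(HL, HD, DL)
mask = observed_mask(DL)
```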
The matrix completion method in step S4 is used to predict the entries of the disease-lncRNA association matrix T whose DL value is 0.
In step S4, a matrix completion method is used to predict lncRNA-disease associations, specifically:
Under the hypothesis that functionally similar lncRNAs tend to be associated with similar diseases, the underlying factors that determine whether an lncRNA is associated with a disease tend to be highly correlated, which produces correlations in the corresponding data matrix. Therefore, in the heterogeneous disease-lncRNA association matrix T, lncRNA-disease interactions are governed by a limited number of independent factors, which gives T a low-rank structure. Matrix completion is therefore used to predict potential disease-lncRNA associations.
Let Ω be the index set of the observed entries of X ∈ R^{m×n}, and let P_Ω: R^{m×n} → R^{m×n} be the linear projection operator:
Figure BDA0002984130360000093
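The projection operator has a standard form in matrix completion: it keeps the entries whose indices lie in Ω and sets all others to zero. A minimal sketch, assuming Ω is represented as a boolean mask (the function name is hypothetical):

```python
import numpy as np

def project(X, mask):
    """P_Omega(X): keep entries where mask is True, zero out the rest."""
    return np.where(mask, X, 0.0)

X_toy = np.array([[0.2, 0.9],
                  [0.7, 0.4]])
omega = np.array([[True, False],
                  [False, True]])
projected = project(X_toy, omega)
```

Applying the operator twice gives the same result as applying it once, which is what makes the elementwise updates in the later optimization steps simple.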
A low-rank matrix completion algorithm is used to infer the missing values under the assumption that the matrix X is low-rank. The model is described as follows:
Figure BDA0002984130360000094
s.t. 0 ≤ X ≤ 1
where ω and α are non-negative parameters that balance the trace norm and the nuclear norm, and the constraint 0 ≤ X ≤ 1 ensures that the recovered matrix elements have values between 0 and 1, which makes the results biologically easier to interpret.
In step S4, the alternating direction method of multipliers (ADMM) is used to solve the model. An auxiliary variable matrix Y is introduced, and the model is rewritten as:
Figure BDA0002984130360000101
s.t. X = Y, 0 ≤ Y ≤ 1
Accordingly, the augmented Lagrangian function of this problem is:
Figure BDA0002984130360000102
where Z is the Lagrange multiplier, ⟨·,·⟩ is the standard trace inner product, and β > 0 is an adaptive penalty parameter. The k-th iteration alternately updates Y_{k+1}, X_{k+1}, and Z_{k+1}.
Calculating Y_{k+1}: fix X_k and Z_k and minimize the augmented Lagrangian with respect to Y:
Figure BDA0002984130360000103
Denote
Figure BDA0002984130360000104
as the adjoint operator of P_Ω,
Figure BDA0002984130360000105
Y_{k+1} is then updated as follows:
Figure BDA0002984130360000106
Calculating X_{k+1}: fix Y_k and Z_k to calculate X_{k+1}.
Based on the singular value thresholding algorithm, X_{k+1} is expressed as follows:
Figure BDA0002984130360000107
where S_τ(·) is defined as
Figure BDA0002984130360000108
where τ is the shrinkage threshold, σ_d is the d-th singular value of the matrix R, and u_d and v_d are the corresponding left and right singular vectors;
Calculating Z_{k+1}: finally, Z_{k+1} is calculated as follows:
Z_{k+1} = Z_k + γβ(X_{k+1} − Y_{k+1})
where γ is a non-negative learning rate, which may be set to 1 in this embodiment.
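The alternating updates can be sketched as a short loop. Because the objective and the update formulas are image placeholders in this text, the sketch assumes a bounded nuclear norm model, minimize ||X||_* + (α/2)·||P_Ω(Y) − P_Ω(T)||_F^2 subject to X = Y and 0 ≤ Y ≤ 1, and derives the updates from that assumed objective. The second balance parameter ω mentioned in the text is not modeled, β is held fixed rather than adapted, and all function and parameter names are hypothetical.

```python
import numpy as np

def svt(R, tau):
    """Singular value thresholding: shrink each singular value of R by tau."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def bounded_completion(T, mask, alpha=1.0, beta=1.0, gamma=1.0,
                       n_iter=200, tol=1e-6):
    """ADMM sketch for the ASSUMED model described in the lead-in.

    T: matrix with known entries; mask: True where an entry is observed.
    Returns Y, whose entries are guaranteed to lie in [0, 1].
    """
    M = mask.astype(float)
    X = np.where(mask, T, 0.0)
    Y = X.copy()
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        # Y-step: elementwise minimizer of the quadratic terms, then clip,
        # which is exact because the problem separates per entry.
        Y_new = (Z + beta * X + alpha * M * T) / (beta + alpha * M)
        Y_new = np.clip(Y_new, 0.0, 1.0)
        # X-step: nuclear norm proximal step (uses the fresh Y, as in
        # standard Gauss-Seidel ADMM).
        X = svt(Y_new - Z / beta, 1.0 / beta)
        # Z-step: multiplier update with learning rate gamma.
        Z = Z + gamma * beta * (X - Y_new)
        change = np.linalg.norm(Y_new - Y)
        Y = Y_new
        if change <= tol * max(1.0, np.linalg.norm(Y)):
            break
    return Y

# Toy low-rank matrix with roughly 30% of entries hidden.
rng = np.random.default_rng(0)
T_toy = np.clip(rng.random((6, 2)) @ rng.random((2, 5)) / 2.0, 0.0, 1.0)
mask_toy = rng.random(T_toy.shape) < 0.7
scores = bounded_completion(T_toy, mask_toy)
```

Returning Y rather than X is a design choice of this sketch: Y is the clipped variable, so every returned score can be read directly as an association likelihood.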
To examine the prediction accuracy of the method of this embodiment (HOPMC), HOPMC was compared with five state-of-the-art methods: GMCLDA, SIMCLDA, DSCMF, BRWLDA, and RWRlncD. As can be seen from FIG. 2, under the leave-one-out cross-validation framework the area under the curve (AUC) of HOPMC is 0.8757, which is larger than that of the other methods (GMCLDA 0.8501, SIMCLDA 0.8237, DSCMF 0.8176, BRWLDA 0.7969, RWRlncD 0.6540), indicating that HOPMC performs better than the other methods. To further validate the predictive performance of HOPMC, a 5-fold cross-validation framework was also used. As can be seen from FIG. 3, HOPMC achieves an AUC of 0.8353 ± 0.0045, exceeding the AUC values of the other methods (0.7894 ± 0.0040, 0.7839 ± 0.0045, 0.7734 ± 0.0045, 0.7659 ± 0.0045, and 0.6179 ± 0.0045). This means that HOPMC is also more effective than the other methods under the 5-fold cross-validation framework. These results indicate that HOPMC outperforms the compared methods and is better suited to predicting lncRNA-disease associations.
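For reference, an AUC of the kind reported above can be computed from held-out labels and prediction scores as the probability that a positive pair is scored above a negative pair, with ties counted as one half. The sketch below is a generic illustration; the patent does not describe its evaluation code, and the function name is hypothetical.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy example: both positives outrank both negatives.
toy_auc = auc_score([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
```

In leave-one-out cross-validation, each known association would in turn be hidden, scored by the model, and compared against the unknown pairs in this way.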
HOPMC was also used to predict the utility of known lncRNA in the prediction of actual lncRNA-disease. In predicting new lncRNA-disease associations, we use the known lncRNA-disease associations as a training dataset for the HOPMC, and then calculate and rank the prediction scores for each unknown lncRNA-disease pair. We chose osteosarcoma, gastric cancer and hepatocellular carcinoma as case studies. The top 10 Cancer lncrnas were validated in the third party databases (Lnc2Cancer and MNDR). The results are shown in Table I, Table II and Table III, which indicate that 100%, 90% and 90% of the predicted lncRNA are associated with cancer
In addition, HOPMC predicted some lncRNA-disease associations that have not yet been confirmed, including MINA with osteosarcoma and PCA3 with hepatocellular carcinoma. These predicted associations have not been reported in the current literature, but they are promising candidates for medical researchers to study and validate.
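The ranking step used in the case studies can be sketched as follows. This is an illustration only: the dictionary layout, the function name `top_candidates` and the placeholder disease and lncRNA names are assumptions, not part of the described method.

```python
def top_candidates(scores, known, disease, k=10):
    """Rank lncRNAs not yet known to be associated with `disease` by score.

    scores: dict mapping (disease, lncRNA) -> predicted score
    known:  set of (disease, lncRNA) pairs used for training
    """
    candidates = [
        (lnc, s)
        for (d, lnc), s in scores.items()
        if d == disease and (d, lnc) not in known
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]

# Placeholder names and scores, for illustration only.
scores = {
    ("disease_x", "lncA"): 0.9,
    ("disease_x", "lncB"): 0.4,
    ("disease_x", "lncC"): 0.7,
    ("disease_y", "lncA"): 0.95,
}
known = {("disease_x", "lncA")}
ranked = top_candidates(scores, known, "disease_x", k=2)
```

The returned candidates would then be checked by hand against external databases, as done in the case studies.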
Table I. Top 10 potential lncRNAs associated with gastric cancer predicted by HOPMC
Figure BDA0002984130360000111
Table II. Top 10 potential lncRNAs associated with osteosarcoma predicted by HOPMC
Figure BDA0002984130360000121
Table III. Top 10 potential lncRNAs associated with hepatocellular carcinoma predicted by HOPMC
Figure BDA0002984130360000122
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. An lncRNA-disease association prediction method based on high-order proximity and a matrix completion algorithm, characterized by comprising the following steps:
S1: calculating the high-order proximity matrices of the lncRNA similarity matrix LS and the disease similarity matrix DS;
S2: obtaining a disease-lncRNA adjacency matrix DL, wherein the disease-lncRNA adjacency matrix describes the lncRNA-disease association relationships;
S3: constructing a heterogeneous disease-lncRNA association matrix, wherein the disease-lncRNA association matrix integrates the disease-lncRNA adjacency matrix DL, the high-order proximity matrix of the lncRNA similarity matrix LS, and the high-order proximity matrix of the disease similarity matrix DS;
S4: predicting lncRNA-disease associations in the disease-lncRNA association matrix using a matrix completion method.
2. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 1, wherein the lncRNA similarity matrix LS in step S1 is specifically:
downloading lncRNA expression profiles, generated by RNA-Seq technology, from ArrayExpress; following previous research, the expression similarity of lncRNAs is obtained by calculating the Spearman correlation coefficient between the expression profiles of each lncRNA pair; the expression similarity of lncRNA $l_i$ and lncRNA $l_j$ is described by the matrix entry $LS(l_i, l_j)$, whose value lies between 0 and 1; the higher the expression similarity of $l_i$ and $l_j$, the higher the score.
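A Spearman-based expression similarity can be sketched as below. The Spearman coefficient itself ranges from -1 to 1, and the text does not state how it is mapped to the interval from 0 to 1; the sketch takes the absolute value purely as an assumption. The function names and the toy profiles are hypothetical.

```python
import numpy as np

def rank_values(x):
    """Ranks 1..n; tied values share the average of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def expression_similarity(profiles):
    """profiles: (n_lncRNA, n_samples); returns |Spearman rho| per pair."""
    ranked = np.vstack([rank_values(row) for row in profiles])
    rho = np.corrcoef(ranked)  # Pearson correlation of ranks = Spearman rho
    return np.abs(rho)         # assumption: map [-1, 1] into [0, 1]

# Toy expression profiles: 3 lncRNAs measured in 4 samples.
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 9.0],
    [10.0, 30.0, 20.0, 40.0],
])
LS = expression_similarity(profiles)
```

A profile that is constant across samples has no defined rank correlation and would need separate handling.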
3. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 1, wherein the disease similarity matrix DS in step S1 is specifically:
after downloading the MeSH descriptors from the National Library of Medicine, a model based on a directed acyclic graph (DAG) is introduced to describe the semantic similarity between diseases; a disease d can be described by a DAG, i.e. $\mathrm{DAG}(d) = (T(d), E(d))$, where $T(d)$ is the node set and $E(d)$ is the edge set; for a given disease d, the contribution value of its ancestor node q in DAG(d) is defined as follows:
Figure FDA0002984130350000011
combining the contributions of its ancestor nodes in DAG(d), the semantic value of disease d can be described as:
Figure FDA0002984130350000012
the semantic similarity between two diseases is considered higher if their DAGs share more nodes; the semantic similarity matrix entry $DS(d_i, d_j)$ represents the semantic similarity between disease $d_i$ and disease $d_j$, and is defined as:
Figure FDA0002984130350000021
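The DAG-based similarity can be sketched as follows. The formulas of this claim are contained in figures that are not reproduced in this text, so the sketch assumes a widely used formulation: a disease contributes 1 to itself, each step up to a parent multiplies the contribution by a decay factor, the semantic value is the sum of contributions, and the similarity is the summed contributions of shared nodes divided by the sum of both semantic values. The decay factor 0.5, the function names and the toy DAG are all assumptions and may differ from the claimed formulas.

```python
def contributions(disease, parents, delta=0.5):
    """Contribution of `disease` and each of its ancestors to `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        nxt = []
        for term in frontier:
            for p in parents.get(term, []):
                value = delta * contrib[term]
                if value > contrib.get(p, 0.0):  # keep the largest contribution
                    contrib[p] = value
                    nxt.append(p)
        frontier = nxt
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-node contributions divided by the sum of both semantic values."""
    c1 = contributions(d1, parents, delta)
    c2 = contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Toy DAG: two diseases share one parent, which has one root ancestor.
parents = {"d1": ["p"], "d2": ["p"], "p": ["root"]}
sim = semantic_similarity("d1", "d2", parents)
```

Under these assumptions a disease compared with itself scores 1, and two diseases with no common ancestor score 0.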
4. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 3, wherein the higher-order proximity matrix of the disease similarity matrix DS is calculated in step S1, and specifically comprises:
constructing a q-order proximity matrix HD on the basis of the disease similarity matrix DS so as to keep different order proximity information of the disease semantic similarity matrix as follows:
Figure FDA0002984130350000022
where $DS^{n}$ is the n-th order proximity of DS, and y is a weight parameter with $y \ge 0$;
singular value decomposition is used to improve data quality:
$HD = U \Sigma V^{T}$
where $U \in \mathbb{R}^{n_d \times n_d}$ is the left singular vector matrix, $\Sigma \in \mathbb{R}^{n_d \times n_d}$ is the diagonal matrix of singular values in descending order, and $V \in \mathbb{R}^{n_d \times n_d}$ is the right singular vector matrix;
the high order proximity matrix HD is then reconstructed by keeping the k largest singular values:
Figure FDA0002984130350000023
where $\Sigma_k$ is the diagonal matrix of the k largest singular values, and $U_k$ and $V_k$ are the left and right singular vector matrices corresponding to the top-k singular values, respectively.
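The two steps of this claim, building a q-order proximity matrix and keeping the k largest singular values, can be sketched with NumPy. The exact weighting formula is in a figure that is not reproduced here, so the sketch assumes that the n-th order proximity is the n-th matrix power of DS and takes one caller-supplied weight per order; the function names and toy values are hypothetical.

```python
import numpy as np

def high_order_proximity(S, weights):
    """Weighted sum of matrix powers: weights[0]*S + weights[1]*S@S + ..."""
    H = np.zeros_like(S, dtype=float)
    power = np.eye(S.shape[0])
    for w in weights:
        power = power @ S  # next order of proximity
        H += w * power
    return H

def truncate_svd(H, k):
    """Rebuild H from its k largest singular values and vectors."""
    U, s, Vt = np.linalg.svd(H)  # singular values come sorted descending
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy 2x2 similarity matrix, order q = 2, illustrative weights.
DS = np.array([[1.0, 0.5], [0.5, 1.0]])
HD = high_order_proximity(DS, weights=[1.0, 0.5])
HD_1 = truncate_svd(HD, k=1)
```

Discarding the small singular values removes the weakest components of the proximity matrix, which is the sense in which the decomposition is said to improve data quality.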
5. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 4, wherein the calculation method of the higher-order proximity matrix HL of the lncRNA similarity matrix LS is the same as the calculation method of the higher-order proximity matrix HD of the disease similarity matrix DS.
6. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 5, wherein the disease-lncRNA adjacency matrix DL obtained in step S2 specifically comprises:
downloading the lncRNA-disease association dataset from the LncRNADisease database, deleting duplicate lncRNA and disease entries and non-human data from the dataset, and representing the associations with the disease-lncRNA adjacency matrix $DL \in \mathbb{R}^{n_d \times n_l}$, where $n_d$ and $n_l$ are the number of diseases and the number of lncRNAs, respectively; the disease-lncRNA adjacency matrix DL is defined as follows:
Figure FDA0002984130350000031
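Building the adjacency matrix from a list of association records can be sketched as follows. The definition itself is in a figure that is not reproduced here; the sketch assumes the usual binary convention (1 for a known association, 0 otherwise). The function name and the placeholder identifiers are hypothetical.

```python
def build_adjacency(pairs, diseases, lncrnas):
    """Return DL as a list of rows (diseases) by columns (lncRNAs)."""
    d_index = {d: i for i, d in enumerate(diseases)}
    l_index = {l: j for j, l in enumerate(lncrnas)}
    DL = [[0] * len(lncrnas) for _ in diseases]
    for disease, lncrna in pairs:
        # Repeated records simply set the same entry to 1 again.
        DL[d_index[disease]][l_index[lncrna]] = 1
    return DL

# Placeholder identifiers, for illustration only.
diseases = ["disease_a", "disease_b"]
lncrnas = ["lnc_1", "lnc_2", "lnc_3"]
pairs = [("disease_a", "lnc_2"), ("disease_b", "lnc_1"), ("disease_a", "lnc_2")]
DL = build_adjacency(pairs, diseases, lncrnas)
```

Because the assignment is idempotent, duplicate records left in the input do not change the result.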
7. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm according to claim 6, wherein a heterogeneous disease-lncRNA association matrix is constructed in step S3, and is specifically defined as:
Figure FDA0002984130350000032
8. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm of claim 7, wherein the matrix completion method in step S4 is used to determine the values of the zero-valued elements of DL in the disease-lncRNA association matrix T.
9. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm of claim 8, wherein the matrix completion method is adopted in step S4 to predict lncRNA-disease association, specifically:
let $\Omega$ be the index set of the observed entries of $X \in \mathbb{R}^{m \times n}$, and let $P_\Omega: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ be the linear projection operator:
Figure FDA0002984130350000033
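The projection operator can be sketched with a boolean mask. The definition is in a figure that is not reproduced here; the sketch assumes the usual convention in matrix completion, in which entries whose index lies in the observed set are kept and all others are set to zero. The function name and toy values are hypothetical.

```python
import numpy as np

def project_observed(X, mask):
    """Keep entries where mask is True and set all other entries to 0."""
    return np.where(mask, X, 0.0)

# Toy matrix with two observed entries on the diagonal.
X = np.array([[0.2, 0.9], [0.4, 0.1]])
mask = np.array([[True, False], [False, True]])
P = project_observed(X, mask)
```

Applying the operator twice gives the same result as applying it once, which is what makes it a projection.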
a low-rank matrix completion algorithm is adopted to infer the missing values under the assumption that the matrix X is low-rank; the model is described as follows:
Figure FDA0002984130350000034
s.t. $0 \le X \le 1$
where $\omega$ and $\alpha$ are non-negative parameters that balance the trace norm and the kernel norm, and the constraint $0 \le X \le 1$ ensures that the recovered matrix elements take values between 0 and 1.
10. The lncRNA-disease association prediction method based on the higher-order proximity and matrix completion algorithm as claimed in claim 9, wherein in step S4 the alternating direction method of multipliers (ADMM) is used to solve the model; an auxiliary variable matrix Y is introduced, and the model is rewritten as the following optimization problem:
Figure FDA0002984130350000035
s.t. $X = Y,\ 0 \le Y \le 1$
accordingly, the augmented Lagrangian function corresponding to this equation is:
Figure FDA0002984130350000041
where Z is the Lagrange multiplier matrix, $\langle \cdot, \cdot \rangle$ denotes the standard trace inner product, and $\beta > 0$ is an adaptive penalty parameter; in the k-th iteration, $Y_{k+1}$, $X_{k+1}$ and $Z_{k+1}$ are updated alternately:
calculating $Y_{k+1}$: we fix $X_k$ and $Z_k$ and minimize
Figure FDA0002984130350000042
let
Figure FDA0002984130350000043
denote the adjoint operator of $P_\Omega$;
Figure FDA0002984130350000044
$Y_{k+1}$ is then updated as follows:
Figure FDA0002984130350000045
calculating $X_{k+1}$: we fix $Y_k$ and $Z_k$ to calculate $X_{k+1}$;
based on the singular value thresholding algorithm, $X_{k+1}$ is expressed as follows:
Figure FDA0002984130350000046
where $S_\tau(\cdot)$ is defined as
Figure FDA0002984130350000047
where $\tau$ is the shrinkage threshold, $\sigma_d$ is the d-th singular value of the matrix R, and $u_d$ and $v_d$ are the corresponding left and right singular vectors, respectively;
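A singular value thresholding operator of this kind can be sketched with NumPy. The definition is in a figure that is not reproduced here; the sketch assumes the standard soft-thresholding form, in which each singular value is reduced by the threshold and values that would become negative are set to zero. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def singular_value_threshold(R, tau):
    """Shrink the singular values of R by tau and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # soft-threshold each sigma_d
    return (U * s_shrunk) @ Vt           # sum_d s_shrunk[d] * u_d v_d^T

# Toy matrix with singular values 3 and 1, thresholded at tau = 2.
R = np.diag([3.0, 1.0])
X_new = singular_value_threshold(R, tau=2.0)
```

Singular values below the threshold vanish, so the operator lowers the rank of its input, which is why it appears in the update of the low-rank variable.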
calculating $Z_{k+1}$: finally, $Z_{k+1}$ is calculated as follows:
$Z_{k+1} = Z_k + \gamma\beta\,(X_{k+1} - Y_{k+1})$
where γ is a non-negative learning rate.
CN202110295353.9A 2021-03-19 2021-03-19 lncRNA-disease association prediction method based on high-order proximity and matrix completion algorithm Active CN113160880B (en)

Publications (2)

Publication Number Publication Date
CN113160880A 2021-07-23
CN113160880B 2023-06-06

Family

ID=76887938

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant