CN113539479A

CN113539479A - Similarity constraint-based miRNA-disease association prediction method and system

Info

Publication number: CN113539479A
Application number: CN202110730370.0A
Authority: CN
Inventors: 王红; 余盛朋; 梁成; 王正军; 杨杰
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-10-22
Anticipated expiration: 2041-06-29
Also published as: CN113539479B

Abstract

The invention provides a miRNA-disease association prediction method and system based on similarity constraint. The method comprises the following steps: acquiring a miRNA-disease association matrix (an adjacency matrix), a miRNA functional similarity matrix and a disease semantic similarity matrix; and, based on a similarity-constrained objective function, taking the adjacency matrix, the miRNA functional similarity matrix and the disease semantic similarity matrix as training data and performing adaptive learning to obtain a new miRNA-disease association matrix. The invention applies similarity constraint learning in the miRNA space and the disease space simultaneously to reveal associations between miRNAs and diseases, and has good prediction performance and robustness.

Description

Similarity constraint-based miRNA-disease association prediction method and system

Technical Field

The invention belongs to the technical field of computer-aided disease diagnosis, and particularly relates to a miRNA-disease association prediction method and system based on similarity constraint.

Background

miRNAs are small non-coding RNA molecules, about 20-24 nucleotides in length, found in living organisms. They regulate the life processes of an organism by inducing degradation or translational inhibition of messenger RNA (mRNA). In recent years, a large number of studies have shown that miRNAs play important roles in immune response, transcription, cell proliferation, cell differentiation, signal transduction, embryonic development, and the like. Mutation and dysfunction of miRNAs can cause various diseases, and miRNAs play a particularly significant role in the diagnosis, treatment and prognosis of cancer. Identifying miRNA-disease associations refers to semi-supervised learning on existing biological data (including but not limited to miRNA functional similarity data, disease semantic similarity data and known human miRNA-disease association data): through training and iteration on these data, a prediction model is obtained that can predict and mine new associations.

In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:

Existing mining methods based on biological experiments suffer from high experimental cost, long experimental cycles and heavy consumption of manpower and resources. Meanwhile, with the rapid development of information technology and sequencing technology, all types of biological data are growing explosively. Traditional experimental methods cannot quickly mine general patterns and useful information from such massive biological data, which greatly hinders the mining of miRNA-disease associations in bioinformatics.

Many methods based on computational models have been proposed, including graph topological similarity methods based on semi-supervised learning, machine learning-based methods and graph neural network-based methods. They achieve good prediction performance and can be applied to relationship mining in large-scale biological databases. However, they share some common problems and challenges. First, experimental data are noisy, and too little useful data can be extracted, so the models are insufficiently trained to identify new associations. Second, the massive data are biological data of different dimensions and different characteristics, and how to exploit such cross-source heterogeneous data is also a great challenge.

Disclosure of Invention

In order to solve the problems, the invention discloses a miRNA-disease association relationship mining method and system based on similarity information constraint.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

a miRNA-disease association prediction method based on similarity constraint comprises the following steps:

acquiring a miRNA-disease association matrix, a miRNA functional similarity matrix and a disease semantic similarity matrix;

based on a similarity-constrained objective function, taking the adjacency matrix, the miRNA functional similarity matrix and the disease semantic similarity matrix as training data, and performing adaptive learning to obtain a new miRNA-disease association matrix.

Further, obtaining the miRNA-disease association matrix includes: acquiring relation data of miRNAs and diseases, and constructing an adjacency matrix.

Further, obtaining the disease semantic similarity matrix comprises:

acquiring disease semantic data, and constructing a directed acyclic graph, wherein nodes represent diseases, and directed edges among the nodes represent hierarchical relations among the diseases;

calculating the semantic similarity between diseases, using the accumulated sum of the contribution values of a node's ancestor nodes to that node as the semantic value of the node, to obtain a disease semantic similarity matrix.

Further, the objective function of the similarity constraint is:

wherein SM represents the new miRNA functional similarity matrix, SD represents the new disease semantic similarity matrix, F represents the new miRNA-disease association matrix, AM represents the miRNA functional similarity matrix, AD represents the disease semantic similarity matrix, and $F_i$ and $F_j$ respectively represent the association vectors of the i-th miRNA and the j-th miRNA with all diseases.

One or more embodiments provide a miRNA-disease association prediction system based on similarity constraints, comprising:

a true association data acquisition module configured to: acquiring a miRNA-disease association matrix;

a similarity matrix acquisition module configured to: acquiring a miRNA functional similarity matrix and a disease semantic similarity matrix;

an adaptive learning prediction module configured to: based on a similarity-constrained objective function, taking the adjacency matrix, the miRNA functional similarity matrix and the disease semantic similarity matrix as training data, and performing adaptive learning to obtain a new miRNA-disease association matrix.

Further, obtaining the disease semantic similarity matrix comprises:

Further, the objective function of the similarity constraint is:

One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the miRNA-disease association prediction method when executing the program.

One or more embodiments provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the miRNA-disease association prediction method.

One or more technical schemes have the following technical effects:

Considering the sparsity and incompleteness of the disease semantic similarity matrix and the miRNA functional similarity matrix, the technical scheme learns new miRNA and disease similarity networks from the known similarity information, which ensures that the associations between miRNAs and diseases can subsequently be fully mined.

Compared with traditional prediction methods, the framework provides more stable and better prediction performance, and can be used to predict associations for new diseases.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is a framework diagram of the miRNA-disease association prediction method based on similarity constraint in an embodiment of the present invention;

FIG. 2 is a graph of the distribution of all diseases in an embodiment of the present invention;

FIG. 3 is a complex disease relationship network constructed in an embodiment of the present invention;

FIG. 4 is a diagram of the DAG computation model in an embodiment of the present invention;

FIG. 5 is a heterogeneous graph model in an embodiment of the present invention.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example one

The embodiment discloses a miRNA-disease association prediction method based on similarity constraint, which predicts associations between diseases and miRNAs by learning a miRNA similarity network and a disease similarity network with a similarity constraint learning method. As shown in FIG. 1, the method specifically comprises the following steps:

Step 1: constructing a miRNA-disease association network from the relation data of miRNAs and diseases to obtain an adjacency matrix; constructing a miRNA functional similarity network from the functional similarity between miRNAs to obtain a miRNA functional similarity matrix; and constructing a disease semantic similarity network from the semantic similarity between diseases to obtain a disease semantic similarity matrix.

(1) construction of miRNA-disease association network

In this example, the HMDD v2.0 database was used to construct the heterogeneous network of miRNAs and diseases. The miRNA-disease association data can be obtained directly from the HMDD v2.0 homepage. The data set contains 495 miRNAs and 383 diseases, with 5340 interactions between them. We use $M = \{m_1, m_2, \ldots, m_{nm}\}$ and $D = \{d_1, d_2, \ldots, d_{nd}\}$ to represent the miRNA set and the disease set, respectively. For a more precise representation, we use a matrix $Y$ to represent the adjacency matrix formed by miRNAs and diseases. Specifically, if disease $d_i$ is associated with miRNA $m_j$, the entry $Y_{ij}$ of the adjacency matrix is set to 1; otherwise it is set to 0. Thus, the i-th row of the adjacency matrix Y is the feature vector of disease $d_i$ with respect to all miRNAs, and the j-th column of Y is the feature vector of miRNA $m_j$ with respect to all diseases. As shown in FIG. 2, the HMDD data set contains 15 types of diseases in total, including more than 100 cancer-related diseases, which provides a solid data basis for studying the associations between miRNAs and diseases.
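The construction of the adjacency matrix described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `build_adjacency`, the toy disease and miRNA labels, and the use of NumPy are assumptions made for the example. Rows index diseases and columns index miRNAs, as in the paragraph above.

```python
import numpy as np

def build_adjacency(pairs, diseases, mirnas):
    """Return a 0/1 matrix Y with Y[i, j] = 1 if disease i is associated with miRNA j."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    Y = np.zeros((len(diseases), len(mirnas)), dtype=float)
    for disease, mirna in pairs:
        Y[d_index[disease], m_index[mirna]] = 1.0
    return Y

# Toy data (hypothetical, for illustration only).
diseases = ["lung neoplasms", "breast neoplasms"]
mirnas = ["hsa-mir-21", "hsa-mir-155", "hsa-let-7a-2"]
pairs = [
    ("lung neoplasms", "hsa-mir-21"),
    ("lung neoplasms", "hsa-mir-155"),
    ("breast neoplasms", "hsa-mir-21"),
]
Y = build_adjacency(pairs, diseases, mirnas)
```

If a later step needs miRNAs as rows and diseases as columns, `Y.T` gives that orientation.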

(2) Construction of functional similarity network of miRNA

If multiple miRNAs are functionally similar, they may cause the same disease. Conversely, if multiple diseases occur simultaneously, they are likely to be caused by abnormal expression of functionally similar miRNAs. Based on these assumptions, the similarity information of the miRNA similarity network can be calculated; the data can be downloaded directly from the Internet (http://www.cuilab.cn/files/images/cuilab/misim.zip). We use the matrix AM to represent the miRNA functional similarity matrix, where the entry $AM_{ij}$ represents the functional similarity between miRNA $m_i$ and miRNA $m_j$.

(3) Building semantic similarity networks for diseases

The MeSH database is a disease classification database widely used by researchers, and can be used for mining potential relations among diseases. In recent years, the database has also been used to construct heterogeneous networks of miRNAs and diseases, with good results. To build the semantic similarity network of diseases, the data set can be downloaded from the Internet (http://www.ncbi.nlm.nih.gov/). According to the definition of diseases in the MeSH database, each disease can be represented as a directed acyclic graph (DAG), where nodes represent disease keywords. Hierarchical relationships or semantic association information between diseases are described by the directed edges between nodes. For any disease D, the semantic relationship between D and other diseases is expressed as $DAG(D) = (D, T(D), E(D))$, where $T(D)$ is the node set containing D and all of its ancestor nodes, and $E(D)$ is the edge set containing all edges connecting these nodes. Thus, the more terms two diseases share in the DAG model, the more semantically similar they are. According to the above definition, the contribution of a disease t in $DAG(D)$ to the semantic value of disease D can be calculated as follows:

Here we define Δ as the semantic contribution parameter. After extensive testing, we found that setting Δ = 0.5 is most suitable. The semantic contribution includes two parts: the contribution from the disease itself and the contributions from other diseases. For disease D, the contribution from itself is set to 1, and the contribution of another disease $D_j$ decreases as the distance between D and $D_j$ increases. Thus, the semantic value of disease D is calculated according to the following formula:

$$DV(D) = \sum_{t \in T(D)} D_D(t) \tag{2}$$

Then, we define the semantic similarity between disease $d_i$ and disease $d_j$ as follows:

According to equation (3), we can compute the disease semantic similarity matrix AD based on the semantic similarity network of diseases, where $AD_{ij}$ indicates the semantic similarity between disease $d_i$ and disease $d_j$. As shown in FIG. 3, all diseases constitute a disease similarity network in which diseases of the same color form a disease cluster. In addition, the disease similarity network indicates that two diseases in the same disease cluster have similar semantic similarity. The process of calculating the semantic similarity of diseases based on the DAG model is shown in FIG. 4.
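The DAG-based computation can be illustrated with a short Python sketch. The formulas for equations (1) and (3) are not reproduced in this text, so the sketch assumes the formulation commonly used in the miRNA-disease literature: a disease contributes 1 to itself, an ancestor's contribution is Δ times the largest contribution among its children in the DAG, and the similarity of two diseases is the summed contributions of their shared nodes divided by the sum of the two semantic values. The function names, the `parents` dictionary layout and the toy DAG labels are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Map each node t in DAG(disease) to its assumed contribution D_D(t)."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-node contributions divided by DV(d1) + DV(d2), as in equation (2)."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Toy hierarchy (hypothetical labels): child -> list of parents.
parents = {"lung": ["site"], "breast": ["site"], "site": ["root"]}
sim = semantic_similarity("lung", "breast", parents)
```

In this toy hierarchy each disease has semantic value 1 + 0.5 + 0.25 = 1.75, the two diseases share the nodes `site` and `root`, and the similarity is 1.5 / 3.5.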

Step 2: based on a similarity-constrained objective function, taking the adjacency matrix, the miRNA functional similarity matrix and the disease semantic similarity matrix as training data, and performing adaptive learning to obtain a new miRNA-disease association matrix.

In order to effectively identify the associations between miRNAs and diseases, a novel prediction method based on robust similarity constraint learning, RSCMDA, is provided. Our goal is to obtain a reliable indicator matrix F that reflects the association probability between miRNAs and diseases. The objective function needs to satisfy the following two conditions: (1) given an initial similarity matrix, a new, more informative similarity matrix is learned adaptively; (2) both the miRNA space and the disease space are considered in the learning process. For the first requirement, to avoid the case where some rows of the learned matrix are all zero, we add a constraint that each row of the learned matrix sums to one. Therefore, we first define the optimization function in the miRNA space using miRNA similarity constraint learning as follows:

As can be seen from equation (4), the first term means that the learned new miRNA similarity matrix SM should be close to the miRNA functional similarity matrix AM. The second term indicates that the greater the similarity between miRNA i and miRNA j, the smaller the difference between their feature vectors. Similarly, we can define the objective function in the disease space as follows:

As can be seen from equation (5), the first term means that the learned new disease similarity matrix SD should be close to the disease semantic similarity matrix AD. The second term indicates that the greater the similarity between disease i and disease j, the smaller the difference between their feature vectors. With equation (4) and equation (5), we learn two optimal similarity matrices, one for miRNAs and one for diseases, by similarity constraint learning. Then, according to the second requirement, the two optimization functions are unified in a single similarity constraint framework. The overall optimization formula is as follows:

As can be seen from equation (6), the first two terms adaptively learn a new similarity matrix in the miRNA space, the third and fourth terms learn a new similarity matrix in the disease space, and the last term constrains the predicted miRNA-disease associations to be consistent with the ground truth. The problem solved by the model is illustrated in FIG. 5: unknown associations are mined and predicted based on a heterogeneous graph model.

As shown in equation (6), the objective function involves three coupled variables, SD, SM and F, which are difficult to optimize simultaneously. We therefore design an efficient alternating optimization algorithm that solves the problem iteratively. Specifically, we optimize one variable while keeping the other variables fixed.

(1) Update SM with SD and F fixed. From equation (4), an iterative formula for SM can be derived, which can be written in the following form:

Note that the optimization of each column of SM is independent for different values of i; thus, we can update each column separately as follows:

where $d_{ij}$ denotes the j-th element of the vector $d_i$. Equation (8) can be written in vector form as follows:

where $SM_i$ and $AM_i$ denote the i-th columns of SM and AM, respectively. This problem can be solved by an efficient iterative algorithm.
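The vector-form subproblem itself is not reproduced in this text. Purely as an illustration, the sketch below assumes that each column update amounts to a Euclidean projection onto the probability simplex, that is, minimizing $\|s - v\|^2$ subject to $s \ge 0$ and $\sum_j s_j = 1$ for some target vector $v$ built from $AM_i$ and $d_i$. The non-negativity constraint, the exact form of $v$ and the function name `project_to_simplex` are assumptions, not statements from the source; the sort-based projection used here is a standard technique and may differ from the iterative algorithm the authors use.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1} (sort-based method)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                 # entries in descending order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    # Largest index for which the shifted entry stays positive.
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

# Hypothetical target vector for one column (illustration only).
v = np.array([0.6, 0.3, -0.2, 0.5])
s = project_to_simplex(v)
```

The returned vector is non-negative and sums to one, which matches the normalization constraint described for the learned similarity matrix.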

(2) Update SD with SM and F fixed. From equation (5), an iterative optimization formula for SD can be derived. The optimization process for SD is the same as that for SM, and the objective function for optimizing SD is as follows:

(3) Update F with SM and SD fixed. By introducing Laplacian matrices, equation (6) is converted to:

where $L_{SM} = D_{SM} - (SM^T + SM)/2$ is the corresponding Laplacian matrix, $D_{SM}$ is the degree matrix, defined as the diagonal matrix whose i-th diagonal element is $\sum_j (SM_{ij} + SM_{ji})/2$, and $L_{SD}$ is defined in the same way as $L_{SM}$.

By differentiating equation (11) with respect to F and setting the derivative to zero, we obtain:

$$(\alpha L_{SM} + \gamma I)F + \beta F L_{SD} - \gamma Y = 0 \tag{12}$$

Equation (12) is a Sylvester equation, which can be solved with standard methods. Algorithm 1 summarizes the overall process of the method.
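Equation (12) has the standard Sylvester form $AX + XB = Q$ with $A = \alpha L_{SM} + \gamma I$, $B = \beta L_{SD}$ and $Q = \gamma Y$. The F update can therefore be sketched with SciPy's `scipy.linalg.solve_sylvester`, assuming that F and Y are arranged with miRNAs as rows and diseases as columns so that the products with $L_{SM}$ and $L_{SD}$ are well defined. The function names `laplacian` and `update_F` are illustrative; the default parameter values are those listed in Table 2.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def laplacian(S):
    """L = D - (S^T + S)/2, where D is diagonal with D_ii = sum_j (S_ij + S_ji)/2."""
    W = (S.T + S) / 2.0
    return np.diag(W.sum(axis=1)) - W

def update_F(SM, SD, Y, alpha=1e-4, beta=1e-4, gamma=1.0):
    """Solve (alpha*L_SM + gamma*I) F + beta * F L_SD = gamma * Y for F."""
    A = alpha * laplacian(SM) + gamma * np.eye(SM.shape[0])
    B = beta * laplacian(SD)
    return solve_sylvester(A, B, gamma * Y)

# Toy example with random similarities (illustration only).
rng = np.random.default_rng(0)
SM = rng.random((4, 4))                      # 4 miRNAs
SD = rng.random((3, 3))                      # 3 diseases
Y = (rng.random((4, 3)) > 0.5).astype(float)
F = update_F(SM, SD, Y, alpha=0.3, beta=0.2, gamma=1.0)
```

With non-negative similarities both Laplacians are positive semidefinite, so for γ > 0 the matrices A and −B share no eigenvalue and the equation has a unique solution.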

Algorithm 1 RSCMDA solving process

Lung cancer, also known as bronchogenic carcinoma, is a common primary malignant tumor of the lung. In recent years, the incidence of cancer worldwide has increased year by year. Therefore, early detection of prognostic and predictive biomarkers associated with lung tumors is of great significance for the treatment of lung tumors. The role of miRNAs in lung tumor cell progression and drug resistance has been extensively studied. For example, various miRNAs such as hsa-mir-155, hsa-mir-17-3p, hsa-let-7a-2, hsa-mir-145 and hsa-mir-21 were found to be differentially expressed between LN tissue and the corresponding non-cancerous lung tissue, and have been used for further diagnosis and clinical treatment. We used the miRNA-disease association information provided in HMDD v2.0 as training data, and then used the RSCMDA model to predict the top 50 miRNAs most strongly associated with lung tumors [45]. The predicted results were then validated against four additional disease-related miRNA databases. As shown in Table 1, each of the top 50 predicted miRNAs was confirmed by at least one of dbDEMC, miR2Disease, miRwayDB and PhenomiR to be associated with lung tumors. These results show that RSCMDA can accurately predict associations between miRNAs and diseases.

Table 1 Top 50 lung tumor-associated miRNAs predicted from the known associations in HMDD

Here, I, II, III and IV represent dbDEMC, miR2Disease, miRwayDB and PhenomiR, respectively. The first and third columns list the miRNAs ranked 1-25 and 26-50, respectively.

Example two

The embodiment provides a miRNA-disease association prediction system based on similarity constraint, including:

Example three

The embodiment aims at providing an electronic device.

An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program, comprising:

Example four

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, performs the steps of:

Experimental verification

To verify the predictive power of RSCMDA, we measured its prediction performance from multiple angles using different evaluation criteria.

1. Experimental setup

After repeated tuning experiments, the proposed RSCMDA model uses the following parameter settings.

Table 2 Parameter configuration

Parameter	Value
α	1e-4
β	1e-4
γ	1

2. Results of the experiment

We used leave-one-out cross-validation (LOOCV) to evaluate the performance of RSCMDA. LOOCV can be divided into two categories: global LOOCV and local LOOCV. What they have in common is that one known miRNA-disease association is held out at a time as the test sample and treated as unknown, and RSCMDA is then used for prediction. After the prediction result is obtained, the score of the test sample is compared with the scores of the unknown samples, and the scores are ranked from high to low. To describe the experimental results more intuitively, the performance of RSCMDA was compared with that of other advanced methods using the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). Taking the miRNA-disease association prediction here as an example, for each threshold K (0 < K < 100), the true positive rate is the proportion of held-out known associations that rank within the top K% of the prediction results, and the false positive rate is the proportion of unknown associations that rank within the top K%. To compare models more directly, the area under the ROC curve (AUC) was used as the criterion for measuring prediction performance.
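As a small illustration of the evaluation criterion, the AUC can be computed directly from scores without drawing the curve: it equals the probability that a held-out known association receives a higher score than an unknown pair, with ties counted as one half. The sketch below assumes NumPy; the function name `roc_auc` and the toy scores are hypothetical and do not reproduce the experiments reported here.

```python
import numpy as np

def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))

# Toy scores: held-out known associations vs. unknown pairs (illustration only).
auc = roc_auc([0.9, 0.3], [0.5, 0.1])
```

Here three of the four (positive, negative) pairs are ordered correctly, so `auc` is 0.75.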

We compared RSCMDA with seven other prediction frameworks: EGBMMDA, MCMDA, HGIMDA, PBMDA, WBSMDA, HDMP and RLSMDA. The experimental results showed that, in global LOOCV, the AUCs of EGBMMDA, MCMDA, HGIMDA, PBMDA, WBSMDA, HDMP and RLSMDA were 0.9123, 0.8749, 0.8781, 0.9169, 0.8030, 0.8366 and 0.8426, respectively (fig. 4). Under the local LOOCV framework (fig. 5), they obtained AUCs of 0.8221, 0.7718, 0.8077, 0.8341, 0.8031, 0.7702 and 0.6953, respectively.

Next, leave-one-disease-out cross-validation (LODOCV) was used to verify the performance of RSCMDA in inferring miRNAs for new diseases that have no known associated miRNAs in the heterogeneous network. Specifically, for any candidate disease, we remove all known information about its associated miRNAs and then use the association information of the other diseases to prioritize all candidate miRNAs. Because no association information is available for the disease under investigation, LODOCV is more rigorous than the cross-validation frameworks described above and better able to assess the risk of overfitting. Under the LODOCV framework, we also use AUC values to evaluate all methods. Our method achieved the highest AUC value among all methods, 0.815. We do not show the performance of the other compared methods because they all yielded AUC values below 0.5.

Finally, to further demonstrate the prediction and generalization capabilities of RSCMDA on real data sets, we applied RSCMDA to the old version of HMDD (v1.0) and then used the newer version of HMDD (v2.0) to validate the predicted miRNA-disease associations. Specifically, on the screened HMDD (v1.0) data, for each compared method we selected the top N predicted miRNAs, with N ranging from 2000 to 10000 in steps of 2000, and then counted the confirmed true candidates recorded in HMDD (v2.0). RSCMDA identified more disease-associated miRNAs than the other five computational methods. In conclusion, the verification results show that RSCMDA can accurately mine disease-related miRNAs.
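The top-N validation described above can be sketched as follows, under the assumption that the old and new database versions are given as 0/1 matrices over the same miRNAs and diseases. The function name `confirmed_in_top_n` and the toy matrices are hypothetical; the real experiment uses HMDD v1.0 and v2.0.

```python
import numpy as np

def confirmed_in_top_n(scores, Y_old, Y_new, n):
    """Count how many of the n highest-scoring pairs unknown in Y_old are recorded in Y_new."""
    unknown = Y_old == 0
    candidates = np.argwhere(unknown)              # (row, col) of pairs unknown in the old version
    order = np.argsort(-scores[unknown], kind="stable")[:n]
    top = candidates[order]
    return int(Y_new[top[:, 0], top[:, 1]].sum())

# Toy matrices (illustration only).
Y_old = np.array([[1, 0], [0, 0]])
Y_new = np.array([[1, 1], [0, 1]])
scores = np.array([[0.9, 0.8], [0.1, 0.7]])
counts = [confirmed_in_top_n(scores, Y_old, Y_new, n) for n in (1, 2, 3)]
```

Calling the function for increasing N, as in the experiment, gives a curve of confirmed associations per method that can be compared across methods.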

In one or more of the above embodiments, based on the assumption that miRNAs with similar functions often cause the same disease, the proposed similarity constraint-based objective function is solved using the miRNA-disease association network, the miRNA functional similarity network and the disease semantic similarity network, so that high-precision miRNA-disease association prediction is realized. Specifically, the method adaptively learns new, more informative similarity networks from the known similarity information during optimization. Furthermore, a unified constraint framework is proposed that updates the prediction results from the miRNA space and the disease space simultaneously, rather than learning the results separately in each space, which provides more robust performance.

Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A miRNA-disease association prediction method based on similarity constraint is characterized by comprising the following steps:

2. The miRNA-disease association prediction method of claim 1, wherein obtaining the miRNA-disease association matrix comprises: acquiring relation data of miRNAs and diseases, and constructing an adjacency matrix.

3. The miRNA-disease association prediction method of claim 1, wherein obtaining a disease semantic similarity matrix comprises:

4. The miRNA-disease association prediction method of claim 1, wherein the similarity-constrained objective function is:

5. A miRNA-disease association prediction system based on similarity constraints, comprising:

6. The miRNA-disease association prediction system of claim 5, wherein obtaining the miRNA-disease association matrix comprises: acquiring relation data of miRNAs and diseases, and constructing an adjacency matrix.

7. The miRNA-disease association prediction system of claim 5, wherein obtaining a disease semantic similarity matrix comprises:

8. The miRNA-disease association prediction system of claim 5, wherein the similarity-constrained objective function is:

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the miRNA-disease association prediction method of any one of claims 1-4.

10. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the miRNA-disease association prediction method according to any one of claims 1-4.