CN110782948A

CN110782948A - Method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition method

Info

Publication number: CN110782948A
Application number: CN201910997340.9A
Authority: CN
Inventors: 卢新国; 陈关元; 朱正浩; 李金鑫; 丁莉
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2020-02-11

Abstract

The invention relates to data mining in bioinformatics, in particular to the mining of disease bioinformatics data and miRNA gene data, and more specifically to predicting potential associations between miRNAs and diseases through data mining of miRNA and disease biological information. The method of the invention comprises: processing of miRNA and disease association data; analysis of disease similarity; analysis of miRNA similarity; learning of neighborhood-based associations between miRNAs and diseases; constraint probability matrix decomposition of disease-related miRNAs; and prediction of potential associations between miRNAs and diseases. The invention can be used to predict potential associations between miRNAs and diseases and to detect potential disease-associated miRNAs.

Description

Method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition method

Technical Field

The invention relates to data mining in bioinformatics, in particular to the mining of disease bioinformatics data and miRNA gene data, and more specifically to predicting potential associations between miRNAs and diseases through data mining of miRNA and disease biological information.

Background

MicroRNAs (miRNAs) are small non-coding RNAs about 20-24 nucleotides long. They participate in various biological processes such as cell proliferation, development and apoptosis, and play an important role in regulating the cell cycle and the development of organisms. They regulate gene expression through cleavage or translational inhibition by recognizing complementary target sites in the untranslated region of the mRNA. With the development of high-throughput sequencing technology, it has been found that miRNAs can regulate cell behavior, cause abnormal physiological responses in individuals, and lead to the development of complex diseases. Finding disease-related miRNAs can therefore improve disease diagnosis and help patients recover. However, biological experiments for verifying predicted disease-related miRNAs are very expensive and have a high failure rate, so a large number of disease-related miRNAs remain unknown.

In recent years, with the development of miRNA-related databases, many computational methods have been proposed to predict unknown disease-associated miRNAs. The first is a supervised learning method that predicts potential disease-related miRNAs by finding the most similar miRNA neighbors. The second constructs a vector space model based on singular value decomposition (SVD) that estimates miRNA-disease associations by considering multiple miRNA-related information sources. The third is the WBSMDA method, which integrates miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for diseases and miRNAs. The fourth is the MFSP prediction method, which calculates miRNA functional similarity based on pathways. A common feature of these machine learning-based and similarity network-based models is that they require known miRNA-disease associations in their implementation.

In summary, a gap remains for new diseases or miRNAs because of the lack of sufficient experimentally validated interactions, i.e. the sparsity and imbalance of the heterogeneous omics data. Only a few methods can find potential associations for new diseases or miRNAs. Achieving strong performance in the prioritization of disease miRNAs therefore remains a challenge.

Disclosure of Invention

The present invention addresses the problems of the above methods, the pressing need for computational detection of disease-related miRNAs, and the difficulty of prioritizing disease miRNAs. By combining a matrix decomposition method with a data mining algorithm, the invention provides a miRNA-disease association prediction method based on probability matrix decomposition. The method predicts interactions between miRNAs and diseases more accurately and helps to find new miRNA-disease interactions in the same database. The method comprises the following steps:

1. Data collection phase

The method downloads descriptors from MeSH, which is divided into 16 categories: anatomical terms (A), organisms (B), diseases (C), chemicals and drugs (D), etc. Based on directed acyclic graphs (DAGs), the relationships between diseases are obtained from category C. Since DAGs describe only semantic relationships, the method also requires the disease-associated gene sets in the Disease Ontology (DO) to measure the functional similarity of diseases through their associated genes. In addition, the method obtains a weighted gene network from HumanNet, which contains more than 500,000 interactions among 17929 genes and is used to calculate functional similarity. Finally, the method obtained 5186 associations from the HMDD2.0 database, involving 328 diseases and 495 miRNAs.

2. Stage of disease similarity analysis

The method uses the MeSH disease hierarchical directed acyclic graph to compute the semantic similarity of disease pairs. A DAG may be represented as DAG(d, TD, ED), where d represents a disease node, TD represents the disease set, and ED represents the set of links (edges) in the graph. The method defines the semantic contribution of a disease k in the DAG to the disease d as:

where Δ is the semantic contribution factor of the edge in ED connecting disease d and its child disease k. To better measure semantic contributions in terms of distance, the method sets the value of Δ to 0.5. In its own DAG, disease d is the most specific disease, so its semantic contribution to itself is set to 1. Subsequently, the semantic value of disease d is obtained as follows:

D(d) = ∑ D_d(k_i), k_i ∈ K;

From the above, it can be seen that the more ancestors two diseases share in their DAGs, the more similar they are, so the following function is derived for the similarity between disease pairs:
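The formula images for this stage are not reproduced in this text. As a rough illustration only, the sketch below assumes the widely used DAG-based formulation in which the contribution of an ancestor is Δ times the largest contribution among its children, and the similarity of two diseases is the summed contribution of their shared DAG terms divided by the sum of their semantic values. The function names and the toy hierarchy are hypothetical and are not taken from the patent.

```python
from typing import Dict, List

DELTA = 0.5  # semantic contribution factor stated in the text


def semantic_contributions(disease: str, parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Return D_d(k) for disease d itself and every ancestor k in its DAG.

    Assumed rule: D_d(d) = 1 and D_d(k) = max(DELTA * D_d(child)) over children of k.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = DELTA * contrib[node]
            # Keep the largest contribution over all paths from d to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1: str, d2: str, parents: Dict[str, List[str]]) -> float:
    """Shared-ancestor similarity in [0, 1] (assumed form, see lead-in)."""
    c1 = semantic_contributions(d1, parents)
    c2 = semantic_contributions(d2, parents)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    # Denominator is D(d1) + D(d2), the two semantic values.
    return numerator / (sum(c1.values()) + sum(c2.values()))


# Hypothetical toy hierarchy: child -> list of parents.
toy_parents = {
    "breast neoplasms": ["neoplasms by site"],
    "lung neoplasms": ["neoplasms by site"],
    "neoplasms by site": ["neoplasms"],
}
toy_similarity = semantic_similarity("breast neoplasms", "lung neoplasms", toy_parents)
```

In this toy hierarchy each disease has semantic value 1 + 0.5 + 0.25 = 1.75, and the two shared ancestors contribute 1.5 in total, so the similarity is 1.5 / 3.5.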

The method also measures disease similarity through the functional similarity of disease-related genes. The disease ontology and disease-related genes are integrated to obtain a second disease similarity, as shown below:

where DO1 is the group associated with disease D1, and DO2 is the corresponding group for disease D2. The resulting disease similarity lies in the range from 0 to 1. Finally, the two similarity networks are fused to obtain the disease similarity network:

3. Stage of miRNA similarity analysis

From the above analysis of disease similarity, it follows that the more related diseases two miRNAs share, the greater their functional similarity. The method therefore estimates miRNA similarity from the contributions of the related diseases. The similarity between a disease D and a disease group DC can be expressed as:

For the similarity between two miRNAs m1 and m2, let dc1 and dc2 be the disease groups associated with m1 and m2, respectively. The similarity of m1 and m2 is then:
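The two formulas referred to above are missing from this text. A minimal sketch follows, assuming the common best-match-average scheme: the similarity of a disease to a group is its highest similarity to any member, and the similarity of two miRNAs averages these best matches over both disease groups. This may differ from the patent's exact definition. The helper names and the toy similarity table are hypothetical.

```python
from typing import Callable, Sequence

Similarity = Callable[[str, str], float]


def disease_to_group(d: str, group: Sequence[str], sim: Similarity) -> float:
    """Similarity of disease d to a disease group: best match within the group (assumed)."""
    return max(sim(d, g) for g in group)


def mirna_functional_similarity(dc1: Sequence[str], dc2: Sequence[str], sim: Similarity) -> float:
    """Average of best matches of dc1 against dc2 and of dc2 against dc1 (assumed form)."""
    total = sum(disease_to_group(d, dc2, sim) for d in dc1)
    total += sum(disease_to_group(d, dc1, sim) for d in dc2)
    return total / (len(dc1) + len(dc2))


# Hypothetical symmetric disease similarity table for three diseases A, B, C.
_table = {("A", "B"): 0.4, ("A", "C"): 0.2, ("B", "C"): 0.6}


def toy_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return _table.get((a, b), _table.get((b, a), 0.0))


# m1 is associated with {A, B}, m2 with {B, C}.
toy_mirna_similarity = mirna_functional_similarity(["A", "B"], ["B", "C"], toy_sim)
```

With these toy values the best matches are 0.4 and 1.0 for m1's diseases and 1.0 and 0.6 for m2's diseases, so the result is 3.0 / 4 = 0.75.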

The functional similarity derived from miRNA-associated diseases appears stable, but it depends on existing known associations and is therefore not very effective. To solve this problem, the method proposes a new way of measuring miRNA similarity that integrates validated miRNA-gene interactions with a weighted functional interaction network. Here, the method defines the value of each gene using log-likelihood scores:

where the gene set G comprises all genes in HumanNet that interact with gene g. Based on the above analysis, the method defines the functional similarity of miRNAs m1 and m2, calculated as follows:

Finally, the miRNA similarity obtained by the method is as follows:

4. Learning neighborhood-based associations of miRNAs with disease

MD denotes the adjacency matrix of the original association network: MD_ij = 1 if miRNA m_i has a known association with disease d_j, and MD_ij = 0 otherwise. However, MD is very sparse, and the entries for new miRNAs or diseases with no known associations are all zero, which may lead to poor performance in predicting potential associations between miRNAs and diseases. The method addresses these problems with a weighted K nearest neighbor algorithm:

where m_1 to m_K are the K nearest neighbors of m_l, sorted in descending order of similarity to m_l; θ_i is the weight coefficient, and μ ∈ [0,1]. Using the same method, the interaction profile for diseases is defined as follows:

Subsequently, MD is updated so that the new association values replace the zero entries (MD_ij = 0) of the original matrix:

MD = max(MD, MD′);
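The weighting formulas for this step are not reproduced in this text. The sketch below is one plausible reading, assuming that each miRNA (or disease) profile is rebuilt from its K most similar neighbors with weights that decay geometrically with neighbor rank and scale with similarity, and that the miRNA-side and disease-side estimates are averaged before the element-wise maximum stated above. The averaging step, the normalization and all names (`wknn_profile`, `wknn_update`, `decay`) are assumptions for illustration.

```python
import numpy as np


def wknn_profile(MD: np.ndarray, S: np.ndarray, K: int, decay: float) -> np.ndarray:
    """Rebuild each row of MD from the rows of its K most similar neighbors.

    Rows of MD are entities (miRNAs or diseases); S is their similarity matrix.
    """
    n = MD.shape[0]
    out = np.zeros(MD.shape, dtype=float)
    for l in range(n):
        sims = S[l].astype(float)
        sims[l] = -np.inf  # never use the entity as its own neighbor
        neighbors = np.argsort(-sims)[: min(K, n - 1)]  # descending similarity
        # Assumed weights: decay^(rank) * similarity, normalized by total similarity.
        weights = decay ** np.arange(len(neighbors)) * S[l, neighbors]
        norm = S[l, neighbors].sum()
        if norm > 0:
            out[l] = weights @ MD[neighbors] / norm
    return out


def wknn_update(MD: np.ndarray, S_m: np.ndarray, S_d: np.ndarray,
                K: int = 2, decay: float = 0.5) -> np.ndarray:
    """Return max(MD, MD') where MD' averages miRNA-side and disease-side estimates."""
    MD = MD.astype(float)
    MD_m = wknn_profile(MD, S_m, K, decay)        # neighbors among miRNAs
    MD_d = wknn_profile(MD.T, S_d, K, decay).T    # neighbors among diseases
    MD_new = (MD_m + MD_d) / 2.0                  # assumption: simple average
    return np.maximum(MD, MD_new)


# Toy data: 3 miRNAs x 2 diseases; the third miRNA has no known association.
toy_MD = np.array([[1, 0], [0, 1], [0, 0]])
toy_S_m = np.array([[1.0, 0.8, 0.6], [0.8, 1.0, 0.3], [0.6, 0.3, 1.0]])
toy_S_d = np.array([[1.0, 0.5], [0.5, 1.0]])
toy_updated = wknn_update(toy_MD, toy_S_m, toy_S_d, K=2, decay=0.5)
```

In the toy example the all-zero row of the third miRNA receives non-zero scores borrowed from its two neighbors, while the known associations keep their value of 1.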

5. Probability matrix decomposition of disease-related miRNAs

The purpose of standard probability matrix decomposition is to model the original matrix as the product of two low-rank matrices, a user matrix and an item matrix. Here the miRNA-disease matrix MD holds scores in the range 0 to 1, U and V are the latent miRNA and disease feature matrices, and their column vectors are the feature vectors. The conditional distribution of the observed values is expressed mathematically as follows:

where N(x | μ, σ^2) is the probability density function of the Gaussian distribution with mean μ and variance σ^2, and I_ij is an indicator function: I_ij is 1 if miRNA i is associated with disease j, and 0 otherwise. The miRNA and disease feature matrices are also given zero-mean spherical Gaussian priors:
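The likelihood and prior formulas referred to here are missing from this text. Assuming the method follows the standard probabilistic matrix factorization formulation from the literature, which matches the symbols described above, they would read roughly as below. The dimensions $N_m$ (number of miRNAs) and $N_d$ (number of diseases), the variances $\sigma_U^2$, $\sigma_V^2$ and the logistic link $g$ are notational assumptions, not taken from the patent.

```latex
% Conditional distribution of the observed scores (assumed standard PMF form)
p(MD \mid U, V, \sigma^2) = \prod_{i=1}^{N_m} \prod_{j=1}^{N_d}
  \left[ \mathcal{N}\!\left( MD_{ij} \mid g(U_i^{T} V_j),\, \sigma^2 \right) \right]^{I_{ij}},
\qquad g(x) = \frac{1}{1 + e^{-x}}

% Zero-mean spherical Gaussian priors on the feature vectors
p(U \mid \sigma_U^2) = \prod_{i=1}^{N_m} \mathcal{N}\!\left( U_i \mid 0,\, \sigma_U^2 \mathbf{I} \right),
\qquad
p(V \mid \sigma_V^2) = \prod_{j=1}^{N_d} \mathcal{N}\!\left( V_j \mid 0,\, \sigma_V^2 \mathbf{I} \right)
```

The logistic function $g$ is included because the scores are stated to lie between 0 and 1; if the method uses a plain linear model, $g$ would simply be the identity.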

When the model converges in standard probability matrix decomposition, the feature vectors of miRNAs with few known associations will be close to the prior mean, i.e. the typical miRNA. To prevent overfitting and significantly improve learning efficiency, a way of constraining the miRNA-specific feature vectors is provided, which has a strong effect on miRNAs with few known associations. The miRNA feature vectors are defined as follows:

where I is the observed indicator matrix: I_ij = 1 if miRNA i is associated with disease j, and 0 otherwise. Intuitively, a column of the W matrix reflects the effect of an miRNA's score for a given disease on the prior mean of the miRNA feature vector. The method then describes the conditional distribution of the observed scores by the following formula; maximizing this probability allows the system parameters U and V to be estimated from the existing observation matrix MD:

Analysis shows that the above formula satisfies the conditions for maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, and the log-posterior probability to be maximized is:

where C is a constant that does not depend on the parameters. Maximizing the log-posterior probability over U and V while keeping the hyper-parameters fixed is equivalent to minimizing the following objective function:

Therefore, the miRNA and disease feature vectors are updated by stochastic gradient descent until convergence. A predicted miRNA-disease association matrix MD is then obtained, and the disease-related miRNAs are ranked according to the entries of MD.
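Because the objective function itself is not reproduced here, the following is only a sketch of how a constrained probabilistic matrix factorization could be trained by stochastic gradient descent. It assumes the form known from the literature: each miRNA vector is an offset Y_i plus the mean of the W rows of its known diseases, predictions pass through a logistic function, and all three parameter matrices carry an L2 penalty. Training on every entry of MD (not only the known associations), the value `reg = 0.01` and all names are assumptions. The default learning rate of 0.02 is the one quoted in the experimental section.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def constrained_pmf(MD, I, rank=5, lr=0.02, reg=0.01, epochs=200, seed=0):
    """Fit MD ~ sigmoid(U V^T) with U_i = Y_i + mean of W_k over known diseases k of miRNA i."""
    rng = np.random.default_rng(seed)
    n_m, n_d = MD.shape
    Y = 0.1 * rng.standard_normal((n_m, rank))  # miRNA-specific offsets
    W = 0.1 * rng.standard_normal((n_d, rank))  # constraint matrix, one row per disease
    V = 0.1 * rng.standard_normal((n_d, rank))  # disease feature vectors
    known = [np.flatnonzero(I[i]) for i in range(n_m)]

    def mirna_vector(i):
        ks = known[i]
        return Y[i] + W[ks].mean(axis=0) if len(ks) else Y[i].copy()

    for _ in range(epochs):
        # Visit every (miRNA, disease) pair once per epoch in random order.
        for idx in rng.permutation(n_m * n_d):
            i, j = divmod(idx, n_d)
            ks = known[i]
            u, v = mirna_vector(i), V[j].copy()
            pred = sigmoid(u @ v)
            # Gradient of the squared error through the logistic function.
            g = (pred - MD[i, j]) * pred * (1.0 - pred)
            Y[i] -= lr * (g * v + reg * Y[i])
            V[j] -= lr * (g * u + reg * V[j])
            if len(ks):
                W[ks] -= lr * (g * v / len(ks) + reg * W[ks])

    U = np.stack([mirna_vector(i) for i in range(n_m)])
    return sigmoid(U @ V.T)  # predicted association scores in (0, 1)


# Toy 4 x 3 association matrix; larger step size so the small example converges quickly.
toy_assoc = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]], dtype=float)
toy_scores = constrained_pmf(toy_assoc, toy_assoc, rank=4, lr=0.2, epochs=500)
# Rank miRNAs for disease 0 by descending predicted score.
toy_ranking = np.argsort(-toy_scores[:, 0])
```

Sorting a column of the returned score matrix in descending order gives the candidate ranking for that disease, which corresponds to the final ranking step described above.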

Drawings

FIG. 1: Similarity measures for miRNAs and diseases

FIG. 2: wKNM-based miRNA disease association network updates

FIG. 3: CPMDM-based miRNA disease association prediction

FIG. 4: Association network of the top 10 miRNAs predicted by CPMDM for breast cancer, lung cancer and colorectal cancer

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to experiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

1. Experimental methods and parameter settings

In each repetition of 5-fold cross-validation (CV), for a given disease d, the known d-related miRNAs were randomly divided into five equal-sized subsets; one subset was used as the test set and the remaining four as the training set. CV experiments were performed on the training data set to estimate the parameters, and all parameter combinations were explored by grid search. RMSE and AUC were used as evaluation indices. The impact of the parameters on the predictive model was assessed by 5-fold CV on miRNA-disease associations. We use 30 latent features (K = 30) because this choice yields the best model performance on the validation set; K values in the range [20, 60] produced similar results. After trying different learning rates with different values of K, we chose a learning rate of 0.02, since this setting works well for all values of K. The neighborhood-based association learning approach is applied prior to implementing the normalized non-negative matrix factorization of the graph, to update the original association matrix MD; its purpose is to add more interaction information to help predict new miRNAs or diseases, as well as those with few known associations. The attenuation value θ is selected from 0.1, 0.2, 0.9, 1, and the neighborhood size K is selected from 1, 2, 5. The CPMDM algorithm is as follows:
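The algorithm listing is not reproduced in this text. Independently of it, the evaluation protocol described above can be sketched as follows: the known miRNAs of one disease are shuffled and split into five folds, and predictions are scored with AUC and RMSE. This is a generic illustration with hypothetical helper names, not the patent's evaluation code.

```python
import numpy as np


def five_fold_splits(known, n_folds=5, seed=0):
    """Yield (train, test) index arrays over the known miRNAs of one disease."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(known))
    folds = np.array_split(shuffled, n_folds)  # near-equal sizes if not divisible
    for f in range(n_folds):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        yield train, test


def auc(scores, labels):
    """Probability that a positive outranks a negative; ties count one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def rmse(pred, truth):
    pred, truth = np.asarray(pred, dtype=float), np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


# Hypothetical example: ten known miRNA indices for one disease.
toy_splits = list(five_fold_splits(range(10)))
toy_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

In a full experiment, the test-fold associations would be set to zero in MD before training, and the held-out miRNAs would be scored against the unlabeled candidates.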

2. Data set validation phase

Experiments were performed on HMDD3.2, miRCancer and dbDEMC to demonstrate the practical ability of CPMDM to predict miRNA-disease interactions. miRCancer provides a comprehensive collection of microRNA (miRNA) expression profiles in various human cancers, extracted automatically from the PubMed literature. All associations found have been confirmed manually after automatic extraction, so the accuracy has reached 100%. dbDEMC is an integrated database for storing and displaying miRNAs that are differentially expressed in human cancers as detected by high-throughput methods. In dbDEMC, a total of 209 newly published data sets were collected from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). HMDD3.2 contains 35547 manually collected miRNA-disease associations involving 893 diseases and 1206 miRNA genes.

3. Experiment and analysis of results

Here, we focus on the prediction of breast, lung and colorectal cancer. The miRNA and disease similarities were combined to build the prediction model, and 5-fold CV experiments were carried out. The data sets above were used to assess the utility of the model in predicting miRNA-disease interactions. As shown in the table below, 8 of the top 10 potential miRNA candidates detected by CPMDM could be confirmed. Of the top 10 potential miRNA candidates detected by CPMDM for each of the three diseases, 5, 4 and 4 were confirmed by HMDD to be related to breast cancer, lung cancer and colorectal cancer, respectively. Several candidates play an important role in the regulation process in different cancers. The association network of the top 10 predicted miRNA candidates for the three cancers is shown in FIG. 4, where some top candidates were observed to be associated with more than one disease.

Table 1: Top 10 potential miRNA candidates detected by CPMDM for the three diseases

HMDD* denotes the newest version of HMDD (http://www.cuilab.cn/hmdd).

The invention provides a novel computational model, CPMDM, for predicting potential miRNA-disease associations. Experimental results show that CPMDM effectively improves prediction performance. In addition, the case studies further confirm the accuracy of the results computed by the method and show that it is a powerful tool for revealing potential associations between new diseases and miRNAs.

Claims

1. A method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition, characterized by comprising the following implementation steps:

(1) collecting data including interrelations between diseases, disease-related gene sets in a disease ontology, interaction relation data between miRNA genes, and association data between diseases and miRNA;

(2) analyzing the similarity of diseases, calculating the semantic similarity of disease pairs by utilizing the MeSH disease hierarchical directed acyclic graph, measuring the functional similarity of the diseases, and fusing the two networks to obtain a disease similarity network;

(3) analyzing the similarity of the miRNAs, considering the contribution of diseases related to the miRNAs on the basis of processing a disease similarity network, calculating the similarity between the diseases and the disease groups, and obtaining the similarity of the miRNAs related to the two disease groups on the basis;

(4) learning the association between miRNA and disease based on the neighborhood, analyzing the association data between diseases and miRNA, constructing an adjacency matrix representing the association network, solving the sparsity problem of the adjacency matrix by using a weighted K nearest neighbor algorithm, and reconstructing a new adjacency matrix;

(5) carrying out probability matrix decomposition analysis on the disease-related miRNA, constructing feature vector matrices of the miRNA and the disease on the basis of the adjacency matrix, updating the miRNA and disease feature vectors by using stochastic gradient descent, and predicting to obtain the miRNA-disease association matrix.

2. The method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition as claimed in claim 1, wherein the data collection stage comprises:

(1) descriptors are downloaded from MeSH, which is divided into 16 categories: anatomical terms (A), organisms (B), diseases (C), chemicals and drugs (D), etc. Based on directed acyclic graphs (DAGs), the relationships between diseases are obtained from category C;

(2) the disease-associated gene sets in the Disease Ontology (DO) are collected to measure the functional similarity of diseases through their associated genes;

(3) a weighted gene network is downloaded from HumanNet, containing more than 500,000 interactions among 17929 genes, for calculation of functional similarity;

(4) 5186 associations were obtained from the HMDD2.0 database, involving 328 diseases and 495 miRNAs.

3. The method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition according to claim 1, characterized in that the disease similarity analysis stage comprises:

(1) calculating the semantic similarity of disease pairs using the MeSH disease hierarchical directed acyclic graph, and deriving a similarity function between disease pairs:

(2) measuring disease similarity through the functional similarity of disease-related genes, thereby obtaining a second expression of disease similarity:

(3) where DO1 is the group associated with disease D1, and DO2 is the corresponding group for disease D2. The resulting disease similarity lies in the range from 0 to 1. Finally, the two similarity networks are fused to obtain the disease similarity network:

4. The method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition according to claim 1, characterized in that the miRNA similarity analysis stage comprises:

(1) the more related diseases two miRNAs share, the greater their functional similarity, so the similarity of two miRNAs is estimated by considering the contributions of the related diseases;

(2) the miRNA similarity is measured by integrating validated miRNA-gene interactions with a weighted functional interaction network, which solves the problem that the functional similarity derived from miRNA-associated diseases depends on existing associations and is not very effective; that is, the value of each gene is defined using log-likelihood scores:

5. The method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition according to claim 1, characterized in that learning the neighborhood-based association of miRNA with disease comprises:

(1) constructing an adjacency matrix representing the original association network, and addressing the poor prediction of potential associations between miRNA and disease by adopting a weighted K nearest neighbor algorithm:

(2) using the same approach, a definition of the interaction between diseases was constructed:

(3) subsequently, MD is updated so that the new association values replace the zero entries (MD_ij = 0) of the original matrix:

MD = max(MD, MD′).

6. The method for predicting potential association of miRNA and disease based on constraint probability matrix decomposition according to claim 1, characterized in that the constraint probability matrix decomposition of disease-related miRNA comprises: analyzing the conditional distribution function and the zero-mean Gaussian priors, providing a representation of the miRNA feature vectors, describing the conditional distribution of the scores, maximizing the log-posterior probability, and predicting to obtain the miRNA-disease association matrix.