CN110767263B - Non-coding RNA and disease associated prediction method based on sparse subspace learning - Google Patents

Info

Publication number
CN110767263B
CN110767263B CN201910991283.3A CN201910991283A CN110767263B CN 110767263 B CN110767263 B CN 110767263B CN 201910991283 A CN201910991283 A CN 201910991283A CN 110767263 B CN110767263 B CN 110767263B
Authority
CN
China
Prior art keywords
matrix
disease
coding rna
similarity
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910991283.3A
Other languages
Chinese (zh)
Other versions
CN110767263A (en
Inventor
汤永
伍亚舟
易东
卫泽良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Military Medical University TMMU
Original Assignee
Chinese PLA General Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese PLA General Hospital filed Critical Chinese PLA General Hospital
Priority to CN201910991283.3A priority Critical patent/CN110767263B/en
Publication of CN110767263A publication Critical patent/CN110767263A/en
Application granted granted Critical
Publication of CN110767263B publication Critical patent/CN110767263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Mathematical Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Pathology (AREA)
  • Algebra (AREA)
  • Primary Health Care (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a non-coding RNA-disease association prediction method based on sparse subspace learning, belonging to the field of systems biology. The method comprises the following steps: step one, constructing a non-coding RNA-disease association adjacency matrix, and then calculating the Gaussian interaction profile kernel similarity of the non-coding RNAs and of the diseases respectively; step two, calculating a graph-theoretic feature matrix and a statistical feature matrix from the two similarity matrices and the adjacency matrix, then constructing an objective function and solving for the mapping matrix G; step three, solving for the prediction score matrix of non-coding RNA-disease association pairs and ranking the scores to give the final prediction result. The invention integrates graph theory, statistical methods and machine learning, can effectively use the information of negative samples in non-coding RNA-disease association data, predicts efficiently, accurately and quickly the non-coding RNAs significantly associated with the occurrence and development of diseases, and addresses the long duration and high cost of biological experimental methods.

Description

Sparse subspace learning-based non-coding RNA and disease associated prediction method
Technical Field
The invention relates to the field of systems biology, in particular to a non-coding RNA-disease association prediction method based on sparse subspace learning.
Background
Non-coding RNA (ncRNA) refers to RNA molecules in the transcriptome that do not encode proteins; common types include microRNA, lncRNA and circRNA.
MicroRNAs (miRNAs) are endogenous single-stranded RNAs of about 22 nucleotides in length that are present in a variety of species, including plants, animals and certain viruses. As important post-transcriptional regulators, they inhibit gene expression and promote mRNA degradation by base-pairing with the 3' untranslated regions (UTRs) of target RNAs. They play key roles in a variety of biological processes, such as cell division, differentiation, development, metabolism, infection, aging, apoptosis and signal transduction. Experimental evidence suggests that aberrant expression of miRNAs is associated with a number of human diseases. For example, up-regulated expression of miR-181a may trigger progression to human type 1 diabetes. In addition, hypercholesterolemia is closely associated with increased liver miR-223 levels in atherosclerotic mice. Furthermore, miR-21, miR-494 and miR-1973 have been shown to be disease response biomarkers in classical Hodgkin's lymphoma.
Long non-coding RNAs (lncRNAs) are RNAs longer than 200 nucleotides. They take part in the regulation of various biological processes, including epigenetic modification of the genome, post-transcriptional and translational regulation, and enhancer RNA effects, and thereby regulate cell proliferation, differentiation, migration, apoptosis, immunity and the like. Experiments show that lncRNA AC006449.2 can act as a tumor suppressor in ovarian cancer cells. In addition, liver cancer cells that highly express lncRNA H19 enhance, via exosomes, the proliferation, migration and invasion of adjacent liver cancer cells, promoting the occurrence and development of liver cancer. Big-data analysis shows that lncRNA RP11-214F16.8 is highly expressed in breast cancer and promotes the proliferation of breast cancer cells, thereby advancing the course of the disease.
Circular RNA (circRNA) is a covalently closed circular RNA molecule formed by back-splicing, with no 5' cap and no 3' poly(A) tail; it is conserved, stable, and tissue- and stage-specific. Many studies have found that circRNAs take part, through several mechanisms, in the regulation of animal growth and development and in the occurrence and development of diseases. Studies have found that forced expression of the circRNA HRCR in mice with ISO-induced myocardial hypertrophy significantly alleviates the hypertrophy. Experiments show that the circRNA Cdr1as can influence insulin secretion and islet β-cell turnover. Colorectal cancer studies indicate that hsa_circ_001988 is reduced in cancer tissues and correlates with the degree of tumor cell differentiation and with prognosis.
Since non-coding RNAs affect the occurrence and progression of a variety of complex human diseases, identifying potential ncRNA-disease associations gives a better understanding of disease pathogenesis at the ncRNA level, which in turn facilitates disease diagnosis and treatment. Because revealing such associations by experiment is expensive and time consuming, novel and efficient computational methods for association prediction are needed. Existing methods, however, share several disadvantages, such as failing to take global similarity into account, higher false-positive rates related to transition components, or the inexactness of using randomly selected, unverified samples as approximate negative samples.
Disclosure of Invention
To overcome the above defects of the prior art, the present invention provides a non-coding RNA-disease association prediction method based on sparse subspace learning (Graph Regularized Sparse Subspace Learning method for ncRNA-Disease Association prediction, GRSSL-RDA for short). The method first calculates the Gaussian interaction profile kernel similarity of the ncRNAs and of the diseases; it then calculates a graph-theoretic feature matrix and a statistical feature matrix from the ncRNA-disease association adjacency matrix and the two similarity matrices; it then constructs an objective function and solves for the mapping matrix G to obtain the prediction score matrix of ncRNA-disease association pairs; finally, it ranks the scores to obtain the final prediction result. The method can accurately and efficiently predict ncRNAs related to the occurrence and development of diseases from ncRNA-disease association data.
In order to achieve the above object, the present invention provides the following technical solution, a sparse subspace learning-based non-coding RNA and disease association prediction method, specifically comprising the following steps:
step one, inputting known non-coding RNA-disease association pairs to construct an adjacency matrix Y;
step two, respectively calculating the Gaussian interaction profile kernel similarity between diseases and the Gaussian interaction profile kernel similarity between non-coding RNAs:
if a disease d(i) is associated with a non-coding RNA, the corresponding position is marked as 1, otherwise as 0, forming a row vector of size 1 × nm consisting of 0s and 1s, denoted the interaction profile IP(d(i)) of the disease d(i); the Gaussian interaction profile kernel similarity between diseases d(i) and d(j) is then calculated:
$S_d(d(i), d(j)) = \exp\left(-\gamma_d \|IP(d(i)) - IP(d(j))\|^2\right)$
in the above formula, the parameter $\gamma_d$ controls the kernel bandwidth and is obtained by normalizing a new bandwidth parameter $\gamma'_d$:
Figure BDA0002238376080000031
the Gaussian interaction profile kernel similarity between non-coding RNAs m(i) and m(j) is defined in a similar manner:
$S_m(m(i), m(j)) = \exp\left(-\gamma_m \|IP(m(i)) - IP(m(j))\|^2\right)$
Figure BDA0002238376080000032
wherein nd represents the number of diseases, nm represents the number of non-coding RNAs, and $\gamma'_d = \gamma'_m = 1$ is taken;
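The kernel computation of step two can be sketched in NumPy as follows. The normalization formula is an image that is not reproduced in this text, so the sketch assumes the form commonly used for Gaussian interaction profile kernels, namely γ = γ' divided by the mean squared norm of the interaction profiles; the function name `gip_kernel` and the toy matrix are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # ASSUMPTION: gamma = gamma' / (mean squared norm of the profiles).
    mean_sq = sq_norms.mean()
    gamma = gamma_prime / mean_sq if mean_sq > 0 else gamma_prime
    # Pairwise squared Euclidean distances between interaction profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

# Toy adjacency matrix: 3 diseases (rows) x 4 ncRNAs (columns).
Y = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])
S_d = gip_kernel(Y)    # disease similarity, 3 x 3 (rows of Y are disease profiles)
S_m = gip_kernel(Y.T)  # ncRNA similarity, 4 x 4 (columns of Y are ncRNA profiles)
```

Both matrices come out symmetric with ones on the diagonal and entries in (0, 1], which matches the properties the text states for the similarity matrices.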
Step three, extracting the feature vectors and assembling the feature matrices: from the disease similarity matrix $S_d$, the non-coding RNA similarity matrix $S_m$ and the known disease-non-coding RNA association matrix Y, extracting the statistical feature vector $X_1$ (or $X_1'$) and the graph-theoretic feature vector $X_2$ (or $X_2'$) of each disease (or non-coding RNA):
(1) The class I features $X_1$ (or $X_1'$) of each disease (or non-coding RNA) include:
① for disease d(i) (or non-coding RNA m(j)), the number of associations observed in the corresponding i-th row (or j-th column) of matrix Y;
② the mean of the similarity scores, i.e. the mean of the i-th row (or j-th column) of $S_d$ (or $S_m$);
③ the range [0,1] is divided into n intervals, and the proportion of the similarity scores of d(i) (or m(j)) falling into each interval is calculated;
(2) The class II features $X_2$ (or $X_2'$) of each disease (or non-coding RNA) include:
① the number of neighbors of the node in the unweighted graph of $S_d$ (or $S_m$);
② the similarity values of the k nearest neighbors of the node;
③ the mean of the class I features over the k nearest neighbors;
④ the mean of the class I features over the k nearest neighbors of the node, weighted by the similarity values;
⑤ the betweenness, closeness and eigenvector centrality of the node obtained from the matrix $S_d$ (or $S_m$);
⑥ the PageRank score of the node obtained from the matrix $S_d$ (or $S_m$);
(3) The class III features $X_3$ (or $X_3'$) of each disease-non-coding RNA pair include:
① the latent vectors of the non-coding RNAs and diseases obtained by factorizing matrix Y;
② the betweenness, closeness and eigenvector centrality of the node obtained from matrix Y;
③ the PageRank score of the node obtained from matrix Y;
concatenating the 3 classes of disease features row by row gives the overall disease feature matrix, i.e. $X_d = [X_1\ X_2\ X_3]$;
concatenating the 3 classes of non-coding RNA features row by row gives the overall non-coding RNA feature matrix, i.e. $X_m = [X_1'\ X_2'\ X_3']$;
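As an illustration of the class I (statistical) features listed above, the following NumPy sketch builds one row per disease: the association count, the mean similarity, and the proportion of similarity scores in each of n equal-width intervals of [0,1]. The function name `class_one_features`, the inclusion of the self-similarity in the mean, and the toy matrices are assumptions made for the example.

```python
import numpy as np

def class_one_features(assoc_counts, S, n_bins=5):
    """Count, mean similarity and similarity histogram for each node."""
    S = np.asarray(S, dtype=float)
    counts = np.asarray(assoc_counts, dtype=float).reshape(-1, 1)
    # ASSUMPTION: the mean is taken over the full row, self-similarity included.
    mean_sim = S.mean(axis=1, keepdims=True)
    # Equal-width bins over [0, 1]; the last bin includes the right edge 1.0.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    hist = np.stack([np.histogram(row, bins=edges)[0] for row in S]).astype(float)
    hist /= S.shape[1]  # convert counts to proportions
    return np.hstack([counts, mean_sim, hist])

# Toy data: 3 diseases x 4 ncRNAs and a hand-written similarity matrix.
Y = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])
S = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
X1 = class_one_features(Y.sum(axis=1), S)  # shape (3, 1 + 1 + 5)
```

For the non-coding RNA side the same function would be called with the column sums of Y and the matrix $S_m$.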
Step four, constructing an objective function as follows:
Figure BDA0002238376080000041
wherein F represents the association score matrix (unknown, to be solved), L is the Laplacian regularization matrix of the diseases or ncRNAs (known, obtained from $S_d$ or $S_m$), U is the decision matrix (known, set to the identity matrix), b is the bias (unknown, to be solved), $1_n$ is the all-ones vector, μ and λ are regularization coefficients (tunable parameters taking non-negative values), G is the mapping matrix (unknown, to be solved), Tr(·) denotes the trace of a matrix, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix;
in the objective function, the first term is the Laplacian regularization, which captures the intrinsic manifold structure of the data so that the method can use a small amount of labeled data together with a large amount of unlabeled data; the second term measures the difference between the predicted scores and the actual association matrix; the third term is the subspace regression term, which links the feature matrix to the predicted scores and uses the Frobenius norm to minimize their difference (in this term the miRNA-disease interaction information is compressed from the high-dimensional space F to the low-dimensional space X through the projection matrix G, which falls under "subspace learning", and its form matches the general regression form y = Ax + B, hence the name); the fourth term introduces the $\ell_{2,1/2}$ norm, $\lambda\|G\|_{2,1/2}^{1/2}$, which selects the most discriminative sparse features;
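The formula image for the objective function is not reproduced in this text. Reading the four terms as described, one plausible reconstruction is given below. It is an assumption rather than the patent's verbatim formula; in particular, the orientation of X (features × samples) and the row-wise definition of the $\ell_{2,1/2}$ term are assumed.

```latex
\min_{F,\,G,\,b}\;
\operatorname{Tr}\!\left(F^{T} L F\right)
+ \operatorname{Tr}\!\left((F - Y)^{T} U (F - Y)\right)
+ \mu \left\| X^{T} G + 1_n b^{T} - F \right\|_F^{2}
+ \lambda \left\| G \right\|_{2,1/2}^{1/2},
\qquad
\left\| G \right\|_{2,1/2}^{1/2} = \sum_i \left\| g^{i} \right\|_2^{1/2}
```

Here $g^{i}$ denotes the i-th row of G. Under this reading, eliminating b and setting the gradient with respect to F to zero gives $(L + U + \mu A)F = UY + \mu A X^{T} G$ with the centering matrix $A = I - \tfrac{1}{n} 1_n 1_n^{T}$ (also an assumption), which agrees with the expressions F = JK, $J = (L + U + \mu A)^{-1}$ and $K = UY + \mu A X^{T} G$ stated in the text.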
step five, solving the objective function, the matrix G being obtained with the following iterative algorithm:
Input: adjacency matrix Y, disease similarity matrix $S_d$, ncRNA similarity matrix $S_m$, non-negative regularization parameters μ and λ;
Procedure:
(1) From $S_d$ (or $S_m$), calculate the corresponding Laplacian matrix $L \in R^{r \times nd}$, L = D - W, where D is a diagonal matrix, and
Figure BDA0002238376080000051
(2) From Y, $S_d$ and $S_m$, calculate the disease feature matrix $X_d$ (or the ncRNA feature matrix $X_m$) in the manner described in step three;
(3) Initialize the decision matrix $U \in R^{r \times nd}$;
(4) Calculate
Figure BDA0002238376080000053
(5) Calculate $J = (L + U + \mu A)^{-1}$;
(6) Calculate $M = XA(\mu I - \mu^2 J)AX^T$;
(7) Calculate $N = \mu XAJUY$;
(8) Let t = 0 and let $G_0 \in R^{r \times nd}$ be a random matrix;
(9) Repeat:
Calculate the diagonal matrix $D_t$,
Figure BDA0002238376080000052
Update $G_{t+1} = (M + 4\lambda D_t)^{-1} N$
t = t + 1
Until convergence;
Output: the optimal mapping matrix G
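The iteration above can be sketched in NumPy as follows. Several pieces are formula images that are not reproduced in this text, so the sketch fills them with labeled assumptions: W is taken to be the similarity matrix itself with D its row sums, A is taken to be the centering matrix I - (1/n)11ᵀ, and the diagonal of $D_t$ is taken as 1 / (16·‖gⁱ‖^{3/2}), a common iteratively reweighted choice that makes 4λ$D_t$ match the gradient of a row-wise $\ell_{2,1/2}$ penalty. The dimensions (X as features × samples), the stopping tolerance and the names `laplacian` and `solve_mapping` are also assumptions.

```python
import numpy as np

def laplacian(S):
    """Graph Laplacian L = D - W; ASSUMES W = S and D = diag(row sums of W)."""
    W = np.asarray(S, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def solve_mapping(X, Y, L, mu=1.0, lam=1.0, n_iter=1000, tol=1e-6, eps=1e-8, seed=0):
    """X: (d, n) features, Y: (n, c) associations, L: (n, n) Laplacian."""
    d, n = X.shape
    U = np.eye(n)                          # decision matrix, identity as in the text
    A = np.eye(n) - np.ones((n, n)) / n    # ASSUMED centering matrix
    J = np.linalg.inv(L + U + mu * A)
    M = X @ A @ (mu * np.eye(n) - mu ** 2 * J) @ A @ X.T
    N = mu * X @ A @ J @ U @ Y
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, Y.shape[1]))   # random initial mapping matrix
    for _ in range(n_iter):
        row_norms = np.maximum(np.linalg.norm(G, axis=1), eps)
        # ASSUMED reweighting for the l_{2,1/2} term (not the patent's formula image).
        D_t = np.diag(1.0 / (16.0 * row_norms ** 1.5))
        G_new = np.linalg.solve(M + 4.0 * lam * D_t, N)
        change = np.linalg.norm(G_new - G) / max(np.linalg.norm(G), eps)
        G = G_new
        if change < tol:                   # ASSUMED relative-change stopping rule
            break
    F = J @ (U @ Y + mu * A @ X.T @ G)     # F = J K with K = U Y + mu A X^T G
    return G, F

# Toy problem with random data, for shape checking only.
rng = np.random.default_rng(1)
n, d, c = 6, 4, 5
S = rng.random((n, n))
S = (S + S.T) / 2
X = rng.standard_normal((d, n))
Y = (rng.random((n, c)) < 0.3).astype(float)
G, F = solve_mapping(X, Y, laplacian(S), n_iter=50)
```

Solving the linear system with `np.linalg.solve` instead of forming the inverse of M + 4λ$D_t$ is a numerical convenience and does not change the update.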
Then calculate the matrix F = JK, where $J = (L + U + \mu A)^{-1}$ and $K = UY + \mu AX^T G$, and further obtain
Figure BDA0002238376080000061
Step six, after obtaining X, G and b, calculating the prediction score matrix $F_m$ of the non-coding RNA space:
Figure BDA0002238376080000062
Similarly, the prediction score matrix $F_d$ of the disease space is calculated;
Step seven, linearly combining $F_m$ and $F_d$ to obtain the final prediction score matrix:
$F_{prediction} = \sigma F_m + (1 - \sigma) F_d$
the combination coefficient σ can be further searched and optimized;
step eight, ranking the calculated relation scores of the non-coding RNA-disease association pairs to give the final prediction result.
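Steps seven and eight can be sketched as a weighted sum followed by a per-disease ranking. The sketch assumes that both score matrices have already been brought to the same disease × ncRNA orientation, and it excludes already known associations from the ranking, which the text does not specify; the function name `combine_and_rank` and the toy numbers are hypothetical.

```python
import numpy as np

def combine_and_rank(F_m, F_d, known, sigma):
    """Combine two score matrices (both nd x nm) and rank candidates per disease."""
    F = sigma * np.asarray(F_m, dtype=float) + (1.0 - sigma) * np.asarray(F_d, dtype=float)
    # ASSUMPTION: known associations are pushed to the end of each ranking.
    scores = np.where(np.asarray(known) == 1, -np.inf, F)
    order = np.argsort(-scores, axis=1)  # column indices, best candidate first
    return F, order

F_m = np.array([[0.2, 0.8, 0.5]])
F_d = np.array([[0.4, 0.6, 0.9]])
known = np.array([[0, 1, 0]])
F_pred, order = combine_and_rank(F_m, F_d, known, sigma=0.9)
```

In this toy case the known pair in column 1 is ranked last, and the remaining candidates are ordered by their combined scores.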
The invention has the technical effects and advantages that:
1. A Laplacian regularization term is introduced into the subspace learning framework, which effectively captures the intrinsic manifold structure of the data and improves prediction performance; the model is semi-supervised and needs only positive samples and unlabeled samples, without relying on negative samples, which greatly reduces the difficulty of model construction.
2. Adding the $\ell_{2,1/2}$-norm constraint ensures the sparsity of the mapping matrix G, which weakens the influence of noisy data and yields more reliable prediction results.
3. The method reasonably integrates graph theory, statistical methods and machine learning, gives ncRNA-disease association predictions efficiently, accurately and quickly, and has good scalability and robustness.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 shows the five-fold cross-validation results of the present invention and of several reported methods on the same data set.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments, which use miRNAs as the example, are only some rather than all of the embodiments (ncRNAs also include other species, such as lncRNA and circRNA). All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The known human miRNA-disease association data used in the embodiment were retrieved and downloaded from the HMDD v2.0 database (website http://www.cuilab.cn/HMDD); after cleaning, classifying and standardizing the downloaded data, 5430 experimentally validated human miRNA-disease associations were obtained, involving 383 diseases and 495 miRNAs.
Then, the sparse subspace learning-based non-coding RNA-disease association prediction method shown in FIG. 1 is performed, which specifically comprises the following steps:
step one, inputting known miRNA-disease association pairs, and constructing an adjacency matrix Y:
Figure BDA0002238376080000071
a matrix Y with 383 rows and 495 columns, whose elements are 0 or 1, is obtained;
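Building Y from a list of association pairs can be sketched as below. The function name `build_adjacency` and the placeholder names `d1`, `m1` and so on are hypothetical; with the HMDD v2.0 data described above the same code would produce a 383 × 495 matrix.

```python
import numpy as np

def build_adjacency(pairs, diseases, mirnas):
    """Return Y with Y[i, j] = 1 if disease i and miRNA j are a known pair."""
    d_index = {name: i for i, name in enumerate(diseases)}
    m_index = {name: j for j, name in enumerate(mirnas)}
    Y = np.zeros((len(diseases), len(mirnas)), dtype=int)
    for disease, mirna in pairs:
        Y[d_index[disease], m_index[mirna]] = 1
    return Y

# Placeholder identifiers, not real associations.
diseases = ["d1", "d2", "d3"]
mirnas = ["m1", "m2", "m3", "m4"]
pairs = [("d1", "m1"), ("d1", "m3"), ("d3", "m2")]
Y = build_adjacency(pairs, diseases, mirnas)
```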
step two, respectively calculating the Gaussian interaction profile kernel similarity between diseases and the Gaussian interaction profile kernel similarity between miRNAs:
if a disease d(i) is associated with a miRNA, the corresponding position is marked as 1, otherwise as 0, forming a row vector of size 1 × 495 consisting of 0s and 1s, denoted the interaction profile IP(d(i)) of the disease d(i); the Gaussian interaction profile kernel similarity between diseases d(i) and d(j) is then calculated:
$S_d(d(i), d(j)) = \exp\left(-\gamma_d \|IP(d(i)) - IP(d(j))\|^2\right)$
in the above formula, the parameter $\gamma_d$ controls the kernel bandwidth and is obtained by normalizing a new bandwidth parameter $\gamma'_d$:
Figure BDA0002238376080000072
the Gaussian interaction profile kernel similarity between miRNAs m(i) and m(j) is defined in a similar manner:
$S_m(m(i), m(j)) = \exp\left(-\gamma_m \|IP(m(i)) - IP(m(j))\|^2\right)$
Figure BDA0002238376080000081
taking $\gamma'_d = \gamma'_m = 1$;
where nd represents the number of diseases, here 383, and nm represents the number of miRNAs, here 495. The calculation yields a symmetric matrix $S_d$ of size 383 × 383 and a symmetric matrix $S_m$ of size 495 × 495, with every element of $S_d$ and $S_m$ between 0 and 1; for example, the similarity between d(12) Aging and d(18) Amyotrophic Lateral Sclerosis (ALS) is 0.7028, and the similarity between m(186) hsa-mir-539 and m(424) hsa-mir-4792 is 0.5787;
step three, extracting the feature vectors and assembling the feature matrices: from the disease similarity matrix $S_d$, the miRNA similarity matrix $S_m$ and the known disease-miRNA association matrix Y, extracting the statistical feature vector $X_1$ (or $X_1'$) and the graph-theoretic feature vector $X_2$ (or $X_2'$) of each disease (or miRNA):
(1) The class I features $X_1$ (or $X_1'$) of each disease (or miRNA) include:
① for disease d(i) (or miRNA m(j)), the number of associations observed in the corresponding i-th row (or j-th column) of matrix Y;
② the mean of the similarity scores, i.e. the mean of the i-th row (or j-th column) of $S_d$ (or $S_m$);
③ the range [0,1] is divided into n = 5 intervals, and the proportion of the similarity scores of d(i) (or m(j)) falling into each interval is calculated;
after this step, the class I feature matrix $X_1$ of the diseases (383 rows × (1+1+5) columns) and the class I feature matrix $X_1'$ of the miRNAs (495 rows × (1+1+5) columns) are generated;
(2) The class II features $X_2$ (or $X_2'$) of each disease (or miRNA) include:
① the number of neighbors of the node in the unweighted graph of $S_d$ (or $S_m$);
② the similarity values of the k = 10 nearest neighbors of the node;
③ the mean of the class I features over the k = 10 nearest neighbors;
④ the mean of the class I features over the k = 10 nearest neighbors of the node, weighted by the similarity values;
⑤ the betweenness, closeness and eigenvector centrality of the node obtained from the matrix $S_d$ (or $S_m$);
⑥ the PageRank score of the node obtained from the matrix $S_d$ (or $S_m$);
after this step, the class II feature matrix $X_2$ of the diseases (383 rows × (1+10+7+7+3+1) columns) and the class II feature matrix $X_2'$ of the miRNAs (495 rows × (1+10+7+7+3+1) columns) are generated;
(3) The class III features $X_3$ (or $X_3'$) of each disease-miRNA pair include:
① the latent vectors of the miRNAs and diseases obtained by factorizing matrix Y (first 5 columns);
② the betweenness, closeness and eigenvector centrality of the node obtained from matrix Y;
③ the PageRank score of the node obtained from matrix Y;
after this calculation, the class III feature matrix $X_3$ of the diseases (383 rows × (5+3+1) columns) and the class III feature matrix $X_3'$ of the miRNAs (495 rows × (5+3+1) columns) are generated;
concatenating the 3 classes of disease features row by row, i.e. $X_d = [X_1\ X_2\ X_3]$, gives the overall disease feature matrix $X_d$ (383 rows × 45 columns);
concatenating the 3 classes of miRNA features row by row, i.e. $X_m = [X_1'\ X_2'\ X_3']$, gives the overall miRNA feature matrix $X_m$ (495 rows × 45 columns);
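Some of the class II (graph-theoretic) features listed above can be sketched in NumPy: the similarity values of the k nearest neighbors, the mean of the class I features over those neighbors, and a PageRank score computed by power iteration on the similarity-weighted graph. The damping factor 0.85, the exclusion of self-similarity, the assumption that every node has at least one positive similarity, and all function names are choices made for this example, not details given in the text.

```python
import numpy as np

def knn_similarities(S, k):
    """Indices and similarity values of the k most similar other nodes."""
    S = np.array(S, dtype=float)          # np.array copies, so the input is untouched
    np.fill_diagonal(S, -np.inf)          # ASSUMPTION: a node is not its own neighbor
    idx = np.argsort(-S, axis=1)[:, :k]
    vals = np.take_along_axis(S, idx, axis=1)
    return idx, vals

def neighbor_mean(X1, idx):
    """Mean of the class I feature rows over each node's k nearest neighbors."""
    return np.asarray(X1, dtype=float)[idx].mean(axis=1)

def pagerank(S, damping=0.85, n_iter=200, tol=1e-12):
    """Power-iteration PageRank on the similarity-weighted graph."""
    W = np.array(S, dtype=float)
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=0, keepdims=True)  # column-stochastic; ASSUMES no zero column
    n = W.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r_new = (1.0 - damping) / n + damping * (P @ r)
        converged = np.abs(r_new - r).sum() < tol
        r = r_new
        if converged:
            break
    return r

S = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
X1_toy = np.arange(6, dtype=float).reshape(3, 2)  # stand-in for class I features
idx, vals = knn_similarities(S, k=2)
nbr_mean = neighbor_mean(X1_toy, idx)
pr = pagerank(S)
```

With the embodiment's settings the same calls would use k = 10 and the 383 × 383 or 495 × 495 similarity matrix; centrality measures such as betweenness and closeness are not covered by this sketch.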
step four, constructing an objective function as follows:
Figure BDA0002238376080000091
wherein F represents the association score matrix to be predicted (unknown, to be solved), L is the Laplacian regularization matrix of the diseases or miRNAs (known, obtained from $S_d$ or $S_m$), U is the decision matrix (known, here set to the identity matrix), b is the bias (unknown, to be solved), $1_n$ is the all-ones vector, μ and λ are regularization coefficients (tunable parameters, here directly set to 1), G is the mapping matrix (unknown, to be solved), Tr(·) denotes the trace of a matrix, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix;
Step five, solving the objective function, the matrix G being obtained with the following iterative algorithm;
Input: adjacency matrix Y, disease similarity matrix $S_d$ (or miRNA similarity matrix $S_m$), non-negative regularization parameters μ and λ (both directly set to 1 here);
Procedure:
(1) From $S_d$ (or $S_m$), calculate the corresponding Laplacian matrix $L \in R^{r \times nd}$, L = D - W, where D is a diagonal matrix, and
Figure BDA0002238376080000101
(2) From Y, $S_d$ and $S_m$, calculate the disease feature matrix $X_d$ (or the miRNA feature matrix $X_m$) in the manner described in step three;
(3) Initialize the decision matrix $U \in R^{r \times nd}$;
(4) Computing
Figure BDA0002238376080000104
(5) Calculate $J = (L + U + \mu A)^{-1}$;
(6) Calculate $M = XA(\mu I - \mu^2 J)AX^T$;
(7) Calculate $N = \mu XAJUY$;
(8) Let t = 0 and let $G_0 \in R^{r \times nd}$ be a random matrix;
(9) Repeat:
Calculate the diagonal matrix $D_t$,
Figure BDA0002238376080000102
Update $G_{t+1} = (M + 4\lambda D_t)^{-1} N$
t = t + 1
Until convergence
Output: the optimal mapping matrix G
When the algorithm is implemented in MATLAB, the mapping matrix G is initialized to a random matrix of 100 rows × 383 (or 495) columns, and the number of iterations is set to 1000, with the loop also ending early once the following requirement is met:
Figure BDA0002238376080000103
On exiting the iteration loop, the matrix G is obtained;
then the matrix F = JK is calculated, where $J = (L + U + \mu A)^{-1}$ and $K = UY + \mu AX^T G$, and further obtain
Figure BDA0002238376080000111
Step six, after obtaining X, G and b, calculating the prediction score matrix $F_m$ of the miRNA space:
Figure BDA0002238376080000112
Similarly, the prediction score matrix $F_d$ of the disease space is calculated;
Step seven, linearly combining $F_m$ and $F_d$ to obtain the final prediction score matrix:
$F_{prediction} = \sigma F_m + (1 - \sigma) F_d$
the combination coefficient σ is optimized and taken as 0.9;
step eight, ranking the calculated relation scores of the miRNA-disease association pairs to give the final prediction result.
The validity of the invention is verified:
The sparse subspace learning-based non-coding RNA-disease association prediction method shown in FIG. 1 is evaluated by five-fold cross-validation, carried out as follows: all known miRNA-disease associations are randomly and evenly divided into 5 groups; each of the 5 groups is used in turn as the test samples, and the remaining groups are used as the training samples.
The training samples are then used as the input of the method to obtain the prediction result, and finally the prediction score of each test sample in the group is compared with the scores of the candidate miRNAs. To reduce the effect of the random partitioning on the test samples, five-fold cross-validation was repeated 100 times.
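The fold construction described above can be sketched as follows: the known associations are shuffled, split into five nearly equal groups, and each group in turn is hidden from the training matrix. Hiding a test association by setting its entry to 0 is an assumption about how the held-out pairs are removed; the function name `five_fold_splits` and the toy matrix are hypothetical.

```python
import numpy as np

def five_fold_splits(Y, n_folds=5, seed=0):
    """Yield (Y_train, (test_rows, test_cols)) for each fold of known associations."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(Y)             # positions of the known associations
    perm = rng.permutation(len(rows))
    for fold in np.array_split(perm, n_folds):
        Y_train = Y.copy()
        # ASSUMPTION: held-out associations are treated as unknown (set to 0).
        Y_train[rows[fold], cols[fold]] = 0
        yield Y_train, (rows[fold], cols[fold])

# Toy matrix with 10 known associations.
Y = np.zeros((4, 5), dtype=int)
Y[[0, 0, 1, 1, 2, 2, 3, 3, 3, 0], [0, 1, 1, 2, 2, 3, 3, 4, 0, 4]] = 1
folds = list(five_fold_splits(Y))
```

Repeating the procedure with different values of `seed` corresponds to the 100 repetitions mentioned in the text.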
The results are shown in FIG. 2, which compares the performance of the present method, GRSSL-RDA, with several state-of-the-art existing disease-miRNA association prediction models. The method obtains an AUROC (area under the ROC curve) of 0.9030 ± 0.0005972 in five-fold cross-validation, showing better prediction performance than several earlier classical models.
In addition, for a specific disease such as lung cancer, miRNA-lung cancer association prediction was carried out with the method based on the known associations in HMDD v2.0, and 48 of the top 50 miRNAs in the result are supported by external databases.
The first and third columns of the table below list the miRNAs ranked 1-25 and 26-50 in the prediction result, respectively. In the table, I, II and III denote the three external databases dbDEMC, miR2Disease and HMDD v3.0, respectively.
Figure BDA0002238376080000121
Finally, it should be noted that the above description only illustrates preferred embodiments of the present invention and is not to be construed as limiting the invention; any modification, equivalent substitution or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (2)

1. The non-coding RNA and disease associated prediction method based on sparse subspace learning is characterized by comprising the following steps:
inputting known non-coding RNA-disease association pairs to construct an adjacency matrix Y;
step two, calculating the Gaussian interaction profile kernel similarity between diseases and the Gaussian interaction profile kernel similarity between non-coding RNAs, respectively:
for a disease d(i), each position is marked 1 if the disease is associated with the corresponding non-coding RNA and 0 otherwise, forming a 1 × nm row vector of 0s and 1s, which is denoted as the interaction profile IP(d(i)) of disease d(i); the Gaussian interaction profile kernel similarity between diseases d(i) and d(j) is then calculated as:
$S_d(d(i), d(j)) = \exp\left(-\gamma_d \lVert IP(d(i)) - IP(d(j)) \rVert^2\right)$
in the above formula, the parameter $\gamma_d$ controls the kernel bandwidth and is obtained by normalizing a new bandwidth parameter $\gamma'_d$:
Figure FDA0002238376070000011
the Gaussian interaction profile kernel similarity between non-coding RNAs m(i) and m(j) is defined in a similar manner:
$S_m(m(i), m(j)) = \exp\left(-\gamma_m \lVert IP(m(i)) - IP(m(j)) \rVert^2\right)$
Figure FDA0002238376070000012
wherein nd denotes the number of diseases, nm denotes the number of non-coding RNAs, and $\gamma'_d = \gamma'_m = 1$ is taken;
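Step two can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation: the function name `gip_kernel` and the toy matrix are hypothetical, and the bandwidth normalization (dividing the new bandwidth parameter by the mean squared norm of the interaction profiles) is the form commonly used for this kernel and is assumed here, since the patent gives that formula only as an image.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    # ASSUMPTION: bandwidth normalised by the mean squared profile norm.
    gamma = gamma_prime / mean_sq_norm
    # Pairwise squared Euclidean distances between profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    np.maximum(sq_dist, 0.0, out=sq_dist)  # guard against tiny negative values
    return np.exp(-gamma * sq_dist)

# Toy adjacency matrix Y: rows are diseases (nd = 3), columns are non-coding RNAs (nm = 4).
Y = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])
S_d = gip_kernel(Y)    # nd x nd disease similarity
S_m = gip_kernel(Y.T)  # nm x nm non-coding RNA similarity
```

Each row of `Y` serves as a disease interaction profile and each column as a non-coding RNA interaction profile, so the same function covers both similarity matrices.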
step three, extracting feature vectors and assembling the feature matrices: from the disease similarity matrix $S_d$, the non-coding RNA similarity matrix $S_m$ and the known disease-non-coding RNA association matrix Y, extracting for each disease (or non-coding RNA) the statistical feature vector $X_1$ (or $X_1'$) and the graph-theoretic feature vector $X_2$ (or $X_2'$):
(1) the class I features $X_1$ (or $X_1'$) of each disease (or non-coding RNA) include:
① for disease d(i) (or non-coding RNA m(j)), the number of associations observed in the corresponding i-th row (or j-th column) of matrix Y;
② the mean similarity score, i.e. the average of the i-th row (or j-th column) of $S_d$ (or $S_m$);
③ the proportion of the similarity scores of d(i) (or m(j)) falling into each of the n intervals obtained by dividing the range [0,1];
(2) the class II features $X_2$ (or $X_2'$) of each disease (or non-coding RNA) include:
① the number of neighbors of the node in the unweighted graph of $S_d$ (or $S_m$);
② the similarity values of the k nearest neighbors of the node;
③ the mean of the class I features over the k nearest neighbors;
④ the mean of the class I features over the k nearest neighbors of the node in the graph weighted by similarity values;
⑤ the betweenness, closeness and eigenvector centrality of the node;
⑥ the PageRank score of the node;
(3) the class III features $X_3$ (or $X_3'$) of each disease-non-coding RNA pair include:
① the latent vectors of the non-coding RNAs and diseases obtained by factorizing matrix Y;
② the betweenness, closeness and eigenvector centrality of the node;
③ the PageRank score of the node;
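The class I features and part of the class II features listed above can be sketched as follows. This is a hedged illustration only: the function names, the similarity threshold used to obtain the unweighted graph, the values of `n_bins` and `k`, and the power-iteration PageRank are all assumptions not specified in the patent. Betweenness, closeness and eigenvector centrality are omitted and would normally come from a graph library.

```python
import numpy as np

def class1_features(Y_rows, S, n_bins=5):
    """Association count, mean similarity and similarity histogram per node."""
    Y_rows = np.asarray(Y_rows, dtype=float)
    S = np.asarray(S, dtype=float)
    counts = Y_rows.sum(axis=1, keepdims=True)
    mean_sim = S.mean(axis=1, keepdims=True)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    hist = np.stack([np.histogram(row, bins=edges)[0] for row in S]) / S.shape[1]
    return np.hstack([counts, mean_sim, hist])

def pagerank(W, damping=0.85, n_iter=100):
    """PageRank scores by power iteration on a non-negative weight matrix."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    col_sums = W.sum(axis=0)
    # Column-stochastic transitions; nodes without edges jump uniformly.
    P = np.where(col_sums > 0, W / np.where(col_sums > 0, col_sums, 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r = (1.0 - damping) / n + damping * P @ r
    return r

def class2_features(S, X1, k=2, threshold=0.5):
    """Neighbor count, k-NN similarities, k-NN means of class I features, PageRank."""
    S_off = np.asarray(S, dtype=float).copy()
    np.fill_diagonal(S_off, 0.0)  # ignore self-similarity
    # ASSUMPTION: the unweighted graph keeps edges with similarity above `threshold`.
    n_neighbors = (S_off > threshold).sum(axis=1, keepdims=True)
    knn_idx = np.argsort(-S_off, axis=1)[:, :k]
    knn_sim = np.take_along_axis(S_off, knn_idx, axis=1)
    knn_mean = X1[knn_idx].mean(axis=1)
    w = knn_sim / np.maximum(knn_sim.sum(axis=1, keepdims=True), 1e-12)
    knn_wmean = (w[:, :, None] * X1[knn_idx]).sum(axis=1)
    pr = pagerank(S_off)[:, None]
    return np.hstack([n_neighbors, knn_sim, knn_mean, knn_wmean, pr])

# Toy data: 3 diseases, 4 non-coding RNAs, and a hypothetical disease similarity matrix.
Y = np.array([[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]])
S = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]])
X1 = class1_features(Y, S)      # shape (3, 2 + n_bins)
X2 = class2_features(S, X1)     # shape (3, 2 + k + 2 * X1.shape[1])
```

For the non-coding RNA side, the same functions would be called with `Y.T` and the non-coding RNA similarity matrix.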
concatenating the three classes of disease features row by row to obtain the overall disease feature matrix, i.e. $X_d = [X_1\ X_2\ X_3]$;
concatenating the three classes of non-coding RNA features row by row to obtain the overall non-coding RNA feature matrix, i.e. $X_m = [X_1'\ X_2'\ X_3']$;
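The latent vectors and the concatenation into overall feature matrices can be sketched as below. The patent does not name the factorization applied to Y, so a truncated SVD is assumed here purely for illustration; `latent_vectors`, the chosen rank and the placeholder feature blocks are hypothetical.

```python
import numpy as np

def latent_vectors(Y, rank=2):
    """Latent vectors of diseases and non-coding RNAs from a truncated SVD of Y.

    ASSUMPTION: the patent does not specify the factorisation; SVD is one option.
    """
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    root = np.sqrt(s[:rank])
    disease_latent = U[:, :rank] * root  # nd x rank
    rna_latent = Vt[:rank].T * root      # nm x rank
    return disease_latent, rna_latent

# Toy association matrix: 3 diseases x 4 non-coding RNAs.
Y = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])
X3, X3_prime = latent_vectors(Y, rank=2)

# Placeholder class I / class II blocks (one row per disease or per non-coding RNA).
X1, X2 = np.ones((3, 2)), np.zeros((3, 3))
X1_prime, X2_prime = np.ones((4, 2)), np.zeros((4, 3))

# Each row keeps one entity; the feature blocks are placed side by side.
X_d = np.hstack([X1, X2, X3])                    # nd x (2 + 3 + rank)
X_m = np.hstack([X1_prime, X2_prime, X3_prime])  # nm x (2 + 3 + rank)
```

With the square roots of the singular values split between the two factors, the product of the disease and non-coding RNA latent matrices approximates Y, and reproduces it exactly at full rank.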
step four, constructing an objective function based on subspace learning that integrates Laplacian regularization and an $\ell_{2,1/2}$-norm constraint term;
step five, solving the objective function to obtain the mapping matrix G and the offset b for the non-coding RNA feature matrix $X_m$, and calculating the prediction score matrix $F_m$ of the non-coding RNA space:
Figure FDA0002238376070000021
step six, similarly calculating the prediction score matrix $F_d$ of the disease space;
step seven, linearly combining $F_m$ and $F_d$ to obtain the final prediction score matrix:
$F_{prediction} = \sigma F_m + (1 - \sigma) F_d$
wherein the combination coefficient σ can be further optimized by grid search;
and step eight, ranking the non-coding RNA-disease pairs by their calculated association scores to give the final prediction result.
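Steps seven and eight can be sketched as follows. The sketch assumes that both score matrices have already been arranged in the same orientation (diseases in rows, non-coding RNAs in columns), which may require transposing one of them; the function names, the scoring callback `score_fn` and the grid of σ values are hypothetical.

```python
import numpy as np

def combine_scores(F_m, F_d, sigma):
    """Linear combination sigma * F_m + (1 - sigma) * F_d."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    return sigma * np.asarray(F_m, dtype=float) + (1.0 - sigma) * np.asarray(F_d, dtype=float)

def grid_search_sigma(F_m, F_d, score_fn, grid=None):
    """Return the sigma on the grid that maximises a user-supplied validation score."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    return max(grid, key=lambda s: score_fn(combine_scores(F_m, F_d, s)))

def rank_candidates(F, Y, disease_idx, top_n=5):
    """Indices of not-yet-associated non-coding RNAs for one disease, best score first."""
    scores = np.asarray(F, dtype=float)[disease_idx]
    candidates = np.flatnonzero(np.asarray(Y)[disease_idx] == 0)
    order = candidates[np.argsort(-scores[candidates], kind="stable")]
    return order[:top_n]

# Toy example with one disease and four non-coding RNAs.
Y = np.array([[1, 0, 0, 0]])
F_m = np.array([[0.9, 0.1, 0.8, 0.5]])
F_d = np.array([[0.9, 0.3, 0.6, 0.3]])
F_prediction = combine_scores(F_m, F_d, sigma=0.5)
top = rank_candidates(F_prediction, Y, disease_idx=0)
```

In practice `score_fn` would be a validation metric such as the cross-validated AUROC of the combined matrix.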
2. The method for predicting associations between non-coding RNAs and diseases based on sparse subspace learning according to claim 1, wherein the objective function in step four is a subspace regression integrating Laplacian regularization and an $\ell_{2,1/2}$-norm constraint term, specifically:
Figure FDA0002238376070000031
wherein F denotes the association score matrix (unknown, to be solved), L is the Laplacian matrix of the diseases or non-coding RNAs (known, obtainable from $S_d$ or $S_m$), U is the decision matrix (known, set to the identity matrix), b is the offset (unknown, to be solved), $1_n$ is the all-ones vector, μ and λ are regularization coefficients (non-negative, further optimized by grid search), G is the mapping matrix (unknown, to be solved), tr(·) denotes the trace of a matrix, and $\lVert \cdot \rVert_F$ denotes the Frobenius norm of a matrix;
in the objective function, term 1 is the Laplacian regularization term, which captures the intrinsic manifold structure of the data; term 2 is the discrepancy term, which characterizes the difference between the predicted scores and the actual association matrix; term 3 is the subspace regression term, which links the feature matrix to the prediction score matrix and measures their difference with the Frobenius norm; and term 4 is the $\ell_{2,1/2}$-norm constraint term, which selects the most discriminative sparse features.
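The four terms described above can be evaluated numerically as in the following sketch. Because the objective function itself appears only as an image, the exact form of each term is an assumption inferred from the textual description: an unnormalized Laplacian L = D - S, a regression term of the form ‖XG + 1_n bᵀ - F‖_F², and the sum of square roots of the row norms of G as the ℓ2,1/2 penalty (one common convention). The function names are hypothetical, and the weighting of the terms by μ and λ is deliberately left out.

```python
import numpy as np

def graph_laplacian(S):
    """Unnormalised graph Laplacian L = D - S (ASSUMPTION: normalisation not stated)."""
    S = np.asarray(S, dtype=float)
    return np.diag(S.sum(axis=1)) - S

def l2_half_penalty(G):
    """Sum over rows of G of the square root of the row's Euclidean norm."""
    return np.sqrt(np.linalg.norm(G, axis=1)).sum()

def objective_terms(F, Y, X, G, b, S, U=None):
    """Return the four unweighted terms: smoothness, fit, regression, sparsity."""
    n = F.shape[0]
    U = np.eye(n) if U is None else U  # decision matrix, identity as in the claim
    L = graph_laplacian(S)
    ones = np.ones((n, 1))
    smooth = np.trace(F.T @ L @ F)                # term 1: manifold smoothness
    fit = np.trace((F - Y).T @ U @ (F - Y))       # term 2: deviation from known associations
    regress = np.linalg.norm(X @ G + ones @ b[None, :] - F, "fro") ** 2  # term 3
    sparse = l2_half_penalty(G)                   # term 4: row-sparsity of G
    return smooth, fit, regress, sparse

# Toy check with two nodes, two features and one score column.
S = np.array([[1.0, 0.5], [0.5, 1.0]])
F = Y = np.array([[1.0], [0.0]])
X = np.eye(2)
G = np.zeros((2, 1))
b = np.zeros(1)
terms = objective_terms(F, Y, X, G, b, S)
```

A solver would combine these terms with μ and λ as in the claimed formula and alternate updates of F, G and b until convergence; that procedure is not shown here.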
CN201910991283.3A 2019-10-18 2019-10-18 Non-coding RNA and disease associated prediction method based on sparse subspace learning Active CN110767263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910991283.3A CN110767263B (en) 2019-10-18 2019-10-18 Non-coding RNA and disease associated prediction method based on sparse subspace learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910991283.3A CN110767263B (en) 2019-10-18 2019-10-18 Non-coding RNA and disease associated prediction method based on sparse subspace learning

Publications (2)

Publication Number Publication Date
CN110767263A CN110767263A (en) 2020-02-07
CN110767263B true CN110767263B (en) 2022-12-06

Family

ID=69332660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910991283.3A Active CN110767263B (en) 2019-10-18 2019-10-18 Non-coding RNA and disease associated prediction method based on sparse subspace learning

Country Status (1)

Country Link
CN (1) CN110767263B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111599403B (en) * 2020-05-22 2023-03-14 电子科技大学 Parallel drug-target correlation prediction method based on sequencing learning
CN112820353B (en) * 2021-01-22 2023-10-03 中山大学 Method and system for analyzing cell fate conversion key transcription factors
CN116888671A (en) * 2021-09-29 2023-10-13 京东方科技集团股份有限公司 RNA-protein interaction prediction method, device, medium and electronic equipment
CN116583905B (en) * 2021-11-23 2024-05-10 染色质(北京)科技有限公司 Method for generating enhanced Hi-C matrix, method for identifying structural chromatin aberration in enhanced Hi-C matrix and readable medium
CN114944192B (en) * 2022-06-22 2023-06-30 湖南科技大学 Disease-related annular RNA identification method based on graph attention
CN115966252B (en) * 2023-02-12 2024-01-19 中国人民解放军总医院 Antiviral drug screening method based on L1norm diagram
CN117172294B (en) * 2023-11-02 2024-01-26 烟台大学 Method, system, equipment and storage medium for constructing sparse brain network
CN117936079A (en) * 2024-03-21 2024-04-26 中国人民解放军总医院第三医学中心 Manifold learning-based diabetic retinopathy identification method, medium and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109698029A (en) * 2018-12-24 2019-04-30 桂林电子科技大学 A kind of circRNA- disease association prediction technique based on network model
CN109935332A (en) * 2019-03-01 2019-06-25 桂林电子科技大学 A kind of miRNA- disease association prediction technique based on double random walk models

Also Published As

Publication number Publication date
CN110767263A (en) 2020-02-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221026

Address after: No.28, Fuxing Road, Haidian District, Beijing 100036

Applicant after: CHINESE PLA GENERAL Hospital

Address before: Chongqing city Shapingba street 400038 gaotanyan No. 30

Applicant before: THIRD MILITARY MEDICAL University

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230328

Address after: Chongqing city Shapingba street 400038 gaotanyan No. 30

Patentee after: THIRD MILITARY MEDICAL University

Address before: No.28, Fuxing Road, Haidian District, Beijing 100036

Patentee before: CHINESE PLA GENERAL Hospital