CN116052873B - Disease-metabolite association prediction system based on weight k-nearest neighbor - Google Patents
- Publication number
- CN116052873B (application number CN202310059889.XA)
- Authority
- CN
- China
- Prior art keywords
- matrix
- disease
- similarity
- metabolite
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
A disease-metabolite association prediction system based on weighted k-nearest neighbor, relating to the technical field of bioinformatics. The invention aims to solve the problem that existing methods for determining the relationship between metabolites and diseases have low prediction efficiency. The invention comprises: acquiring Jaccard similarity matrices, adaptive spectral clustering similarity matrices and Cosine similarity matrices between diseases and between metabolites; fusing the Jaccard similarity matrices with the adaptive spectral clustering similarity matrices to obtain a first disease similarity fusion matrix and a first metabolite similarity fusion matrix; fusing the Cosine similarity matrices with the first fusion matrices to obtain a second disease similarity fusion matrix and a second metabolite similarity fusion matrix; constructing a disease-metabolite first network; constructing a final prediction score matrix; and obtaining a prediction score for the disease and metabolite whose relationship is to be predicted. The invention is used for predicting associations between diseases and metabolites.
Description
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a disease-metabolite association prediction system based on weighted k-nearest neighbor.
Background
During long-term evolution, living organisms interact with their surrounding environment through a process of absorbing and releasing substances and energy, known as metabolism. As an essential life activity of the organism, metabolism plays a vital role in the transformation of substances and energy. A growing number of biological and medical experiments have shown that the concentrations of certain metabolites in some patients differ from those in healthy individuals. For example, deoxycholic acid is a secondary bile acid produced by the liver and recirculated through the liver, bile duct, small intestine and portal vein, forming the enterohepatic circuit. At physiological pH, such bile acids are strongly toxic in their anionic form, so a carrier is required to transport them across the membranes of intestinal and hepatic tissues. When the deoxycholic acid level is sufficiently high, it can act as a hepatotoxin, a metabolic toxin and a tumor metabolite. Hepatotoxins damage the liver or hepatocytes. When its level remains high for long periods, deoxycholic acid can promote tumor growth and survival. In addition to being associated with liver disease, long-term high levels of deoxycholic acid are also associated with a variety of cancers, such as colon cancer, breast cancer and many other cancers of the gastrointestinal tract. Furthermore, the pathogenesis of cardiovascular and cerebrovascular diseases and of some immune diseases has also been shown to be related to metabolites. Therefore, metabolite-based disease diagnosis is an important part of medical diagnosis.
Existing methods for determining the relationship between metabolites and diseases rely mainly on biological experiments. However, such experiments consume considerable human resources and a great deal of time, so existing methods have low prediction efficiency.
Disclosure of Invention
The invention aims to solve the problem that existing methods for determining the relationship between metabolites and diseases have low prediction efficiency, and provides a disease-metabolite association prediction system based on weighted k-nearest neighbor.
A disease-metabolite association prediction system based on weighted k-nearest neighbor comprises: a disease-metabolite association adjacency matrix acquisition module, a Jaccard similarity acquisition module, an adaptive spectral clustering similarity acquisition module, a first similarity fusion matrix acquisition module, a Cosine similarity acquisition module, a second similarity fusion matrix acquisition module, a disease-metabolite first network construction module, a final prediction score matrix construction module and a relevance acquisition module;
the disease-metabolite association adjacency matrix acquisition module is used for constructing an original disease-metabolite association bipartite network from the known disease-metabolite associations and establishing an association adjacency matrix $Y_{DM}$ from the original disease-metabolite association bipartite network;
the Jaccard similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, a Jaccard similarity matrix DJ between diseases and a Jaccard similarity matrix MJ between metabolites;
the adaptive spectral clustering similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, an adaptive spectral clustering similarity matrix DS between diseases and an adaptive spectral clustering similarity matrix MS between metabolites;
the first similarity fusion matrix acquisition module is used for fusing the Jaccard similarity matrix DJ between diseases with the adaptive spectral clustering similarity matrix DS between diseases to obtain a first disease similarity fusion matrix DJS, and fusing the Jaccard similarity matrix MJ between metabolites with the adaptive spectral clustering similarity matrix MS between metabolites to obtain a first metabolite similarity fusion matrix MJS;
the Cosine similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, a Cosine similarity matrix DC between diseases and a Cosine similarity matrix MC between metabolites;
the second similarity fusion matrix acquisition module is used for fusing the Cosine similarity matrix DC between diseases with the first disease similarity fusion matrix DJS to obtain a second disease similarity fusion matrix DJSC, and fusing the Cosine similarity matrix MC between metabolites with the first metabolite similarity fusion matrix MJS to obtain a second metabolite similarity fusion matrix MJSC;
the disease-metabolite first network construction module is used for constructing a disease-metabolite first network $Y_{new}$ by applying a weighted k-nearest neighbor algorithm to the original disease-metabolite association bipartite network, the second disease similarity fusion matrix DJSC and the second metabolite similarity fusion matrix MJSC;
the final prediction score matrix construction module is used for constructing a final prediction score matrix SNWKCP from the disease-metabolite first network, the second disease similarity fusion matrix DJSC and the second metabolite similarity fusion matrix MJSC;
the relevance acquisition module is used for looking up, in the final prediction score matrix SNWKCP, the prediction score of the disease and metabolite whose relationship is to be predicted, wherein a higher score indicates a stronger association between the disease and the metabolite;
the prediction score is in the range of 0 to 1.
A weight k-nearest neighbor based disease-metabolite association prediction storage medium for storing at least one instruction for implementing a weight k-nearest neighbor based disease-metabolite association prediction system.
The beneficial effects of the invention are as follows:
the invention uses the disease-metabolite adjacency matrix to carry out Jaccard similarity calculation, adaptive spectral clustering similarity calculation and Cosine similarity calculation on the diseases and on the metabolites, thereby obtaining disease-disease and metabolite-metabolite Jaccard similarity matrices, adaptive spectral clustering similarity matrices and Cosine similarity matrices; these similarity matrices are integrated to obtain a second disease similarity fusion matrix DJSC and a second metabolite similarity fusion matrix MJSC; a disease-metabolite first network is then calculated with a weighted k-nearest neighbor algorithm; and the final disease-metabolite association prediction score matrix is then calculated using vector projection. In this way the invention reveals hidden associations between diseases and metabolites. The invention obtains the relationship between a metabolite and a disease from the final score matrix, which avoids wasting human resources and time and improves prediction efficiency.
Drawings
FIG. 1 is a general flow chart for constructing a disease-metabolite association relationship;
FIG. 2 is a detailed process diagram of the construction of disease-metabolite associations;
FIG. 3 is a diagram of a matrix construction according to disease-metabolite associations;
FIG. 4 is a disease similarity matrix construction diagram calculated from a disease-metabolite correlation matrix;
FIG. 5 is a diagram of metabolite similarity matrix construction calculated from a disease-metabolite correlation matrix;
FIG. 6 is a ROC diagram of SNWKCP-DMA model under the 5-fold cross validation framework.
Detailed Description
First embodiment: as shown in FIGS. 1-2, the disease-metabolite association prediction system based on weighted k-nearest neighbor of the present embodiment comprises: a disease-metabolite association adjacency matrix acquisition module, a Jaccard similarity acquisition module, an adaptive spectral clustering similarity acquisition module, a first similarity fusion matrix acquisition module, a Cosine similarity acquisition module, a second similarity fusion matrix acquisition module, a disease-metabolite first network construction module, a final prediction score matrix construction module and a relevance acquisition module;
the disease-metabolite association adjacency matrix acquisition module is used for constructing an original disease-metabolite association bipartite network from the known disease-metabolite associations and establishing an association adjacency matrix $Y_{DM}$ from the original disease-metabolite association bipartite network;
the Jaccard similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, a Jaccard similarity matrix between diseases and a Jaccard similarity matrix between metabolites;
the adaptive spectral clustering similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, an adaptive spectral clustering similarity matrix between diseases and an adaptive spectral clustering similarity matrix between metabolites;
the first similarity fusion matrix acquisition module is used for fusing the Jaccard similarity matrix between diseases with the adaptive spectral clustering similarity matrix between diseases to obtain a first disease similarity fusion matrix, and fusing the Jaccard similarity matrix between metabolites with the adaptive spectral clustering similarity matrix between metabolites to obtain a first metabolite similarity fusion matrix;
the Cosine similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, a Cosine similarity matrix between diseases and a Cosine similarity matrix between metabolites;
the second similarity fusion matrix acquisition module is used for fusing the Cosine similarity matrix between diseases with the first disease similarity fusion matrix to obtain a second disease similarity fusion matrix, and fusing the Cosine similarity matrix between metabolites with the first metabolite similarity fusion matrix to obtain a second metabolite similarity fusion matrix;
the disease-metabolite first network construction module is used for constructing a disease-metabolite first network by applying a weighted k-nearest neighbor algorithm to the original disease-metabolite association bipartite network, the second disease similarity fusion matrix and the second metabolite similarity fusion matrix;
the final prediction score matrix construction module is used for constructing a final prediction score matrix from the disease-metabolite first network, the second disease similarity fusion matrix and the second metabolite similarity fusion matrix;
the relevance acquisition module is used for looking up, in the final prediction score matrix SNWKCP, the prediction score of the disease and metabolite whose relationship is to be predicted, wherein a higher score indicates a stronger association between the disease and the metabolite;
the prediction score is in the range of 0 to 1.
Second embodiment: the disease-metabolite association adjacency matrix acquisition module is used for constructing a bipartite network from the known disease-metabolite associations and establishing the association adjacency matrix $Y_{DM}$ from the bipartite network, as follows:
$$Y_{DM} = \{Y(i,j)\}_{r \times n}$$
where r represents the number of diseases, n represents the number of metabolites, and Y(i, j) represents the original disease-metabolite association bipartite network; see FIG. 3 for details.
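The construction of the adjacency matrix can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes that an entry is 1 when the disease-metabolite pair is a known association and 0 otherwise, and the function name `build_adjacency` and the toy identifiers are hypothetical.

```python
import numpy as np

def build_adjacency(pairs, diseases, metabolites):
    """Return an r x n 0/1 matrix from known (disease, metabolite) pairs.

    Assumption: Y[i, j] = 1 for a known association, 0 otherwise.
    """
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(metabolites)}
    Y = np.zeros((len(diseases), len(metabolites)))
    for d, m in pairs:
        Y[d_index[d], m_index[m]] = 1.0
    return Y

# Hypothetical toy data, for illustration only.
diseases = ["d1", "d2", "d3"]
metabolites = ["m1", "m2", "m3", "m4"]
pairs = [("d1", "m1"), ("d1", "m2"), ("d2", "m2"), ("d3", "m4")]
Y_DM = build_adjacency(pairs, diseases, metabolites)
```

Each row of `Y_DM` then describes one disease and each column one metabolite, which is the layout the later similarity calculations rely on.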
Third embodiment: the Jaccard similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, the Jaccard similarity between diseases and the Jaccard similarity between metabolites, comprising the following steps:
as shown in FIG. 4, the Jaccard similarity between diseases is calculated as follows. First, two row vectors $Y(d_i)$ and $Y(d_{i'})$ of $Y_{DM}$ are taken. Then the number of metabolites associated with both diseases and the number of metabolites associated with at least one of them are calculated, which gives the similarity between disease $d_i$ and disease $d_{i'}$;
the element $DJ(d_i, d_{i'})$ of the disease-disease Jaccard similarity matrix DJ is given by:
$$DJ(d_i, d_{i'}) = \frac{|Y(d_i) \cap Y(d_{i'})|}{|Y(d_i) \cup Y(d_{i'})|}$$
where $d_i$ and $d_{i'}$ denote two different diseases, and $Y(d_i)$ and $Y(d_{i'})$ denote the sets of metabolites associated with disease $d_i$ and disease $d_{i'}$, respectively.
As shown in FIG. 5, the Jaccard similarity between metabolites is calculated as follows. First, two column vectors $Y(m_{j'})$ and $Y(m_j)$ of $Y_{DM}$ are taken. Then the number of diseases associated with both metabolites and the number of diseases associated with at least one of them are calculated, which gives the similarity between metabolite $m_{j'}$ and metabolite $m_j$;
the element $MJ(m_j, m_{j'})$ of the metabolite-metabolite Jaccard similarity matrix MJ is given by:
$$MJ(m_j, m_{j'}) = \frac{|Y(m_j) \cap Y(m_{j'})|}{|Y(m_j) \cup Y(m_{j'})|}$$
where $m_{j'}$ and $m_j$ denote two different metabolites, and $Y(m_{j'})$ and $Y(m_j)$ denote the sets of diseases associated with metabolite $m_{j'}$ and metabolite $m_j$, respectively.
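The Jaccard calculation described above can be sketched with matrix products on a 0/1 adjacency matrix. The function name `jaccard_similarity` and the toy matrix are hypothetical, and the choice to return 0 when both vectors are empty is an assumption that the text does not address.

```python
import numpy as np

def jaccard_similarity(Y):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix."""
    B = (np.asarray(Y) > 0).astype(float)
    inter = B @ B.T                          # size of the intersection per row pair
    sizes = B.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    # Assumption: pairs with an empty union get similarity 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

# Hypothetical toy adjacency matrix (3 diseases x 4 metabolites).
Y_DM = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)
DJ = jaccard_similarity(Y_DM)      # disease-disease, computed on rows
MJ = jaccard_similarity(Y_DM.T)    # metabolite-metabolite, computed on columns
```

Transposing the adjacency matrix lets the same row-wise routine serve both the disease and the metabolite case.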
Fourth embodiment: the adaptive spectral clustering similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, the adaptive spectral clustering similarity between diseases and the adaptive spectral clustering similarity between metabolites, specifically:
as shown in FIG. 4, the adaptive spectral clustering similarity between diseases is calculated as follows. First, two row vectors $Y(d_i)$ and $Y(d_{i'})$ of $Y_{DM}$ are taken. Then the fully connected Euclidean distances are calculated, the scale $\sigma$ of each point is obtained with KNN from its K-th nearest point in Euclidean distance, and finally the similarity matrix is constructed;
the element $DS(d_i, d_{i'})$ of the disease-disease adaptive spectral clustering similarity matrix DS is given by the following formula:
$$\delta_x = \lVert Y(d_x) - Y(d_{xK}) \rVert$$
where $Y(d_{xK})$ is the K-th nearest neighbor of the sample point $Y(d_x)$; $Y(d_i)$ and $Y(d_{i'})$ denote the i-th and i'-th row vectors of the matrix $Y_{DM}$; x is i or i'; K is a constant greater than 0; $\delta_x$ is an intermediate variable; and i and i' are row indices of $Y_{DM}$.
As shown in FIG. 5, the adaptive spectral clustering similarity between metabolites is calculated as follows. First, two column vectors $Y(m_{j'})$ and $Y(m_j)$ of $Y_{DM}$ are taken. Then the fully connected Euclidean distances are calculated, the scale $\sigma$ of each point is obtained with KNN from its K-th nearest point in Euclidean distance, and finally the similarity matrix is constructed, specifically:
the element $MS(m_j, m_{j'})$ of the metabolite-metabolite adaptive spectral clustering similarity matrix MS is given by the following formula:
$$\delta_{x'} = \lVert Y(m_{x'}) - Y(m_{x'K}) \rVert$$
where $Y(m_{x'K})$ is the K-th nearest neighbor of the sample point $Y(m_{x'})$; $Y(m_{j'})$ and $Y(m_j)$ denote the j'-th and j-th column vectors of the matrix $Y_{DM}$; x' is j or j'; and j and j' are column indices of $Y_{DM}$.
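A hedged sketch of a locally scaled similarity of this kind is given below. The text defines only the local scale as the distance to the K-th nearest neighbor; the Gaussian kernel form exp(-dist² / (δ_i · δ_i')) used here is an assumption borrowed from self-tuning spectral clustering, not something stated in the text, and the function name is hypothetical.

```python
import numpy as np

def adaptive_spectral_similarity(Y, K=1):
    """Locally scaled Gaussian similarity between the rows of Y.

    delta[x] is the Euclidean distance from row x to its K-th nearest
    neighbour (K must be smaller than the number of rows).
    Assumed kernel: exp(-dist^2 / (delta_i * delta_j)).
    """
    Y = np.asarray(Y, dtype=float)
    diff = Y[:, None, :] - Y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))      # fully connected distances
    delta = np.sort(dist, axis=1)[:, K]          # index 0 is the row itself
    scale = np.outer(delta, delta)
    # Assumption: when a local scale is zero, only identical rows are similar.
    sim = (dist == 0).astype(float)
    ok = scale > 0
    sim[ok] = np.exp(-dist[ok] ** 2 / scale[ok])
    return sim

# Hypothetical toy adjacency matrix (3 diseases x 4 metabolites).
Y_DM = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)
DS = adaptive_spectral_similarity(Y_DM, K=1)     # disease-disease
MS = adaptive_spectral_similarity(Y_DM.T, K=1)   # metabolite-metabolite
```

Because each point carries its own scale, dense and sparse regions of the association data are compared on a comparable footing, which is the usual motivation for local scaling.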
Fifth embodiment: the first similarity fusion matrix acquisition module is used for fusing the Jaccard similarity matrix between diseases with the adaptive spectral clustering similarity matrix between diseases to obtain a first disease similarity fusion matrix, and fusing the Jaccard similarity matrix between metabolites with the adaptive spectral clustering similarity matrix between metabolites to obtain a first metabolite similarity fusion matrix, specifically:
if the value $DJ(d_i, d_{i'})$ of the Jaccard similarity matrix obtained from the disease-metabolite association matrix $Y_{DM}$ is 0, it is filled directly with the value $DS(d_i, d_{i'})$ of the adaptive spectral clustering similarity matrix obtained from $Y_{DM}$; otherwise, the two values are added and averaged to give the new value.
The element $DJS(d_i, d_{i'})$ of the first disease similarity fusion matrix DJS is given by:
$$DJS(d_i, d_{i'}) = \begin{cases} DS(d_i, d_{i'}), & DJ(d_i, d_{i'}) = 0 \\ \dfrac{DJ(d_i, d_{i'}) + DS(d_i, d_{i'})}{2}, & \text{otherwise} \end{cases}$$
where $DJ(d_i, d_{i'})$ is the Jaccard similarity matrix obtained from the disease-metabolite association matrix $Y_{DM}$, and $DS(d_i, d_{i'})$ is the adaptive spectral clustering similarity matrix obtained from $Y_{DM}$.
If the value $MJ(m_j, m_{j'})$ of the Jaccard similarity matrix obtained from the disease-metabolite association matrix $Y_{DM}$ is 0, it is filled directly with the value $MS(m_j, m_{j'})$ of the adaptive spectral clustering similarity matrix obtained from $Y_{DM}$; otherwise, the two values are added and averaged to give the new value.
The element $MJS(m_j, m_{j'})$ of the first metabolite similarity fusion matrix MJS is given by:
$$MJS(m_j, m_{j'}) = \begin{cases} MS(m_j, m_{j'}), & MJ(m_j, m_{j'}) = 0 \\ \dfrac{MJ(m_j, m_{j'}) + MS(m_j, m_{j'})}{2}, & \text{otherwise} \end{cases}$$
where $MJ(m_j, m_{j'})$ is the Jaccard similarity matrix obtained from the disease-metabolite association matrix $Y_{DM}$, and $MS(m_j, m_{j'})$ is the adaptive spectral clustering similarity matrix obtained from $Y_{DM}$.
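The fill-or-average rule of this embodiment can be written as a single element-wise selection. The sketch below uses hypothetical toy values for the metabolite matrices; the function name `fuse_first_stage` is not from the source.

```python
import numpy as np

def fuse_first_stage(jaccard, spectral):
    """Where the Jaccard entry is 0 take the spectral entry, else average both."""
    jaccard = np.asarray(jaccard, dtype=float)
    spectral = np.asarray(spectral, dtype=float)
    return np.where(jaccard == 0, spectral, (jaccard + spectral) / 2.0)

# Hypothetical toy similarity matrices for three metabolites.
MJ = np.array([[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.0],
               [0.5, 0.0, 1.0]])
MS = np.array([[1.0, 0.2, 0.3],
               [0.2, 1.0, 0.1],
               [0.3, 0.1, 1.0]])
MJS = fuse_first_stage(MJ, MS)
```

The effect is that pairs with no shared associations, which Jaccard scores as 0, still receive a non-trivial similarity from the spectral clustering matrix.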
Sixth embodiment: the Cosine similarity acquisition module is used for acquiring, according to the association adjacency matrix $Y_{DM}$, a Cosine similarity matrix between diseases and a Cosine similarity matrix between metabolites, specifically:
as shown in FIG. 4, the Cosine similarity between diseases is calculated as follows. First, two row vectors $Y(d_i)$ and $Y(d_{i'})$ of $Y_{DM}$ are taken. Then the angle between them is obtained, and the cosine of that angle is used to represent the similarity of the two vectors. The smaller the angle, the closer the cosine value is to 1, the more closely the directions of the two vectors agree, and the more similar they are.
The element $DC(d_i, d_{i'})$ of the disease-disease Cosine similarity matrix DC is given by:
$$DC(d_i, d_{i'}) = \frac{Y(d_i) \cdot Y(d_{i'})}{\lVert Y(d_i) \rVert \, \lVert Y(d_{i'}) \rVert}$$
As shown in FIG. 5, the Cosine similarity between metabolites is calculated as follows. First, two column vectors $Y(m_{j'})$ and $Y(m_j)$ of $Y_{DM}$ are taken. Then the angle between them is obtained, and the cosine of that angle is used to represent the similarity of the two vectors. The smaller the angle, the closer the cosine value is to 1, the more closely the directions of the two vectors agree, and the more similar they are.
The element $MC(m_j, m_{j'})$ of the metabolite-metabolite Cosine similarity matrix MC is given by:
$$MC(m_j, m_{j'}) = \frac{Y(m_j) \cdot Y(m_{j'})}{\lVert Y(m_j) \rVert \, \lVert Y(m_{j'}) \rVert}$$
where $m_{j'}$ and $m_j$ denote two different metabolites, and $Y(m_{j'})$ and $Y(m_j)$ denote the column vectors of $Y_{DM}$ that record the diseases associated with metabolite $m_{j'}$ and metabolite $m_j$, respectively.
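The cosine calculation can be sketched as follows. The function name and toy matrix are hypothetical, and returning 0 for an all-zero vector is an assumption, since the cosine is undefined in that case and the text does not say how it is handled.

```python
import numpy as np

def cosine_similarity(Y):
    """Pairwise cosine similarity between the rows of Y."""
    Y = np.asarray(Y, dtype=float)
    dots = Y @ Y.T
    norms = np.linalg.norm(Y, axis=1)
    denom = np.outer(norms, norms)
    # Assumption: pairs involving an all-zero vector get similarity 0.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Hypothetical toy adjacency matrix (3 diseases x 4 metabolites).
Y_DM = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)
DC = cosine_similarity(Y_DM)      # disease-disease, computed on rows
MC = cosine_similarity(Y_DM.T)    # metabolite-metabolite, computed on columns
```

On 0/1 vectors the cosine value depends on the geometric mean of the two set sizes rather than on the union, so it ranks pairs differently from Jaccard, which is presumably why both are fused.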
Seventh embodiment: the second similarity fusion matrix acquisition module is used for fusing the Cosine similarity matrix between diseases with the first disease similarity fusion matrix to obtain a second disease similarity fusion matrix, and fusing the Cosine similarity matrix between metabolites with the first metabolite similarity fusion matrix to obtain a second metabolite similarity fusion matrix, specifically:
if the value $DC(d_i, d_{i'})$ of the Cosine similarity matrix obtained from the disease-metabolite association matrix $Y_{DM}$ is 0, it is filled directly with the value $DJS(d_i, d_{i'})$ of the previous fusion similarity matrix. Otherwise, the new similarity value is obtained from $DC(d_i, d_{i'})$ and the previous fusion similarity value $DJS(d_i, d_{i'})$.
The element $DJSC(d_i, d_{i'})$ of the second disease similarity fusion matrix DJSC is given by the following formula:
if the value $MC(m_j, m_{j'})$ of the Cosine similarity matrix obtained from the disease-metabolite association matrix $Y_{DM}$ is 0, it is filled directly with the value $MJS(m_j, m_{j'})$ of the previous fusion similarity matrix. Otherwise, the new similarity value is obtained from $MC(m_j, m_{j'})$ and the previous fusion similarity value $MJS(m_j, m_{j'})$.
The element $MJSC(m_j, m_{j'})$ of the second metabolite similarity fusion matrix MJSC is given by the following formula:
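The two fusion stages can be chained as sketched below. The text does not spell out how the Cosine value and the previous fusion value are combined when the Cosine value is non-zero; this sketch assumes the same averaging as in the first stage, which is a guess by analogy. The helper name and toy values are hypothetical.

```python
import numpy as np

def fill_or_average(primary, fallback):
    """Use fallback where primary is 0; otherwise average the two (assumed rule)."""
    primary = np.asarray(primary, dtype=float)
    fallback = np.asarray(fallback, dtype=float)
    return np.where(primary == 0, fallback, (primary + fallback) / 2.0)

# Hypothetical toy similarity matrices for two diseases.
DJ = np.array([[1.0, 0.0], [0.0, 1.0]])   # Jaccard
DS = np.array([[1.0, 0.4], [0.4, 1.0]])   # adaptive spectral clustering
DC = np.array([[1.0, 0.6], [0.6, 1.0]])   # Cosine

DJS = fill_or_average(DJ, DS)     # first stage: Jaccard is the primary matrix
DJSC = fill_or_average(DC, DJS)   # second stage: Cosine is the primary matrix
```

Note the change of roles between stages: the matrix that is tested for zeros is the Jaccard matrix in the first stage and the Cosine matrix in the second.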
Seventh embodiment: the disease-metabolite first network construction module is used for constructing a disease-metabolite first network by applying a weighted k-nearest neighbor algorithm to the original disease-metabolite association bipartite network, the second disease similarity fusion matrix and the second metabolite similarity fusion matrix, specifically:
$$Y_{new} = \max(Y_{DM}, Y'_{new})$$
where $Y(d_i)$ is the i-th row of the matrix $Y_{DM}$, $Y(m_j)$ is the j-th column of the matrix $Y_{DM}$, $N(d_i)$ denotes the N nearest neighbors of disease $d_i$, $N(m_j)$ denotes the N' nearest neighbors of metabolite $m_j$, and $Y'_{new}$, $\xi_d$ and $\xi_m$ are intermediate variables.
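One common way to realize a weighted k-nearest neighbor completion of this kind is sketched below. Only the final step, the element-wise maximum with the original matrix, is taken from the text. The neighbor weighting with a decay factor, the normalization by the sum of neighbor similarities (standing in for the ξ terms) and the averaging of the disease-side and metabolite-side estimates are all assumptions modelled on weighted-neighbor profile completion methods; every name in the code is hypothetical.

```python
import numpy as np

def wknn_complete(Y, S_d, S_m, k=2, decay=0.5):
    """Fill in association profiles from similar diseases and metabolites."""
    Y = np.asarray(Y, dtype=float)

    def propagate(A, S):
        # Rows of A are entities; S holds similarities between those rows.
        out = np.zeros_like(A)
        for i in range(A.shape[0]):
            sims = S[i].copy()
            sims[i] = -np.inf                              # exclude the entity itself
            nbrs = np.argsort(-sims)[:min(k, len(sims) - 1)]
            weights = decay ** np.arange(len(nbrs)) * S[i, nbrs]
            norm = S[i, nbrs].sum()                        # assumed normaliser
            if norm > 0:
                out[i] = weights @ A[nbrs] / norm
        return out

    Y_d = propagate(Y, np.asarray(S_d, dtype=float))       # disease-side estimate
    Y_m = propagate(Y.T, np.asarray(S_m, dtype=float)).T   # metabolite-side estimate
    Y_prime = (Y_d + Y_m) / 2.0                            # assumed combination
    return np.maximum(Y, Y_prime)                          # Y_new = max(Y_DM, Y'_new)

# Hypothetical toy inputs.
Y_DM = np.array([[1, 1, 0], [0, 0, 0], [0, 0, 1]], dtype=float)
DJSC = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
MJSC = np.eye(3)
Y_new = wknn_complete(Y_DM, DJSC, MJSC, k=1)
```

The maximum guarantees that known associations are never weakened, while a disease with no known associations (the second row here) inherits a soft profile from its most similar neighbors.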
Eighth embodiment: the final prediction score matrix module is used for constructing a final prediction score matrix by using a disease-metabolite first network, a second disease similarity fusion matrix and a second metabolite similarity fusion matrix, and specifically comprises the following steps:
calculating a final predictive score matrix SNWKCP for both dqsc and MJSC, while SNWKCP (d i ,m j’ ) The value of (d) is in the range of 0 to 1, wherein SNWKCP (d) i ,m j’ ) The method comprises the following steps:
wherein DSNWKCP (d) i ,m j’ ) And MSNWKCP (d) i ,m j’ ) The method comprises the following steps:
wherein,is DJSC d i Go (go)/(go)>Is Y new Is the m < th > of j’ Column (S)/(S)>Is vector->Length of->Is MJSC mth j’ Column (S)/(S)>Is Y new D of (2) i Go (go)/(go)>Is vector->SNWKCP, SNWKCP is a scoring matrix calculated based on similarity between diseases, a scoring matrix calculated based on metabolite similarity.
The matrix SNWKCP is the final vector projection score matrix over the disease space and the metabolite space; each value in the matrix is the final score of one disease-metabolite pair. The final score is used to predict disease-metabolite association: the higher the score, the stronger the predicted association.
Ninth embodiment: a storage medium for disease-metabolite association prediction based on weight k-nearest neighbor, the storage medium storing at least one instruction for implementing the disease-metabolite association prediction system based on weight k-nearest neighbor.
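The vector projection scoring can be illustrated with the sketch below. The exact formulas are not recoverable from the text, so the expressions used here are assumptions chosen to match the quantities it names (rows of DJSC, columns of MJSC, rows and columns of Y_new, vector lengths) and the stated 0 to 1 range: each similarity vector is projected onto the matching association vector, and the sum of the two projections is divided by the sum of the two similarity vector lengths. The helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def length(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def projection_scores(y_new: Matrix, disease_sim: Matrix,
                      metabolite_sim: Matrix) -> Matrix:
    """Score every disease-metabolite pair by two vector projections.

    ASSUMPTION (not taken from the source formulas):
      disease part    = disease_sim[i] . y_new[:, j] / |y_new[:, j]|
      metabolite part = metabolite_sim[j] . y_new[i, :] / |y_new[i, :]|
      score           = (disease part + metabolite part)
                        / (|disease_sim[i]| + |metabolite_sim[j]|)
    """
    n_d, n_m = len(y_new), len(y_new[0])
    columns = [list(col) for col in zip(*y_new)]
    scores = [[0.0] * n_m for _ in range(n_d)]
    for i in range(n_d):
        row_len = length(y_new[i])
        for j in range(n_m):
            col_len = length(columns[j])
            d_part = dot(disease_sim[i], columns[j]) / col_len if col_len > 0 else 0.0
            m_part = dot(metabolite_sim[j], y_new[i]) / row_len if row_len > 0 else 0.0
            denom = length(disease_sim[i]) + length(metabolite_sim[j])
            scores[i][j] = (d_part + m_part) / denom if denom > 0 else 0.0
    return scores


# Toy example with two diseases and two metabolites.
Y_new = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
SNWKCP = projection_scores(Y_new, identity, identity)
```

With non-negative inputs, the Cauchy-Schwarz inequality bounds each projection by the length of the similarity vector, so scores computed this way stay between 0 and 1.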
Examples: in order to verify the beneficial effects of the invention, the following tests were performed:
The performance of the invention was evaluated with 5-fold cross-validation (5-fold CV). The ROC curve obtained under 5-fold CV is shown in FIG. 6, and the comparison of the AUC with other models is shown in Table 1.
From the prediction results, the top 15 predicted metabolites for three diseases (obesity, colorectal cancer and lung cancer) were validated against other known datasets; the validation results are shown in Tables 2, 3 and 4.
On the same dataset, the AUC values obtained by the SNWKCP-DMA model and by the other models under the 5-fold CV framework are shown in Table 1:
TABLE 1
Method | AUC |
---|---|
MCF | 0.6156 |
WMAN | 0.6181 |
PROFANCY | 0.9027 |
MN-LMF | 0.9659 |
SNWKCP-DMA | 0.9819 |
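For orientation, the sketch below shows one way a 5-fold CV split and an AUC value can be computed. The source does not describe how the folds were built or how the AUC was calculated, so splitting the known disease-metabolite pairs into five held-out groups and using the rank-based definition of the AUC are assumptions of this sketch. The function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def five_fold_splits(pairs: Sequence[Pair], seed: int = 0) -> List[List[Pair]]:
    """Shuffle the known (disease, metabolite) pairs and deal them into 5 folds.

    ASSUMPTION: each fold is held out in turn and scored after retraining on the rest.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[fold::5] for fold in range(5)]


def auc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Probability that a held-out positive outranks a negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy usage: 12 known pairs split into folds, and an AUC on made-up scores.
known_pairs = [(d, m) for d in range(4) for m in range(3)]
folds = five_fold_splits(known_pairs)
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

The scores in the usage lines are invented for illustration and are unrelated to the values in Table 1.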
The top 15 metabolites associated with obesity are shown in Table 2:
TABLE 2
The top 15 metabolites associated with colorectal cancer are shown in Table 3:
TABLE 3
The top 15 metabolites associated with lung cancer are shown in Table 4:
TABLE 4
The invention starts from the known disease-metabolite adjacency matrix and computes several similarities for diseases and for metabolites: Jaccard similarity, adaptive spectral clustering similarity and Cosine similarity. This yields Jaccard, adaptive spectral clustering and Cosine similarity matrices for disease-disease pairs and for metabolite-metabolite pairs. The similarity matrices are integrated into a new disease-disease similarity matrix DJSC and a new metabolite-metabolite similarity matrix MJSC. A weighted k-nearest neighbor algorithm is then used to obtain a new disease-metabolite association network, and vector projection is used to obtain the final prediction score matrix SNWKCP. In this way, unknown associations hidden in the data are revealed through relationships among several aspects of the data. Fusing multiple similarities with the weighted k-nearest neighbor algorithm enriches the data, and combining the two vector projections improves the results. In the 5-fold CV experiment the method reached an AUC of 0.9819, higher than the other models listed in Table 1, and the prediction results indicate that the association method has a degree of reliability.
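The three similarity measures named above can be sketched on rows of a small association matrix as follows. Jaccard and Cosine similarity use their standard definitions. For the adaptive spectral clustering similarity, the text defines only the local scale (the distance to the K-th neighbor); the Gaussian kernel exp(-||a - b||^2 / (δ_i δ_j)) used here is an assumption borrowed from self-tuning spectral clustering. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def jaccard(a: Vector, b: Vector) -> float:
    """Shared non-zero positions divided by positions non-zero in either profile."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def cosine(a: Vector, b: Vector) -> float:
    """Dot product divided by the product of the two vector lengths."""
    product = sum(x * y for x, y in zip(a, b))
    lengths = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return product / lengths if lengths else 0.0


def distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_scale(profiles: List[Vector], x: int, k: int) -> float:
    """delta_x: distance from profile x to its K-th nearest other profile."""
    dists = sorted(distance(profiles[x], p)
                   for idx, p in enumerate(profiles) if idx != x)
    return dists[k - 1]


def spectral_similarity(profiles: List[Vector], i: int, j: int, k: int) -> float:
    """ASSUMPTION: self-tuning Gaussian kernel scaled by delta_i * delta_j."""
    scale = local_scale(profiles, i, k) * local_scale(profiles, j, k)
    squared = distance(profiles[i], profiles[j]) ** 2
    if scale == 0:
        return 1.0 if squared == 0 else 0.0
    return math.exp(-squared / scale)


# Rows are diseases, columns are metabolites (toy data).
Y_DM = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
DJ_01 = jaccard(Y_DM[0], Y_DM[1])
DC_01 = cosine(Y_DM[0], Y_DM[1])
DS_01 = spectral_similarity(Y_DM, 0, 1, k=1)
```

Passing the columns of Y_DM instead of its rows gives the corresponding metabolite-metabolite similarities.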
Claims (2)
1. A disease-metabolite association prediction system based on weight k-nearest neighbor, characterized in that the system comprises: a disease-metabolite correlation adjacency matrix acquisition module, a Jaccard similarity acquisition module, an adaptive spectral clustering similarity acquisition module, a first similarity fusion matrix acquisition module, a Cosine similarity acquisition module, a second similarity fusion matrix acquisition module, a disease-metabolite first network construction module, a final prediction score matrix construction module and a correlation acquisition module;
the disease-metabolite correlation adjacency matrix acquisition module is used for constructing an original disease-metabolite association bipartite network according to known disease-metabolite association relationships, and establishing a correlation adjacency matrix Y_DM from the original disease-metabolite association bipartite network;
the correlation adjacency matrix Y_DM established from the original disease-metabolite association bipartite network is given by the following formula:
$$Y_{DM} = \{Y(i,j)\}_{r \times n}$$
wherein r represents the number of kinds of diseases, and n represents the number of kinds of metabolites;
the original disease-metabolite association bipartite network has the formula:
wherein Y(i, j) is the original disease-metabolite association bipartite network;
the Jaccard similarity acquisition module is used for acquiring, from the correlation adjacency matrix Y_DM, a Jaccard similarity matrix DJ between diseases and a Jaccard similarity matrix MJ between metabolites;
the element DJ(d_i, d_i') of the inter-disease Jaccard similarity matrix DJ and the element MJ(m_j, m_j') of the inter-metabolite Jaccard similarity matrix MJ are obtained by the following formula:
wherein d_i and d_i' represent two different diseases, m_j and m_j' are two different metabolites, Y(d_i) and Y(d_i') are row vectors of Y_DM, and Y(m_j) and Y(m_j') are column vectors of Y_DM;
the adaptive spectral clustering similarity acquisition module is used for acquiring, from the correlation adjacency matrix Y_DM, an adaptive spectral clustering similarity matrix DS between diseases and an adaptive spectral clustering similarity matrix MS between metabolites;
the element DS(d_i, d_i') of the inter-disease adaptive spectral clustering similarity matrix DS and the element MS(m_j, m_j') of the inter-metabolite adaptive spectral clustering similarity matrix MS are obtained by the following formula:
$$\delta_x = \lVert Y(d_x) - Y(d_{xK}) \rVert$$
$$\delta_{x'} = \lVert Y(m_{x'}) - Y(m_{x'K}) \rVert$$
wherein Y(d_xK) is the K-th neighbor point of Y(d_x), Y(m_x'K) is the K-th neighbor point of Y(m_x'), x takes i or i', x' takes j or j', δ_x and δ_x' are intermediate variables, i and i' are row vector indices of Y_DM, j and j' are column vector indices of Y_DM, and K is a constant greater than 0;
the first similarity fusion matrix acquisition module is used for fusing the Jaccard similarity matrix DJ between diseases with the adaptive spectral clustering similarity matrix DS between diseases to obtain a first disease similarity fusion matrix DJS, and fusing the Jaccard similarity matrix MJ between metabolites with the adaptive spectral clustering similarity matrix MS between metabolites to obtain a first metabolite similarity fusion matrix MJS;
the element DJS(d_i, d_i') of the first disease similarity fusion matrix DJS and the element MJS(m_j, m_j') of the first metabolite similarity fusion matrix MJS are given by the following formula:
wherein d_i and d_i' represent two different diseases, m_j and m_j' are two different metabolites, DJ(d_i, d_i') is an element of the Jaccard similarity matrix DJ between diseases, MJ(m_j, m_j') is an element of the Jaccard similarity matrix MJ between metabolites, DS(d_i, d_i') is an element of the inter-disease adaptive spectral clustering similarity matrix DS, and MS(m_j, m_j') is an element of the adaptive spectral clustering similarity matrix MS between metabolites;
the Cosine similarity acquisition module is used for acquiring, from the correlation adjacency matrix Y_DM, a Cosine similarity matrix DC between diseases and a Cosine similarity matrix MC between metabolites;
the element DC(d_i, d_i') of the inter-disease Cosine similarity matrix DC and the element MC(m_j, m_j') of the inter-metabolite Cosine similarity matrix MC are given by the following formula:
the second similarity fusion matrix acquisition module is used for fusing the Cosine similarity matrix DC between diseases with the first disease similarity fusion matrix DJS to obtain a second disease similarity fusion matrix DJSC, and fusing the Cosine similarity matrix MC between metabolites with the first metabolite similarity fusion matrix MJS to obtain a second metabolite similarity fusion matrix MJSC;
the element DJSC(d_i, d_i') of the second disease similarity fusion matrix DJSC and the element MJSC(m_j, m_j') of the second metabolite similarity fusion matrix MJSC are given by the following formula:
wherein DC(d_i, d_i') is an element of the Cosine similarity matrix DC between diseases, and MC(m_j, m_j') is an element of the Cosine similarity matrix MC between metabolites;
the disease-metabolite first network construction module is used for constructing a disease-metabolite first network Y_new with a weighted k-nearest neighbor algorithm, using the original disease-metabolite association bipartite network, the second disease similarity fusion matrix DJSC and the second metabolite similarity fusion matrix MJSC, according to the following formula:
$$Y_{new} = \max(Y_{DM}, Y'_{new})$$
wherein the formulas use the i-th row and the j-th column of the matrix Y_DM, N(d_i) denotes the neighbors of disease d_i, N(m_j) denotes the neighbors of metabolite m_j, and Y'_new, ξ_d and ξ_m are intermediate variables;
the final prediction score matrix construction module is used for constructing a final prediction score matrix SNWKCP using the disease-metabolite first network, the second disease similarity fusion matrix DJSC and the second metabolite similarity fusion matrix MJSC;
the element SNWKCP(d_i, m_j') of the final prediction score matrix SNWKCP is given by the following formula:
wherein the disease-side term uses the d_i-th row of DJSC, the m_j'-th column of Y_new and a vector length, the metabolite-side term uses the m_j'-th column of MJSC, the d_i-th row of Y_new and a vector length, and DSNWKCP and MSNWKCP are intermediate matrices;
the correlation acquisition module is used for looking up, in the final prediction score matrix SNWKCP, the prediction score of the disease-metabolite pair whose relation is to be predicted, wherein a higher score indicates a higher correlation between the disease and the metabolite;
the predictive score is in the range of 0 to 1.
2. A disease-metabolite association prediction storage medium based on weight k-nearest neighbor, characterized in that: the storage medium is for storing at least one instruction for implementing the weight k-nearest neighbor based disease-metabolite association prediction system of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310059889.XA CN116052873B (en) | 2023-01-18 | 2023-01-18 | Disease-metabolite association prediction system based on weight k-nearest neighbor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116052873A CN116052873A (en) | 2023-05-02 |
CN116052873B CN116052873B (en) | 2024-01-26 |
Family
ID=86132918
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310059889.XA Active CN116052873B (en) | 2023-01-18 | 2023-01-18 | Disease-metabolite association prediction system based on weight k-nearest neighbor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116052873B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||