WO2020056389A1 - Multimodal signatures and use thereof in the diagnosis and prognosis of diseases - Google Patents

Multimodal signatures and use thereof in the diagnosis and prognosis of diseases

Info

Publication number
WO2020056389A1
WO2020056389A1 PCT/US2019/051193 US2019051193W WO2020056389A1 WO 2020056389 A1 WO2020056389 A1 WO 2020056389A1 US 2019051193 W US2019051193 W US 2019051193W WO 2020056389 A1 WO2020056389 A1 WO 2020056389A1
Authority
WO
WIPO (PCT)
Prior art keywords
subject
modalities
features
modality
network
Application number
PCT/US2019/051193
Other languages
French (fr)
Inventor
Naisha SHAH
Ilan SHOMORONY
Elizabeth Cirulli Rogers
Ewen Frisken KIRKNESS
Original Assignee
Human Longevity, Inc.
Application filed by Human Longevity, Inc.

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/50: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G16H30/00: ICT specially adapted for the handling or processing of medical images
    • G16H30/20: ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS

Definitions

  • the embodiments disclosed herein are generally directed towards systems and methods for performing multi-modal assessments of disease risk in patients. More specifically, there is a need for systems and methods that can use multiple modalities of data (e.g., whole genome sequencing, advanced imaging, metagenomic sequencing, metabolome, clinical labs, etc.) to make predictions about an individual’s health status.
  • Embodiments of the disclosure relate to multimodal assessment of metabolic diseases such as diabetes, hypertension and obesity, using a wide variety of genomic, imaging, metabolomics, and laboratory data.
  • the methods of the disclosure include whole genome sequencing, advanced imaging, metagenomic sequencing, metabolome, and clinical labs.
  • the multimodal platform described herein allows not only identification of previously undiagnosed disease states but also identification of early disease biomarkers.
  • the systems and methods of the disclosure are built from a large cohort with a wide range of data modalities, which allows for robust testing and/or validation of the associations between disease markers and the metabolic diseases.
  • the multimodal datasets comprise data from 1,253 self-assessed healthy adults (median age 53; 63% male).
  • an independent female-only validation dataset consisting of 1,083 adults with longitudinal data was also included in the cohort.
  • a comprehensive analysis was conducted, enabling identification of novel signatures and/or patterns associated with disease risk. Based on these signatures, patients could be stratified and their current disease states and/or disease transition states could be identified reliably and accurately.
  • the systems and methods of the disclosure include an amalgamation of machine learning analyses including cross-modality associations, formation of modules of densely connected features, which permitted identification of key biomarkers, clustering individuals into distinct health risk groups with corresponding biomarker signatures, and enrichment of longitudinal outcomes of individuals within each risk group.
  • the systems and methods of the disclosure permit assessment of health status of subjects who are identified to be at risk and also of diseased subjects who are undergoing various types of lifestyle, dietary and/or therapeutic interventions.
  • the disclosure relates to a method for diagnosing a metabolic syndrome in a subject, comprising, a) normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features; b) identifying statistically significant associations across the data features across each modality to identify correlations between the modalities and form a correlation network; c) analyzing structures of the correlations network by forming modules; d) performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module; e) partitioning a cohort of subjects into distinct health profiles with corresponding biomarker signatures; f) optionally integrating personal history and, further optionally integrating the longitudinal disease diagnosis data for each subject to strengthen the health profile of each subject; and g) determining a metabolic syndrome risk for each subject based on the health profile of each subject.
  • the disclosure relates to a method for diagnosing a metabolic syndrome according to the foregoing or the following, wherein the modalities include whole-genome sequencing (WGS), microbiome, global metabolome, laboratory analysis, magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history.
  • the disclosure relates to a method for diagnosing a metabolic syndrome according to the foregoing or the following, wherein the modality includes laboratory analysis comprising lab-developed tests for insulin resistance and prediabetes.
  • the disclosure relates to a method for diagnosing a metabolic syndrome according to the foregoing or the following, comprising amalgamation of machine learning analyses including identification of significant cross-modality associations; formation of modules of densely connected features to identify key biomarkers; clustering individuals into distinct health risk groups with corresponding biomarker signatures; and enrichment of longitudinal outcomes of individuals within each risk group.
  • the disclosure relates to a computer readable medium comprising computer-executable instructions, which, when executed by a processor, cause the processor to carry out a method or a set of steps for diagnosing a metabolic syndrome in a subject, the method or the set of steps comprising, a) normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features; b) identifying statistically significant associations across the data features across each modality to identify correlations between the modalities and form a correlations network; c) analyzing structures of the correlations network by forming modules; d) performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module; e) partitioning a cohort of subjects into distinct health profiles with corresponding biomarker signatures; f) optionally integrating personal history and, further optionally integrating the longitudinal disease diagnosis data for each subject to strengthen the health profile of each subject; and g)
  • the disclosure relates to a system for diagnosis of a metabolic syndrome , comprising: a) a normalizer for normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features; b) a concatenate engine for identifying statistically significant associations across the data features across each modality to identify correlations between the modalities to form a correlations network; c) a structure analyzer for analyzing structures of the correlations network by forming modules; d) a graphical analyzer for performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module; e) a clustering module for partitioning a cohort of subjects into clusters of subjects with distinct health profiles with corresponding biomarker signatures; f) an integrator for optionally integrating each subject’s personal history and, further optionally integrating each subject’s longitudinal disease diagnosis data to strengthen the health profile of each subject; and g) a
  • FIGS. 1A-1E depict an outline of how a multimodal health assessment can be performed, in accordance with various embodiments.
  • FIG. 1A shows various modalities of features that can be collected from individuals.
  • FIG. 1B shows that data can be analyzed by performing cross-modality associations on Gaussian transformed features after correcting for age, sex and ancestry.
  • FIG. 1C shows that using the associations, community detection analysis can be performed and modules of densely connected features can be identified.
  • FIG. 1D shows that conditional independence network analysis (also referred to as a Markov Network) could be performed to reduce the correlatedness of features and identify key biomarker features.
  • FIG. 1E shows that individuals could be clustered into distinct groups of health profiles using the identified biomarkers of the disclosure, which clusters were then used to perform disease risk enrichment analysis.
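  • As an illustration of the Gaussian transformation mentioned for FIG. 1B, the following minimal Python sketch applies a rank-based inverse normal transform to a single feature. The disclosure does not state which transformation was used, so the rank-based approach, the function name rank_inverse_normal, and the offset parameter are assumptions made for illustration; the correction for age, sex and ancestry is not shown.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.5):
    """Map values to standard-normal quantiles by rank (ties share the mean rank).

    Assumption: a rank-based transform; the source only says "Gaussian transformed".
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - offset) / n lies strictly between 0 and 1, so inv_cdf is defined.
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


# Hypothetical, heavily skewed feature values.
skewed = [1.0, 2.0, 4.0, 8.0, 100.0]
transformed = rank_inverse_normal(skewed)
```

  • Because only ranks are used, the extreme value 100.0 no longer dominates: the transformed values are symmetric around zero and preserve the original ordering.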
  • FIGS. 2A-2B depict the results of cross-modality correlations for various pairings of modalities, in accordance with various embodiments.
  • FIG. 2A shows the number of significant cross-modality correlations for each pair of modalities. The percentages shown are the proportion of correlations that were significant out of all possible pairwise associations between the modality pair.
  • FIG. 2B shows associations between p-cresol sulfate metabolite and (top) abundance of Intestinimonas genus, and (bottom) abundance of an unclassified genus in the Erysipelotrichaceae family.
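  • The percentages described for FIG. 2A can be reproduced with simple arithmetic: the number of significant associations divided by the number of feature pairs that can be formed across the two modalities. The sketch below assumes that "all possible pairwise associations" means the product of the two feature counts; the function name and the counts are hypothetical and are not taken from the disclosure.

```python
def significant_fraction(n_significant, n_features_a, n_features_b):
    """Share of significant associations among all cross-modality feature pairs."""
    possible = n_features_a * n_features_b
    if possible == 0:
        raise ValueError("each modality needs at least one feature")
    return n_significant / possible


# Hypothetical counts for one modality pair.
fraction = significant_fraction(n_significant=30, n_features_a=20, n_features_b=50)
percent_label = f"{100 * fraction:.1f}%"
```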
  • FIGS. 3A-3B depict a multi-modal cardiometabolic module used in an analysis of cardiovascular and metabolic disease risk, in accordance with various embodiments.
  • FIG. 3A shows identification of key biomarker features that represent the cardiometabolic module using Markov network analysis. These reduced interactions highlight the most important associations after removing edges corresponding to indirect associations. It was observed that the microbiome genera Butyrivibrio and Pseudoflavonifractor are the most relevant microbiome genera in the context of this module that interfaces with features from other modalities.
  • FIG. 3B shows clustering of individuals using the key biomarkers. The heatmap shows z-statistics from logistic regression for an association between each cluster and each feature. The plot on the left shows the 22 key cardiometabolic biomarkers. The plots on the right show significant associations that emerged from an analysis against the full set of 1,385 features.
  • FIGS. 4A-4B depict disease enrichment and longitudinal outcomes of cardiometabolic clusters, in accordance with various embodiments.
  • FIG. 4A shows bar plots of the prevalence of disease at baseline (combined Discovery and TwinsUK baseline cohorts; FIG. 7A and FIG. 7B show them individually) and the incidence of disease (i.e., only the new cases of disease) after a median of 5.6 years of follow-up (TwinsUK cohort). *p < 0.05, **p < 0.005.
  • FIG. 4B shows the rates at which individuals from each cluster transition into other clusters after a median of 5.6 years of follow-up. The plot shows individuals per cluster (1 to 7) at baseline visit that transition to other clusters during the follow-up.
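  • The quantities plotted in FIGS. 4A-4B can be sketched as follows: cluster transition rates are row-normalized counts of baseline-to-follow-up cluster pairs, and incidence counts only new cases. This is a minimal illustration; the function names, the toy data, and the choice to divide new cases by the number of individuals who were disease-free at baseline are assumptions, not details stated in the disclosure.

```python
from collections import Counter, defaultdict


def transition_rates(baseline, follow_up):
    """Rates of moving from each baseline cluster to each follow-up cluster."""
    counts = defaultdict(Counter)
    for person, start in baseline.items():
        if person in follow_up:  # only individuals seen at both visits
            counts[start][follow_up[person]] += 1
    return {
        start: {end: n / sum(ends.values()) for end, n in ends.items()}
        for start, ends in counts.items()
    }


def incidence(disease_at_baseline, disease_at_follow_up):
    """Fraction of individuals disease-free at baseline who are new cases at follow-up."""
    at_risk = [p for p, sick in disease_at_baseline.items()
               if not sick and p in disease_at_follow_up]
    if not at_risk:
        return 0.0
    new_cases = sum(1 for p in at_risk if disease_at_follow_up[p])
    return new_cases / len(at_risk)


# Hypothetical toy cohort of four individuals.
baseline_cluster = {"a": 1, "b": 1, "c": 7, "d": 7}
follow_up_cluster = {"a": 1, "b": 2, "c": 7, "d": 7}
rates = transition_rates(baseline_cluster, follow_up_cluster)

diabetic_baseline = {"a": False, "b": False, "c": True, "d": False}
diabetic_follow_up = {"a": False, "b": True, "c": True, "d": False}
new_case_rate = incidence(diabetic_baseline, diabetic_follow_up)
```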
  • FIGS. 5A-5B depict a multi-modal microbiome richness module used in analysis of diversity in an individual’s gut microbiome, in accordance with various embodiments.
  • FIG. 5A shows the identification of key biomarker features that represent the microbiome richness module using Markov network analysis.
  • FIG. 5B shows the clustering of individuals using the key biomarkers.
  • the heatmap shows z-statistics from logistic regression for an association between each cluster and each feature.
  • FIGS. 6A-6B show further details of the clustering of individuals in the multi-modal cardiometabolic module analysis, in accordance with various embodiments.
  • the heatmap shows the Z-statistics from a logistic regression for an association between each cluster and each feature.
  • the plot on the left shows the 22 key cardiometabolic features.
  • the plots on the right show significant associations that emerged from an analysis against the full set of 1,385 features.
  • the first plot begins with the features that had significant associations with multiple clusters, and the remaining plots show features that were significantly associated with only one cluster.
  • the highlighted groups e.g., Lipid Group 1, Lipid Group 2, etc.
  • FIGS. 7A-7B depict bar plots showing prevalence of disease diagnoses, in accordance with various embodiments.
  • FIG. 7A and FIG. 7B show the Discovery and TwinsUK cohorts, respectively, at baseline.
  • the combined cohort (*p < 0.05, **p < 0.005) is shown in FIG.
  • FIGS. 8A-8B show further details of the clustering of individuals in the multi-modal microbiome richness module analysis, in accordance with various embodiments.
  • the heatmap shows the Z-statistics from a logistic regression for an association between each cluster and each feature.
  • the plot on the left shows the 24 key biomarkers.
  • the plots on the right show significant associations that emerged from an analysis against the full set of 1,385 features.
  • the first plot begins with the features that had significant associations with multiple clusters, and the remaining plots show features that were significantly associated with only one cluster. All features replicated in the TwinsUK validation cohort (with some exceptions).
  • FIG. 9 is an exemplary flowchart showing a method for diagnosing metabolic syndrome in a subject, in accordance with various embodiments.
  • FIG. 10 is an illustration of a system for diagnosing metabolic syndrome in subjects, in accordance with various embodiments.
  • FIG. 11 is a block diagram that illustrates a computer system, in accordance with various embodiments.
  • the disclosure relates to various exemplary embodiments of systems and methods for performing multi-modal assessments of disease risk in patients.
  • the disclosure is not limited to these exemplary embodiments and applications or to the manner in which the exemplary embodiments and applications operate or are described herein.
  • the figures may show simplified or partial views, and the dimensions of elements in the figures may be exaggerated or otherwise not in proportion.
  • one element (e.g., a material, a layer, a substrate, etc.) can be “on,” “attached to,” “connected to,” or “coupled to” another element regardless of whether the one element is directly on, attached to, connected to, or coupled to the other element or there are one or more intervening elements between the one element and the other element.
  • where reference is made to a list of elements (e.g., elements a, b, c), such reference is intended to include any one of the listed elements by itself, any combination of less than all of the listed elements, and/or a combination of all of the listed elements. Section divisions in the specification are for ease of review only and do not limit any combination of elements discussed.
  • Enzymatic reactions and purification techniques are performed according to manufacturer’s specifications or as commonly accomplished in the art or as described herein.
  • the techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000).
  • the nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein are those well-known and commonly used in the art.
  • next generation sequencing refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands of relatively small sequence reads at a time.
  • next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. More specifically, the MISEQ, HISEQ and NEXTSEQ Systems of Illumina and the Personal Genome Machine (PGM) and SOLiD Sequencing System of Life Technologies Corp. provide massively parallel sequencing of whole or targeted genomes.
  • the phrase “genomic features” can refer to a genome region with some annotated function (e.g., a gene, protein coding sequence, mRNA, tRNA, rRNA, repeat sequence, inverted repeat, miRNA, siRNA, etc.) or a genetic/genomic variant (e.g., single nucleotide polymorphism/variant, insertion/deletion sequence, copy number variation, inversion, etc.) which denotes a single or a grouping of genes (in DNA or RNA) that have undergone changes as referenced against a particular species or sub-populations within a particular species due to mutations, recombination/crossover or genetic drift.
  • Genomic variants can be identified using a variety of techniques, including, but not limited to: array-based methods (e.g., DNA microarrays, etc.), real-time/digital/quantitative PCR instrument methods and whole or targeted nucleic acid sequencing systems (e.g., NGS systems, Capillary Electrophoresis systems, etc.). With nucleic acid sequencing, coverage data can be available at single base resolution.
  • substantially means sufficient to work for the intended purpose.
  • the term “substantially” thus allows for minor, insignificant variations from an absolute or perfect state, dimension, measurement, result, or the like such as would be expected by a person of ordinary skill in the field but that do not appreciably affect overall performance.
  • substantially means within ten percent.
  • the term “plurality” can be 2, 3, 4, 5, 6, 7, 8, 9, 10, or more.
  • modalities were used to collect the data.
  • the modalities included whole genome sequencing (WGS), microbiome sequencing, global metabolome, insulin resistance (IR) and impaired glucose tolerance (IGT) laboratory-developed tests (Quantose™), whole body and brain magnetic resonance imaging (MRI), dual-energy X-ray absorptiometry (DEXA), computed tomography (CT) scan, routine clinical labs, personal/family history of disease and medication, and vitals/anthropometric measurements.
  • Table 1 shows the number of individuals and the number of features measured per modality.
  • the body composition features from DEXA and MRI were combined and treated as a separate modality (“Body composition”).
  • MRI: magnetic resonance imaging; DEXA: dual-energy X-ray absorptiometry; CT: computed tomography.
  • CT scans were performed on individuals over the age of 35 years. Patients were scanned during a single breath-hold using a 64-slice GE Healthcare EVO Revolution scanner (GE Healthcare, Milwaukee, Wisconsin). Gated axial scans with 2.5 mm slice thickness were performed using a tube energy of 120 kVp and the tube current adjusted for individuals' body mass index. Images were subsequently analyzed using an AW VolumeShare 7 workstation (GE Healthcare, Milwaukee, Wisconsin) and regions of coronary calcification were manually identified in order to compute Coronary Artery Calcium (CAC) Agatston scores. Multi-Ethnic Study of Atherosclerosis (MESA) reference CAC values were used to calculate the percentile of calcification for each individual matched for age, sex and ethnicity.
  • for microbiome sequencing, whole genome sequencing was performed on stool samples to analyze the microbial communities.
  • the features included species richness, species diversity, the fraction of human DNA, Proteobacteria, and the abundance of 72 genera.
  • Microbiome species richness is defined as the number of species present at a relative abundance greater than 10^-4.
  • Microbiome species diversity is defined as the Shannon entropy of the taxon abundance vector.
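  • The two microbiome definitions above can be sketched in a few lines of Python. The function names are hypothetical, the richness threshold is exposed as a parameter (the default of 1e-4 reflects a reading of the stated cutoff and is an assumption), and the natural logarithm is assumed because the disclosure does not state a base for the Shannon entropy.

```python
import math


def species_richness(relative_abundance, threshold=1e-4):
    """Count species whose relative abundance exceeds the threshold."""
    return sum(1 for p in relative_abundance if p > threshold)


def shannon_diversity(abundance):
    """Shannon entropy (natural log, an assumption) of a taxon abundance vector."""
    total = sum(abundance)
    if total <= 0:
        raise ValueError("abundance vector must have a positive sum")
    entropy = 0.0
    for a in abundance:
        if a > 0:  # taxa with zero abundance contribute nothing
            p = a / total
            entropy -= p * math.log(p)
    return entropy


# Hypothetical abundance vectors.
richness = species_richness([0.5, 0.4999, 0.00005, 0.00005])
even_diversity = shannon_diversity([0.25, 0.25, 0.25, 0.25])
skewed_diversity = shannon_diversity([0.97, 0.01, 0.01, 0.01])
```

  • An evenly distributed community yields the maximum entropy for its number of taxa, while a community dominated by a single taxon yields a lower value.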
  • Nodes were allowed to belong to multiple modules. This was allowed only when a node was assigned to a module by the community detection algorithm but had more than 20 significant associations in another module (or more associations with another module than it had with its assigned module).
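  • The overlap rule described above can be expressed as a small function. This is a sketch under stated assumptions: the community detection step is taken as given (its output is the hypothetical assigned mapping), each significant association is an undirected edge, and the names module_memberships and min_links are illustrative only.

```python
from collections import Counter


def module_memberships(assigned, significant_edges, min_links=20):
    """Return {node: set of modules}, adding extra modules per the overlap rule.

    A node also joins another module if it has more than `min_links` significant
    associations there, or more associations there than in its assigned module.
    """
    links = {node: Counter() for node in assigned}
    for u, v in significant_edges:
        links[u][assigned[v]] += 1
        links[v][assigned[u]] += 1
    memberships = {}
    for node, home in assigned.items():
        modules = {home}
        for other, n in links[node].items():
            if other != home and (n > min_links or n > links[node][home]):
                modules.add(other)
        memberships[node] = modules
    return memberships


# Hypothetical assignment from a community detection algorithm.
assigned = {"x": "A", "w": "A", "y": "B", "z": "B"}
edges = [("x", "w"), ("x", "y"), ("x", "z"), ("y", "z")]
memberships = module_memberships(assigned, edges)
```

  • In this toy network, node x is assigned to module A but has two associations in module B against one in A, so it belongs to both modules.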
  • a list of candidate biomarkers was initially selected using eigenvector centrality. More precisely, for the subnetwork corresponding to each of these two modules, all the nodes were ranked according to their eigenvector centrality score. For the cardiometabolic module, the 50 most central features were selected and for the microbiome richness module, the 40 most central features were selected.
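  • Ranking nodes by eigenvector centrality can be sketched with power iteration on an unweighted, undirected subnetwork. This is an illustration only: the disclosure does not say how the scores were computed or whether edges were weighted, and the function names and the toy graph are hypothetical.

```python
import math


def eigenvector_centrality(adjacency, iterations=1000, tol=1e-10):
    """Power iteration for the leading eigenvector of an undirected graph.

    `adjacency` maps each node to the set of its neighbours.
    """
    nodes = list(adjacency)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Iterating with (A + I) keeps the leading eigenvector of A unchanged
        # but avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[m] for m in adjacency[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score


def top_central(adjacency, k):
    """Return the k nodes with the highest eigenvector centrality."""
    scores = eigenvector_centrality(adjacency)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical subnetwork: a hub linked to a, b, c; a-b linked; d hangs off c.
graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub", "d"},
    "d": {"c"},
}
scores = eigenvector_centrality(graph)
most_central = top_central(graph, 1)
```

  • Selecting the 50 or 40 most central features then corresponds to calling top_central with k set to 50 or 40 on the module's subnetwork.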
  • the Markov network allows the identification of features that were only associated with the rest of the network through other features in their own modality. By dropping such features, a final set of biomarkers was obtained, both for the cardiometabolic module and for the microbiome richness module.
  • These key biomarkers were then utilized to cluster the individuals in the cohort. Individuals were selected based on whether they had key biomarkers (for a total of 668 individuals for the cardiometabolic module, and 640 individuals in the microbiome diversity module). The resulting data matrix had each feature scaled to have zero mean and unit variance. The missing values were imputed using softImpute.
  • Hierarchical clustering was performed on the set of individuals based on complete linkage and a correlation distance metric, and clusters were extracted from the dendrogram.
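  • The scaling and clustering steps described above can be sketched in plain Python. The sketch assumes a complete data matrix (imputation is not shown), uses 1 minus the Pearson correlation as the correlation distance, and cuts the hierarchy at a fixed number of clusters; how clusters were actually extracted from the dendrogram is not stated, so n_clusters and all function names are assumptions.

```python
import math


def zscore(column):
    """Scale one feature to zero mean and unit variance (assumes it is not constant)."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd for x in column]


def correlation_distance(a, b):
    """1 - Pearson correlation between two individuals' feature profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    na = math.sqrt(sum((x - ma) ** 2 for x in a))
    nb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (na * nb)


def complete_linkage(rows, n_clusters):
    """Merge the closest clusters until n_clusters remain; returns lists of row indices."""
    clusters = [[i] for i in range(len(rows))]
    dist = {(i, j): correlation_distance(rows[i], rows[j])
            for i in range(len(rows)) for j in range(i + 1, len(rows))}

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist[min(i, j), max(i, j)] for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical matrix: 4 individuals (rows) by 3 biomarkers (columns).
raw = [
    [1.0, 5.0, 2.0],
    [1.2, 5.5, 2.1],
    [4.0, 1.0, 6.0],
    [4.2, 0.8, 6.3],
]
scaled_columns = [zscore(list(col)) for col in zip(*raw)]
scaled_rows = [list(row) for row in zip(*scaled_columns)]
groups = complete_linkage(scaled_rows, n_clusters=2)
```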
  • the validation cohort comprised 1,083 individuals from a study cohort (referred to here as “TwinsUK”) of largely European-ancestry female twins enrolled in the TwinsUK registry, a British national register of adult twins.
  • the cohort included data from WGS, metabolome, microbiome, DEXA, clinical blood labs, and personal history of disease and medication.
  • the data from the modalities was collected from three longitudinal visits over the course of a median of 13 years. To capture a population with adequate sample sizes for the overlapping modalities used in the present study, the analysis was restricted to data from visit 2 (referred to here as “baseline”) and visit 3 (referred to here as “follow-up”). Microbiome samples were only collected at visit 3.
  • phenotyping measurements were required to be collected within 90 days of the metabolome draw for each visit, or within 6 months for microbiome.
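  • The time-window requirement above can be checked with a simple date comparison. In this sketch the window is treated as symmetric around the reference date and "6 months" is approximated as 183 days; both choices, along with the names used, are assumptions rather than details from the disclosure.

```python
from datetime import date

METABOLOME_WINDOW_DAYS = 90
MICROBIOME_WINDOW_DAYS = 183  # assumption: "6 months" approximated as 183 days


def within_window(measurement_date, reference_date, max_days):
    """True if the measurement falls within max_days of the reference date."""
    return abs((measurement_date - reference_date).days) <= max_days


# Hypothetical dates for one individual.
reference = date(2015, 1, 1)
early_visit = within_window(date(2015, 3, 1), reference, METABOLOME_WINDOW_DAYS)
late_visit = within_window(date(2015, 6, 1), reference, METABOLOME_WINDOW_DAYS)
late_visit_microbiome = within_window(date(2015, 6, 1), reference, MICROBIOME_WINDOW_DAYS)
```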
  • for metabolome and microbiome correlations, only one of the twins was used to avoid bias from relatedness, totaling 538 individuals.
  • liver fat, Gamma-Glutamyl Transferase (GGT), IGT, IR and glucose were imputed using regularized linear regression with an L1 penalty (R glmnet package).
  • the third largest number of significant associations was between metabolome and body composition.
  • BMI body mass index
  • VAT visceral adipose tissue
  • IR insulin resistance
  • body composition features e.g. BMI, VAT, android/gynoid ratio, fat mass and lean mass
  • pCS p-cresol sulfate
  • pCS is a microbial metabolite associated with accelerated cardiovascular disease and renal disease progression, a potential uremic toxin. It is a sulfated phenolic compound generated in the colon by bacterial fermentation of tyrosine.
  • the associations of pCS with species diversity and the Ruminococcaceae family have previously been observed, but not those with Intestinimonas and a genus in the Erysipelotrichaceae family. The associations were validated in an independent TwinsUK cohort (see Methods; Table 2).
  • Table 2 shows microbiome genera that are associated with the metabolite p-cresol sulfate in both the discovery cohort and the replication cohort.
  • the cardiometabolic module in the association network contained 355 nodes from clinical labs, metabolome, quantose, CT, microbiome, vitals, genetics, MRI-body and body composition data modalities.
  • the features in this module were ranked by their relative centrality in the module using eigenvector centrality score (see Methods), and several markers associated with obesity, heart disease, and metabolic syndrome were verified. Thus, the module was assigned its name - cardiometabolic module.
  • the most central features for the module were VAT, BMI, liver fat percentage, lean mass percentile, glucose levels, blood pressure, triglycerides levels, IR score, several lipid metabolites, and several microbiome genera, including butyrate-producing bacterial genera such as Pseudoflavonifractor, Butyrivibrio, Intestinimonas, and Faecalibacterium.
  • although the module provides a general overview of how these features are interconnected, its construction is based only on pairwise associations. As such, it contains a significant amount of redundancy (e.g., two metabolites from the same pathway are likely to be connected to the same features from other modalities) and transitive edges (i.e., if A and B are associated, and B and C are associated, an association between A and C is likely to be observed).
  • the 50 most central features were picked and the inverse covariance matrix was computed. This matrix defines a new network (called the Markov network) on these 50 features with the property that features A and B are only connected if they are correlated conditioned on all other features.
  • the resulting network is shown in FIG.
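  • The way an inverse covariance matrix removes transitive edges can be illustrated with three features in which A and C are related only through B. The sketch below inverts a small covariance matrix, converts it to partial correlations, and keeps edges above a cutoff. The cutoff, the function names, and the toy covariance matrix are assumptions for illustration; the disclosure states only that the inverse covariance matrix was computed.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(covariance):
    """Partial correlation of each pair given all other features."""
    precision = invert(covariance)
    n = len(precision)
    return [[1.0 if i == j else
             -precision[i][j] / math.sqrt(precision[i][i] * precision[j][j])
             for j in range(n)] for i in range(n)]


def markov_edges(covariance, names, threshold=0.05):
    """Keep pairs whose partial correlation exceeds the (assumed) threshold."""
    partial = partial_correlations(covariance)
    return [(names[i], names[j]) for i in range(len(names))
            for j in range(i + 1, len(names)) if abs(partial[i][j]) > threshold]


# Hypothetical covariance: A-B and B-C correlate at 0.5, A-C at 0.25 (= 0.5 * 0.5).
cov = [[1.0, 0.5, 0.25],
       [0.5, 1.0, 0.5],
       [0.25, 0.5, 1.0]]
partial = partial_correlations(cov)
edges = markov_edges(cov, ["A", "B", "C"])
```

  • Although A and C have a pairwise correlation of 0.25, their partial correlation given B is zero, so the A-C edge is dropped while A-B and B-C remain.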
  • the Markov network emphasizes the most direct connections in the module. It suggests that (a) microbiome genera Butyrivibrio and Pseudoflavonifractor are “closest” to the remainder of the cardiometabolic module via a lipid metabolite 1-(1-enyl-palmitoyl)-2-oleoyl-GPC (P-16:0/18:1) and serum triglyceride, (b) systolic and diastolic blood pressure are mostly redundant from the point of the central variables in the module, demonstrated by the thickness of the edges, and (c) liver iron and gamma-tocopherol/beta-tocopherol are only associated to the rest of the module through other variables in their respective modalities. These observations allow a determination of a pruned set of 22 key cardiometabolic features (referred to as key biomarkers).
  • the key biomarkers included known and expected features for cardiac and metabolic conditions (such as BMI, blood pressure, glucose levels and HDL) but also novel biomarkers (such as several metabolites and microbiome genera) that distinguish susceptibility to disease morbidity (FIG. 3A). High abundance of the microbiome genera Butyrivibrio and Pseudoflavonifractor was well correlated with good cardiometabolic health.
  • the individuals in cluster 1 can be characterized as containing perceived healthiest individuals, with a markedly higher lean mass percentile and low IR score.
  • This cluster is notable for its lower blood pressure, lower butyrylcarnitine levels, and higher HDL.
  • the IR score and lean mass percentile for clusters 2 and 3 were not as healthy as those of cluster 1.
  • cluster 2 displays the lowest glutamate values
  • cluster 3 is characterized by the lowest blood pressure and the highest levels of 3-hydroxybutyrate.
  • Cluster 4 is distinguished by an Impaired Glucose Tolerance (IGT) score that is higher than in the other clusters with healthy individuals and high levels of Apolipoprotein-A (Apo-A) and HDL cholesterol.
  • Cluster 5 contains largely overweight individuals who nonetheless have low IR scores and low IGT.
  • Cluster 6 contains mostly overweight and obese individuals with high android/gynoid ratios and IR scores who were specifically characterized by the highest Apo-B, cholesterol in very low-density lipoprotein, and triglycerides of any cluster.
  • Cluster 7 contains the least healthy individuals with respect to the markers in consideration, with a high prevalence of obesity, body fat and insulin resistance.
  • the cardiometabolic key biomarkers that were the largest drivers of this association between diabetes and cluster 7 were the IR score, percent lean body mass, and the metabolites 1-stearoyl-2-dihomo-linolenoyl-GPC (18:0/20:3n3 or 6) and 1-(1-enyl-palmitoyl)-2-oleoyl-GPC (P-16:0/18:1).
  • the above-mentioned features were the four that were significantly associated with diabetes status. They were also significant predictors of cluster 7, in addition to liver fat, HDL cholesterol, Pseudoflavonifractor, and the metabolites lactate and 1-eicosenoyl-GPC (20:1).
  • p-Cresol sulfate (pCS) is a microbial metabolite and is often considered to be a uremic toxin. It is produced by bacteria fermenting undigested dietary proteins that escape absorption in the small bowel. It appears to be elevated in the sera of chronic kidney disease (CKD) patients, and it is associated with increased mortality in patients with CKD and an increased risk of cardiovascular events.
  • Intestinimonas is known for its butyrate producing species by digesting lysine and fructoselysine in the human gut, but is otherwise not well described.
  • The Erysipelotrichaceae family might be immunogenic and can potentially flourish post-treatment with broad-spectrum antibiotics. An increased abundance of Erysipelotrichaceae has been observed in obese individuals, and several other lines of evidence suggest its role in lipid metabolism. These novel associations were validated in TwinsUK and could further be analyzed for therapeutic targets to decrease pCS levels and its toxicity.
  • a cardiometabolic module with key biomarkers consisting of novel features in addition to the traditional clinical features from several modalities was identified.
  • the potentially novel biomarkers included abundance of the microbiome genera Butyrivibrio and Pseudoflavonifractor and several metabolites, such as 1-(1-enyl-palmitoyl)-2-oleoyl-GPC, 1-eicosenoyl-GPC, glutamate, and 1-stearoyl-2-dihomo-linolenoyl-GPC. Clustering of individuals using the key biomarkers revealed signatures of disease states.
  • profiles for healthy individuals were identified, which are consistent with very low prevalence of diabetes, hypertension, and obesity (Cluster 1) and a profile for individuals displaying comorbidity for diabetes (Cluster 7).
  • the cluster membership for individuals was a better predictor of diabetes than the traditional clinical biomarkers such as glucose, BMI and insulin resistance.
  • the novel biomarkers in the diabetes signature included 1-stearoyl-2-dihomo-linolenoyl-GPC and 1-(1-enyl-palmitoyl)-2-oleoyl-GPC. Longitudinal disease outcome analysis using follow-up TwinsUK data found an early disease signature for hypertension (Cluster 6).
  • cluster 7 the unhealthiest cluster
  • These signatures can be used to prioritize individuals for intervention.
  • Analysis of the microbiome richness module revealed the xenobiotic metabolite cinnamoylglycine as a potential biomarker for health associated with microbiome species richness and lean mass percentage. Cinnamoylglycine was observed to be abundant in individuals in cluster 1, which represents healthy individuals.
  • An early disease signature for hypertension was identified, as were individuals at risk for a poor health outcome.
  • the xenobiotic metabolite cinnamoylglycine was found to be a potential biomarker for health associated with microbiome species richness and lean mass percentage.
  • novel associations and biomarker signatures were identified that stratify individuals into distinct disease subtypes, including early disease states, an essential step towards personalized, preventative health risk assessment.
  • FIG. 9 is an exemplary flowchart showing a method for diagnosing metabolic syndrome in a subject, in accordance with various embodiments.
  • method 900 details an exemplary method for determining metabolic syndrome risk for an individual, in accordance with various embodiments.
  • heterogeneous data features are derived from a plurality of modalities, wherein each modality comprises a plurality of data features.
  • the modalities include, but are not limited to: whole-genome sequencing (WGS), microbiome, global metabolome, laboratory analysis, magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history.
  • In step 904, statistically significant associations are identified across the data features across each modality to identify correlations between the modalities and form a correlations network.
  • In step 906, structures of the correlations network are analyzed by forming modules.
  • In step 908, an in-depth analysis of selected modules using probabilistic graphical models is performed to identify a network of key biomarkers that represent the module.
  • a cohort of subjects is partitioned into distinct health profiles with corresponding biomarker signatures.
  • the cohorts are clustered in order to partition them into distinct health profiles.
  • In step 912, personal history and longitudinal disease diagnosis data are optionally integrated for each subject in the cohort of subjects to strengthen the health profile of each subject.
  • In step 914, a metabolic syndrome risk is determined for each subject based on the health profile associated with each subject.
  • FIG. 10 is an illustration of a system for diagnosing metabolic syndrome in subjects, in accordance with various embodiments.
  • system 1000 comprises a computing device/server 1004 that is in communication with a plurality of different modalities of data sources 1002.
  • the computing device/server 1004 can be configured to host a normalizer 1006, a concatenate engine 1008, a structure analyzer 1010, a graphical analyzer 1012, a clustering module 1014, an integrator 1016 and a risk assessor 1018.
  • the normalizer 1006 can be configured to normalize heterogeneous data features derived from the plurality of different modalities.
  • each modality comprises a plurality of data features.
  • the modalities include, but are not limited to: whole-genome sequencing (WGS), microbiome, global metabolome, laboratory analysis, magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history.
  • the concatenate engine 1008 can be configured to identify statistically significant associations across the data features across each modality to identify correlations between the modalities to form a correlations network.
  • the structure analyzer 1010 can be configured to analyze structures of the correlations network by forming modules.
  • the graphical analyzer 1012 can be configured to perform in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module.
  • the clustering module 1014 can be configured to partition a cohort of subjects into clusters of subjects with distinct health profiles with corresponding biomarker signatures. In various embodiments, the cohorts are clustered in order to partition them into distinct health profiles.
  • the integrator 1016 can be configured to optionally integrate each subject’s personal history and, further optionally integrate each subject’s longitudinal disease diagnosis data to strengthen the health profile of each subject.
  • the risk assessor 1018 can be configured to assess a metabolic syndrome risk for each subject based on the health profile of each subject and send the assessment to a display 1020 that is communicatively connected with the computing device/server 1004.
  • FIG. 11 is a block diagram that illustrates a computer system 1100, upon which embodiments of the present teachings may be implemented.
  • computer system 1100 can include a bus 1102 or other communication mechanism for communicating information, and a processor 1104 coupled with bus 1102 for processing information.
  • computer system 1100 can also include a memory, which can be a random access memory (RAM) 1106 or other dynamic storage device, coupled to bus 1102 for determining instructions to be executed by processor 1104. Memory also can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1104.
  • computer system 1100 can further include a read-only memory (ROM) 1108 or other static storage device coupled to bus 1102 for storing static information and instructions for processor 1104.
  • a storage device 1110 such as a magnetic disk or optical disk, can be provided and coupled to bus 1102 for storing information and instructions.
  • computer system 1100 can be coupled via bus 1102 to a display 1112, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user.
  • An input device 1114 can be coupled to bus 1102 for communicating information and command selections to processor 1104.
  • a cursor control 1116 such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1104 and for controlling cursor movement on display 1112.
  • This input device 1114 typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), that allow the device to specify positions in a plane.
  • input devices 1114 allowing for three-dimensional (x, y, and z) cursor movement are also contemplated herein.
  • results can be provided by computer system 1100 in response to processor 1104 executing one or more sequences of one or more instructions contained in memory 1106.
  • Such instructions can be read into memory 1106 from another computer-readable medium or computer-readable storage medium, such as storage device 1110.
  • Execution of the sequences of instructions contained in memory 1106 can cause processor 1104 to perform the processes described herein.
  • hard-wired circuitry can be used in place of or in combination with software instructions to implement the present teachings.
  • implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.
  • The term “computer-readable medium” (e.g., data store, data storage, etc.) or “computer-readable storage medium” refers to any media that participate in providing instructions to processor 1104 for execution.
  • Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media.
  • Examples of non-volatile media can include, but are not limited to, optical, solid state, and magnetic disks, such as storage device 1110.
  • Examples of volatile media can include, but are not limited to, dynamic memory, such as memory 1106.
  • Examples of transmission media can include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 1102.
  • Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
  • instructions or data can be provided as signals on transmission media included in a communications apparatus or system to provide sequences of one or more instructions to processor 1104 of computer system 1100 for execution.
  • a communication apparatus may include a transceiver having signals indicative of instructions and data.
  • the instructions and data are configured to cause one or more processors to implement the functions outlined in the disclosure herein.
  • Representative examples of data communications transmission connections can include, but are not limited to, telephone modem connections, wide area networks (WAN), local area networks (LAN), infrared data connections, NFC connections, etc.
  • the methodologies described herein may be implemented by various means depending upon the application. For example, these methodologies may be implemented in hardware, firmware, software, or any combination thereof.
  • the processing unit may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro controllers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
  • the methods of the present teachings may be implemented as firmware and/or a software program and applications written in conventional programming languages such as C, C++, Python, etc. If implemented as firmware and/or software, the embodiments described herein can be implemented on a non-transitory computer-readable medium in which a program is stored for causing a computer to perform the methods described above. It should be understood that the various engines described herein can be provided on a computer system, such as computer system 1100 of FIG. 11, whereby processor 1104 would execute the analyses and determinations provided by these engines, subject to instructions provided by any one of, or a combination of, memory components 1106/1108/1110 and user input provided via input device 1114.
  • the specification may have presented a method and/or process as a particular sequence of steps.
  • the method or process should not be limited to the particular sequence of steps described.
  • other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims.
  • the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments.
  • the embodiments described herein can be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
  • the embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a network.
  • any of the operations that form part of the embodiments described herein are useful machine operations.
  • the embodiments described herein also relate to a device or an apparatus for performing these operations.
  • the systems and methods described herein can be specially constructed for the required purposes or may be implemented on a general purpose computer selectively activated or configured by a computer program stored in the computer.
  • various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
  • Certain embodiments can also be embodied as computer-readable code on a computer-readable medium.
  • the computer-readable medium is any data storage device that can store data, which can thereafter be read by a computer system. Examples of the computer-readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical, FLASH memory and non-optical data storage devices.
  • the computer-readable medium can also be distributed over a network coupled to computer systems so that the computer-readable code is stored and executed in a distributed fashion.

Abstract

The disclosure relates to systems, software and methods for diagnosis or prognosis of subjects for metabolic syndromes (e.g., obesity, hypertension, cardiovascular diseases), including, classification and treatment of subjects who have been diagnosed with or deemed at risk of metabolic syndromes. The methods are based, in part, on the multimodal analysis of a plurality of features, e.g., whole-genome sequencing (WGS), microbiome, global metabolome, laboratory analysis, magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history.

Description

MULTIMODAL SIGNATURES AND USE THEREOF IN THE DIAGNOSIS AND
PROGNOSIS OF DISEASES
FIELD
[0001] The embodiments disclosed herein are generally directed towards systems and methods for performing multi-modal assessments of disease risk in patients. More specifically, there is a need for systems and methods that can use multiple modalities of data (e.g., whole genome sequencing, advanced imaging, metagenomic sequencing, metabolome, clinical labs, etc.) to make predictions about an individual’s health status.
BACKGROUND
[0002] Despite the enormous U.S. healthcare spending of $3.3 trillion in 2016, one in three individuals aged 50-74 years die prematurely from major age-related chronic diseases. Challenging the status quo of reactive healthcare, preventative medicine is the key both to lowering healthcare costs and to improving health. One way to address the gap between traditional medicine and current science on predictive, preventive medicine is via systems medicine.
[0003] Systems medicine is the application of systems biology to the challenges of human health and disease. An interdisciplinary approach that measures, integrates, analyzes, and interprets a variety of clinical and non-clinical data (i.e., modalities) is critical for a deeper understanding of the mechanisms that determine health and disease states. Significant computation and statistical analysis are required to sift through the large, diverse data and search for patterns, whether related to specified biological processes or to stratify complex diseases into distinct subtypes for health assessment.
[0004] Recent studies have shown the utility of collecting and analyzing diverse high-throughput data using unsupervised computational methods for more comprehensive insights into biological systems. A computational framework for unsupervised integration of heterogeneous data can lead to identification of major drivers of variation in chronic lymphocytic leukemia. Moreover, multimodal phenotyping platforms can provide a comprehensive, predictive, preventative, and personalized assessment of an individual's health status. However, partly due to the disparate nature of the types of data, integration of various data forms is not always straightforward. There is therefore a need for algorithms, systems and methods, which permit multimodal assessment of diseases based on dissimilar data forms.
SUMMARY
[0005] Embodiments of the disclosure relate to multimodal assessment of metabolic diseases such as diabetes, hypertension and obesity, using a wide variety of genomic, imaging, metabolomics, and laboratory data. In some embodiments, the methods of the disclosure include whole genome sequencing, advanced imaging, metagenomic sequencing, metabolome, and clinical labs. The multimodal platform described herein allows not only identification of previously undiagnosed disease states but also identification of early disease biomarkers.
[0006] The systems and methods of the disclosure are built from a large cohort with a wide range of data modalities, which allows for robust testing and/or validation of the associations between disease markers and the metabolic diseases. In some embodiments, the multimodal datasets comprise data from 1,253 self-assessed healthy adults (median age 53; 63% male). Furthermore, an independent female-only validation dataset consisting of 1,083 adults with longitudinal data was also included in the cohort. Using unsupervised machine learning approaches, a comprehensive analysis was conducted, enabling identification of novel signatures and/or patterns associated with disease risk. Based on these signatures, patients could be stratified and their current disease states and/or disease transition states could be identified reliably and accurately.
[0007] Foundationally, the systems and methods of the disclosure include an amalgamation of machine learning analyses including cross-modality associations, formation of modules of densely connected features, which permitted identification of key biomarkers, clustering individuals into distinct health risk groups with corresponding biomarker signatures, and enrichment of longitudinal outcomes of individuals within each risk group. The systems and methods of the disclosure permit assessment of health status of subjects who are identified to be at risk and also of diseased subjects who are undergoing various types of lifestyle, dietary and/or therapeutic interventions.
[0008] In some embodiments, the disclosure relates to a method for diagnosing a metabolic syndrome in a subject, comprising, a) normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features; b) identifying statistically significant associations across the data features across each modality to identify correlations between the modalities and form a correlation network; c) analyzing structures of the correlations network by forming modules; d) performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module; e) partitioning a cohort of subjects into distinct health profiles with corresponding biomarker signatures; f) optionally integrating personal history and, further optionally integrating the longitudinal disease diagnosis data for each subject to strengthen the health profile of each subject; and g) determining a metabolic syndrome risk for each subject based on the health profile of each subject.
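As a rough illustration of steps b) and c), the following minimal Python sketch builds a small cross-modality network and groups its features into modules. It rests on several assumptions that are not taken from the disclosure: an absolute Pearson correlation cutoff (`min_abs_r`) stands in for a formal significance test, connected components stand in for the community detection used to form modules, and the feature names and values are invented toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cross_modality_edges(features, modality_of, min_abs_r=0.8):
    """Edges (f1, f2, r) between features that belong to different modalities.

    Assumption: a fixed |r| cutoff replaces a multiple-testing-corrected
    significance test.
    """
    edges = []
    for f1, f2 in combinations(sorted(features), 2):
        if modality_of[f1] == modality_of[f2]:
            continue  # only cross-modality pairs are of interest here
        r = pearson(features[f1], features[f2])
        if abs(r) >= min_abs_r:
            edges.append((f1, f2, r))
    return edges


def modules(edges):
    """Group features into connected components (a simple stand-in for modules)."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for f1, f2, _ in edges:
        parent[find(f1)] = find(f2)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: five subjects, four features from four modalities.
features = {
    "bmi": [22, 25, 30, 35, 28],
    "glucose": [85, 90, 105, 120, 98],
    "metabolite_a": [1.0, 1.2, 1.9, 2.4, 1.6],
    "genus_x": [5, 1, 4, 2, 3],
}
modality_of = {
    "bmi": "body_composition",
    "glucose": "clinical_lab",
    "metabolite_a": "metabolome",
    "genus_x": "microbiome",
}
edges = cross_modality_edges(features, modality_of)
print(modules(edges))
```

In this toy example the three strongly correlated features end up in one module, while the weakly correlated `genus_x` has no edges and therefore joins no module.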
[0009] In some embodiments, the disclosure relates to a method for diagnosing a metabolic syndrome according to the foregoing or the following, wherein the modalities include whole-genome sequencing (WGS), microbiome, global metabolome, laboratory analysis, magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history.
[0010] In some embodiments, the disclosure relates to a method for diagnosing a metabolic syndrome according to the foregoing or the following, wherein the modality includes laboratory analysis comprising lab-developed tests for insulin resistance and prediabetes.
[0011] In some embodiments, the disclosure relates to a method for diagnosing a metabolic syndrome according to the foregoing or the following, comprising amalgamation of machine learning analyses including identification of significant cross-modality associations; formation of modules of densely connected features to identify key biomarkers; clustering individuals into distinct health risk groups with corresponding biomarker signatures; and enrichment of longitudinal outcomes of individuals within each risk group.
[0012] In some embodiments, the disclosure relates to a computer readable medium comprising computer-executable instructions, which, when executed by a processor, cause the processor to carry out a method or a set of steps for diagnosing a metabolic syndrome in a subject, the method or the set of steps comprising, a) normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features; b) identifying statistically significant associations across the data features across each modality to identify correlations between the modalities and form a correlations network; c) analyzing structures of the correlations network by forming modules; d) performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module; e) partitioning a cohort of subjects into distinct health profiles with corresponding biomarker signatures; f) optionally integrating personal history and, further optionally integrating the longitudinal disease diagnosis data for each subject to strengthen the health profile of each subject; and g) determining a metabolic syndrome risk for each subject based on the health profile of each subject.
[0013] In some embodiments, the disclosure relates to a system for diagnosis of a metabolic syndrome, comprising: a) a normalizer for normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features; b) a concatenate engine for identifying statistically significant associations across the data features across each modality to identify correlations between the modalities to form a correlations network; c) a structure analyzer for analyzing structures of the correlations network by forming modules; d) a graphical analyzer for performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module; e) a clustering module for partitioning a cohort of subjects into clusters of subjects with distinct health profiles with corresponding biomarker signatures; f) an integrator for optionally integrating each subject’s personal history and, further optionally integrating each subject’s longitudinal disease diagnosis data to strengthen the health profile of each subject; and g) a risk assessor for assessing a metabolic syndrome risk for each subject based on the health profile of each subject.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The details of one or more embodiments of the disclosure are set forth in the accompanying drawings/tables and the description below. Other features, objects, and advantages of the disclosure will be apparent from the drawings/tables and detailed description, and from the claims.
[0015] FIGS. 1A-1E depict an outline of how a multimodal health assessment can be performed, in accordance with various embodiments. FIG. 1A shows various modalities of features that can be collected from individuals. In the study, multi-modal data (n = 1,385 features) from 1,253 individuals were collected. FIG. 1B shows that data can be analyzed by performing cross-modality associations on Gaussian transformed features after correcting for age, sex and ancestry. FIG. 1C shows that using the associations, community detection analysis can be performed and modules of densely connected features can be identified. FIG. 1D shows that conditional independence network analysis (also referred to as a Markov Network) could be performed to reduce the correlatedness of features and identify key biomarker features. FIG. 1E shows that individuals could be clustered into distinct groups of health profiles using the identified biomarkers of the disclosure, which clusters were then used to perform disease risk enrichment analysis.
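One common way to obtain Gaussian transformed features of the kind mentioned for the cross-modality association step is a rank-based inverse normal transform. The sketch below is offered only as an assumption about how such a transform might look; the disclosure does not state which transform was used, the function name and the `offset` parameter are hypothetical, and the correction for age, sex and ancestry is omitted.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map values to standard-normal quantiles by rank.

    Tied values receive the average of their ranks. The offset of 0.5 is one
    conventional choice and is an assumption, not a value from the source.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - offset) / n) for r in ranks]


# Hypothetical skewed measurements for one feature.
raw = [0.2, 0.3, 0.25, 9.5, 0.4]
print(inverse_normal_transform(raw))
```

Because only ranks are used, the outlier 9.5 is mapped to the largest normal quantile without dominating the scale, which is the usual motivation for applying such a transform before correlation analysis.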
[0016] FIGS. 2A-2B depict the results of cross-modality correlations for various pairings of modalities, in accordance with various embodiments. FIG. 2A shows the number of significant cross-modality correlations for each pair of modalities. The percentages shown are the proportion of correlations that were significant out of all possible pairwise associations between the modality pair. FIG. 2B shows associations between p-cresol sulfate metabolite and (top) abundance of Intestinimonas genus, and (bottom) an abundance of unclassified genus in Erysipelotrichaceae family.
[0017] FIGS. 3A-3B depict a multi-modal cardiometabolic module used in an analysis of cardiovascular and metabolic disease risk, in accordance with various embodiments. FIG. 3A shows identification of key biomarker features that represent the cardiometabolic module using Markov network analysis. These reduced interactions highlight the most important associations after removing edges corresponding to indirect associations. It was observed that the microbiome genera Butyrivibrio and Pseudoflavonifractor are the most relevant microbiome genera in the context of this module that interfaces with features from other modalities. FIG. 3B shows clustering of individuals using the key biomarkers. The heatmap shows z-statistics from logistic regression for an association between each cluster and each feature. The plot on the left shows the 22 key cardiometabolic biomarkers. The plots on the right show significant associations that emerged from an analysis against the full set of 1,385 features. The first plot begins with the features that had significant associations with multiple clusters, and the remaining plots show features that were significantly associated with only one feature. Some correlated features have been collapsed, with the mean z-statistics displayed; the full set of features can be found in FIG. 6A and FIG. 6B. Met = metabolome.
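The idea of removing edges that correspond to indirect associations can be illustrated with a first-order partial correlation. The sketch below is a simplified stand-in and not the Markov network procedure of the disclosure: it conditions on a single variable, and the simulated chain x → z → y, the noise levels, and the random seed are all assumptions chosen for illustration. For jointly Gaussian variables, a partial correlation of zero corresponds to conditional independence, which is why the x–y edge would be dropped in this example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated chain: x influences z, and z influences y; x and y are linked
# only indirectly through z.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
z = [a + rng.gauss(0, 0.5) for a in x]
y = [b + rng.gauss(0, 0.5) for b in z]

marginal = pearson(x, y)            # strong, but driven entirely by z
conditional = partial_corr(x, y, z)  # close to zero once z is accounted for
print(round(marginal, 3), round(conditional, 3))
```

A network built from marginal correlations would connect x and y directly, whereas a conditional independence network would keep only the x–z and z–y edges.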
[0018] FIGS. 4A-4B depict disease enrichment and longitudinal outcomes of cardiometabolic clusters, in accordance with various embodiments. FIG. 4A shows bar plots showing the prevalence of disease at baseline (combined Discovery and TwinsUK baseline cohorts; FIG. 7A and FIG. 7B show them individually) and the incidence of disease (i.e., only the new cases of disease) after a median of 5.6 years of follow-up (TwinsUK cohort). *p<0.05, **p<0.005. FIG. 4B shows the rates at which individuals from each cluster transition into other clusters after a median of 5.6 years of follow-up. The plot shows individuals per cluster (1 to 7) at baseline visit that transition to other clusters during the follow-up.
[0019] FIGS. 5A-5B depict a multi-modal microbiome richness module used in analysis of diversity in an individual’s gut microbiome, in accordance with various embodiments. FIG. 5A shows the identification of key biomarker features that represent the microbiome richness module using Markov network analysis. FIG. 5B shows the clustering of individuals using the key biomarkers. The heatmap shows z-statistics from logistic regression for an association between each cluster and each feature. The plot on the left shows the 24 key biomarkers representing the module. Met = metabolome.
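The cluster transition rates described for FIG. 4B can be thought of as a row-normalized table of baseline cluster versus follow-up cluster. The following sketch shows one way such rates might be tabulated; the function name and the example cluster assignments are hypothetical and are not data from the study.

```python
from collections import Counter, defaultdict


def transition_rates(baseline, follow_up):
    """Fraction of subjects in each baseline cluster found in each follow-up cluster.

    baseline and follow_up are equal-length sequences of cluster labels,
    one entry per subject, in the same subject order.
    """
    counts = defaultdict(Counter)
    for start, end in zip(baseline, follow_up):
        counts[start][end] += 1
    return {
        start: {end: n / sum(ends.values()) for end, n in sorted(ends.items())}
        for start, ends in sorted(counts.items())
    }


# Hypothetical cluster labels for five subjects at baseline and at follow-up.
baseline_clusters = [1, 1, 2, 2, 2]
follow_up_clusters = [1, 2, 2, 2, 3]
rates = transition_rates(baseline_clusters, follow_up_clusters)
for start, row in rates.items():
    print(start, row)
```

Each row sums to one, so the diagonal entries give the fraction of subjects who stayed in their baseline cluster.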
[0020] FIGS. 6A-6B show further details of the clustering of individuals in the multi-modal cardiometabolic module analysis, in accordance with various embodiments. The heatmap shows the Z-statistics from a logistic regression for an association between each cluster and each feature. The plot on the left shows the 22 key cardiometabolic features. The plots on the right show significant associations that emerged from an analysis against the full set of 1,385 features. The first plot begins with the features that had significant associations with multiple clusters, and the remaining plots show features that were significantly associated with only one feature. The highlighted groups (e.g., Lipid Group 1, Lipid Group 2, etc.) are largely internally redundant and were collapsed in FIG. 3B by plotting their mean Z-statistics value. All features replicated in the TwinsUK validation cohort (with some exceptions).
[0021] FIGS. 7A-7B depict bar plots showing prevalence of disease diagnoses, in accordance with various embodiments. In particular, FIG. 7A and FIG. 7B show the Discovery and TwinsUK cohorts, respectively, at baseline. The combined cohort (*p<0.05, **p<0.005) is shown in FIG. 4A.
[0022] FIGS. 8A-8B show further details of the clustering of individuals in the multi-modal microbiome richness module analysis, in accordance with various embodiments. The heatmap shows the Z-statistics from a logistic regression for an association between each cluster and each feature. The plot on the left shows the 24 key biomarkers. The plots on the right show significant associations that emerged from an analysis against the full set of 1,385 features. The first plot begins with the features that had significant associations with multiple clusters, and the remaining plots show features that were significantly associated with only one feature. All features replicated in the TwinsUK validation cohort (with some exceptions).
[0023] FIG. 9 is an exemplary flowchart showing a method for diagnosing metabolic syndrome in a subject, in accordance with various embodiments.
[0024] FIG. 10 is an illustration of a system for diagnosing metabolic syndrome in subjects, in accordance with various embodiments.
[0025] FIG. 11 is a block diagram that illustrates a computer system, in accordance with various embodiments.
[0026] It is to be understood that the figures are not necessarily drawn to scale, nor are the objects in the figures necessarily drawn to scale in relationship to one another. The figures are depictions that are intended to bring clarity and understanding to various embodiments of apparatuses, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be appreciated that the drawings are not intended to limit the scope of the present teachings in any way.
DETAILED DESCRIPTION
[0027] The disclosure relates to various exemplary embodiments of systems and methods for performing multi-modal assessments of disease risk in patients. The disclosure, however, is not limited to these exemplary embodiments and applications or to the manner in which the exemplary embodiments and applications operate or are described herein. Moreover, the figures may show simplified or partial views, and the dimensions of elements in the figures may be exaggerated or otherwise not in proportion. In addition, as the terms “on,” “attached to,” “connected to,” “coupled to,” or similar words are used herein, one element (e.g., a material, a layer, a substrate, etc.) can be “on,” “attached to,” “connected to,” or “coupled to” another element regardless of whether the one element is directly on, attached to, connected to, or coupled to the other element or there are one or more intervening elements between the one element and the other element. In addition, where reference is made to a list of elements (e.g., elements a, b, c), such reference is intended to include any one of the listed elements by itself, any combination of less than all of the listed elements, and/or a combination of all of the listed elements. Section divisions in the specification are for ease of review only and do not limit any combination of elements discussed.
[0028] Unless otherwise defined, scientific and technical terms used in connection with the present teachings described herein shall have the meanings that are commonly understood by those of ordinary skill in the art. Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular. Generally, nomenclatures utilized in connection with, and techniques of, cell and tissue culture, molecular biology, and protein and oligo- or polynucleotide chemistry and hybridization described herein are those well-known and commonly used in the art. Standard techniques are used, for example, for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to manufacturer’s specifications or as commonly accomplished in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000). The nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein, are those well-known and commonly used in the art.
[0029] The phrase “next generation sequencing” (NGS) refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands of relatively small sequence reads at a time. Some examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. More specifically, the MISEQ, HISEQ and NEXTSEQ Systems of Illumina and the Personal Genome Machine (PGM) and SOLiD Sequencing System of Life Technologies Corp. provide massively parallel sequencing of whole or targeted genomes. The SOLiD System and associated workflows, protocols, chemistries, etc. are described in more detail in PCT Publication No. WO 2006/084132, entitled “Reagents, Methods, and Libraries for Bead-Based Sequencing,” international filing date Feb. 1, 2006, U.S. patent application Ser. No. 12/873,190, entitled “Low-Volume Sequencing System and Method of Use,” filed on Aug. 31, 2010, and U.S. patent application Ser. No. 12/873,132, entitled “Fast-Indexing Filter Wheel and Method of Use,” filed on Aug. 31, 2010, the entirety of each of these applications being incorporated herein by reference thereto.
[0030] As used herein, the phrase “genomic features” can refer to a genome region with some annotated function (e.g., a gene, protein coding sequence, mRNA, tRNA, rRNA, repeat sequence, inverted repeat, miRNA, siRNA, etc.) or a genetic/genomic variant (e.g., single nucleotide polymorphism/variant, insertion/deletion sequence, copy number variation, inversion, etc.) which denotes a single or a grouping of genes (in DNA or RNA) that have undergone changes as referenced against a particular species or sub-populations within a particular species due to mutations, recombination/crossover or genetic drift.
[0031] Genomic variants can be identified using a variety of techniques, including, but not limited to: array-based methods (e.g., DNA microarrays, etc.), real-time/digital/quantitative PCR instrument methods and whole or targeted nucleic acid sequencing systems (e.g., NGS systems, Capillary Electrophoresis systems, etc.). With nucleic acid sequencing, coverage data can be available at single base resolution.
[0032] As used herein, "substantially" means sufficient to work for the intended purpose. The term "substantially" thus allows for minor, insignificant variations from an absolute or perfect state, dimension, measurement, result, or the like such as would be expected by a person of ordinary skill in the field but that do not appreciably affect overall performance. When used with respect to numerical values or parameters or characteristics that can be expressed as numerical values, "substantially" means within ten percent.
[0033] The term "ones" means more than one.
[0034] As used herein, the term “plurality” can be 2, 3, 4, 5, 6, 7, 8, 9, 10, or more.
[0035] The structures, materials, compositions, and methods described herein are intended to be representative examples of the disclosure, and it will be understood that the scope of the disclosure is not limited by the scope of the examples. Those skilled in the art will recognize that the disclosure may be practiced with variations on the disclosed structures, materials, compositions and methods, and such variations are regarded as within the ambit of the disclosure.
Data Collection and Data Features
[0036] For the study supporting the multi-modal systems and methods disclosed herein, data was collected from 1,253 self-assessed healthy individuals in a clinical research and discovery center. Several tools and techniques, referred to as modalities, were used to collect the data. The modalities included whole genome sequencing (WGS), microbiome sequencing, global metabolome, insulin resistance (IR) and glucose intolerance (IGT) laboratory developed tests (Quantose™), whole body and brain magnetic resonance imaging (MRI), dual-energy x-ray absorptiometry (DEXA), computed tomography (CT) scan, routine clinical labs, personal/family history of disease and medication, and vitals/anthropometric measurements. The number of individuals and number of features per modality are summarized in Table 1.
[0037] Table 1: Table shows number of individuals and number of features measured per modality. The body composition features from DEXA and MRI were combined and treated as a separate modality (“Body composition”). MRI = magnetic resonance imaging; DEXA = dual energy X-ray absorptiometry; CT = Computed tomography.
[0038] CT scans were performed on individuals over the age of 35 years. Patients were scanned during a single breath-hold using a 64-slice GE Healthcare EVO Revolution scanner (GE Healthcare, Milwaukee, Wisconsin). Gated axial scans with 2.5 mm slice thickness were performed using a tube energy of 120 kVp and the tube current adjusted for individuals' body mass index. Images were subsequently analyzed using an AW VolumeShare 7 workstation (GE Healthcare, Milwaukee, Wisconsin) and regions of coronary calcification were manually identified in order to compute Coronary Artery Calcium (CAC) Agatston scores. Multi-Ethnic Study of Atherosclerosis (MESA) reference CAC values were used to calculate the percentile of calcification for each individual matched for age, sex and ethnicity.
[0039] For microbiome sequencing, whole genome sequencing was performed on stool samples to analyze the microbial communities. For this modality, the features included species richness, species diversity, the fraction of human DNA, Proteobacteria, and the abundance of 72 genera. Microbiome species richness is defined as the number of species present at a relative abundance greater than 10^-4. Microbiome species diversity is defined as the Shannon entropy of the taxon abundance vector.
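The two microbiome summary features defined above can be illustrated with a minimal sketch. The function names, the default abundance cutoff of 1e-4, the use of the natural logarithm, and the sample values are assumptions made for illustration only; they are not taken from the study.

```python
import math

def species_richness(rel_abundance, threshold=1e-4):
    """Count taxa whose relative abundance exceeds the cutoff (cutoff value is an assumption)."""
    return sum(1 for a in rel_abundance if a > threshold)

def shannon_diversity(rel_abundance):
    """Shannon entropy (natural log, an assumption) of an abundance vector.

    Zero-abundance taxa contribute nothing to the sum.
    """
    total = sum(rel_abundance)
    probs = [a / total for a in rel_abundance if a > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical relative abundances for five taxa (not data from the study).
sample = [0.5, 0.3, 0.19995, 0.00005, 0.0]
richness = species_richness(sample)
diversity = shannon_diversity(sample)
```

Richness ignores how evenly the taxa are distributed, whereas the Shannon entropy is largest when all taxa are equally abundant, so the two features capture different aspects of the same abundance vector.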
[0040] Whole-genome sequencing data was used to compute the following features: polygenic risk scores (PRS) for 51 diseases and traits, HLA-type, 30 known Short Tandem Repeats (STR) disease loci, and known rare pathogenic variants from ClinVar (Set-1 and Set-2 from Shah et al.). Ancestry was also computed from WGS data using the method described in Telenti et al.
Data Normalization
[0041] Several features were correlated with age, sex or ancestry. To remove this bias, we first identified which covariates among age, sex and ancestry (first four principal components) were significantly associated with each feature at a p < 0.01 significance level, using multiple linear regression. Then, the feature values were replaced with the residuals after regressing out the associated covariates.
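The replacement of feature values after regression can be sketched as follows. This is a simplified illustration that assumes a single covariate already found to be associated with the feature; the study used multiple linear regression over up to six covariates, and the significance screening step is not shown. The function name and the data are hypothetical.

```python
def residualize(values, covariate):
    """Return what is left of `values` after an ordinary least-squares fit on one covariate."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Each output is the observed value minus the value predicted from the covariate.
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]

# Hypothetical data: a feature that drifts upward with age.
age = [30.0, 40.0, 50.0, 60.0, 70.0]
feature = [1.1, 2.0, 2.9, 4.2, 4.8]
adjusted = residualize(feature, age)
```

By construction the adjusted values have zero mean and are uncorrelated with the covariate, so any remaining association with another feature cannot be attributed to a linear age effect.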
[0042] To deal with non-Gaussian distributions of various features from several modalities, a rank-based inverse normal transformation was applied. To identify the features with non-Gaussian distributions, a criterion requiring that more than 40% of the samples have the same value was applied. The rank-based inverse normal transformation was also applied to the microbiome abundance data, as these features tended to be non-Gaussian.
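A rank-based inverse normal transformation can be sketched as below. The offset of 0.5 and the use of average ranks for ties are assumptions, since the text does not specify which variant was used; the function name and data are hypothetical.

```python
from statistics import NormalDist

def rank_inverse_normal(values, offset=0.5):
    """Map each value to the standard normal quantile of its average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - offset) / n always lies strictly between 0 and 1 for offset = 0.5.
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]

# Hypothetical skewed feature with repeated zeros.
skewed = [0.0, 0.0, 0.1, 2.5, 40.0]
transformed = rank_inverse_normal(skewed)
```

Because only ranks are used, the extreme value 40.0 ends up no further from the center than the normal quantile of its rank, which limits the influence of outliers on later correlation and regression steps.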
Constructing Multimodal Correlation Modules
[0043] A Spearman correlation and its p-value were calculated for each cross-modality pair of features. The correlation was calculated only if at least 30 samples had data for the pair of features. Statistically significant associations were selected using the Benjamini-Hochberg approach by controlling the false discovery rate at 5%.
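The Benjamini-Hochberg selection step can be sketched as follows. This is a generic illustration of the step-up procedure, not the code used in the study; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking which hypotheses are rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every hypothesis ranked at or below the cutoff is rejected.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from eight pairwise tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(p_values)
```

The procedure is step-up: a p-value that misses its own threshold is still rejected if a larger p-value further down the sorted list meets its threshold.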
[0044] The significant associations were used to construct a network where each feature from each modality is a node, and each significant association defines an edge between two features from different modalities. The weight of an edge is defined as -log(p), where p is the p-value of the corresponding Spearman correlation. The Louvain algorithm was then used to perform community detection. This algorithm was chosen as it is more aggressive and removes more edges compared to art-known algorithms. Each of the “communities” computed by the algorithm was referred to as a module.
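The edge construction can be sketched as follows. The base of the logarithm is not stated, so base 10 is an assumption here; the function name, feature labels, modality assignments, and the within-modality p-value are hypothetical, while the two cross-modality p-values are the ones reported in the Results. Community detection with the Louvain algorithm would be run on the resulting weighted graph by a separate implementation and is not shown.

```python
import math

def build_edges(associations, modality_of):
    """Keep cross-modality pairs only and weight each edge by -log10(p) (base is an assumption)."""
    edges = {}
    for (feat_a, feat_b), p in associations.items():
        if modality_of[feat_a] == modality_of[feat_b]:
            continue  # same-modality pairs are excluded from the correlation network
        edges[frozenset((feat_a, feat_b))] = -math.log10(p)
    return edges

# Illustrative modality labels (assumed for this sketch).
modality_of = {
    "BMI": "body_composition",
    "VAT": "body_composition",
    "liver_fat_pct": "mri",
    "IR_score": "quantose",
}
associations = {
    ("BMI", "liver_fat_pct"): 1.35e-46,
    ("VAT", "IR_score"): 2.09e-44,
    ("BMI", "VAT"): 1e-50,  # hypothetical within-modality pair, dropped below
}
edges = build_edges(associations, modality_of)
```

Taking the negative logarithm turns very small p-values into large positive weights, so the strongest associations pull their endpoints into the same community.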
[0045] Nodes were allowed to belong to multiple modules. This was allowed only when a node was assigned to a module by the community detection algorithm but had more than 20 significant associations in another module (or more associations with another module than it had with its assigned module).
Key Biomarker Selection and Conditional Independence Network Construction
[0046] To perform a deeper analysis of the cardiometabolic and microbiome richness modules, a list of candidate biomarkers was initially selected using eigenvector centrality. More precisely, for the subnetwork corresponding to each of these two modules, all the nodes were ranked according to their eigenvector centrality score. For the cardiometabolic module, the 50 most central features were selected and for the microbiome richness module, the 40 most central features were selected.
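Ranking nodes by eigenvector centrality can be sketched with a simple power iteration. The function name, the toy graph, and the iteration settings are assumptions for illustration; the study's actual implementation is not described.

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Power iteration on a symmetric, non-negative weighted adjacency dict."""
    nodes = sorted(adjacency)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Iterating with A + I (adding the previous score) leaves the leading
        # eigenvector unchanged and avoids oscillation on bipartite-like graphs.
        new = {n: score[n] + sum(w * score[m] for m, w in adjacency[n].items())
               for n in nodes}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score

# Hypothetical star-shaped module: one hub feature linked to three others.
adjacency = {
    "hub": {"a": 1.0, "b": 1.0, "c": 1.0},
    "a": {"hub": 1.0},
    "b": {"hub": 1.0},
    "c": {"hub": 1.0},
}
score = eigenvector_centrality(adjacency)
ranking = sorted(score, key=score.get, reverse=True)
top_two = ranking[:2]  # analogous to keeping the 50 or 40 most central features
```

A node scores highly when it is connected to other high-scoring nodes, which is why this measure favors features that sit at the core of a module rather than features with many weak peripheral links.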
[0047] For each of the two modules, after mean-imputing missing data, these central features were used to construct a sparse network using the GraphLasso method. This method estimates the inverse covariance matrix of the selected features using a lasso penalty to induce sparsity. The resulting network is a conditional independence network (also known as a “Markov network”) in the sense that the absence of an edge between two features implies that they are approximately conditionally independent given the remaining features in the Markov network. Unlike in the cross-modality correlation network, here connections between features from the same modality were allowed and are typically the strongest associations. This method tends to be less sensitive than the pairwise Spearman associations initially computed, and several weak cross-modality associations observed in the general cross-modality association network were not observed in the Markov network.
Clustering Individuals into Distinct Health Profiles
[0048] The Markov network allows the identification of features that were only associated with the rest of the network through other features in their own modality. By dropping such features, a final set of biomarkers was obtained, both for the cardiometabolic module and for the microbiome richness module.
[0049] These key biomarkers were then utilized to cluster the individuals in the cohort. Individuals were selected based on whether they had key biomarkers (for a total of 668 individuals for the cardiometabolic module, and 640 individuals in the microbiome diversity module). The resulting data matrix had each feature scaled to have zero mean and unit variance. The missing values were imputed using softImpute.
[0050] Hierarchical clustering was performed on the set of individuals based on complete linkage and a correlation distance metric, and clusters were extracted from the dendrogram.
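Complete-linkage clustering with a correlation distance can be sketched as follows. The text does not say how clusters were cut from the dendrogram, so stopping at a target number of clusters is an assumption; the function names and the profiles are hypothetical.

```python
def correlation_distance(u, v):
    """1 - Pearson correlation between two profiles (undefined for constant profiles)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return 1.0 - cov / (su * sv)

def complete_linkage(profiles, n_clusters):
    """Merge the closest clusters until n_clusters remain.

    The distance between two clusters is the largest pairwise distance
    between their members (complete linkage).
    """
    n = len(profiles)
    dist = {(i, j): correlation_distance(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}

    def linkage(c1, c2):
        return max(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical biomarker profiles for four individuals.
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.5],
    [3.0, 2.0, 1.0],
    [6.0, 4.5, 2.0],
]
clusters = complete_linkage(profiles, n_clusters=2)
```

With a correlation distance, individuals are grouped by the shape of their biomarker profile rather than by its overall magnitude, so the first two individuals cluster together even though their raw values differ.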
Statistical associations between clusters and other traits
[0051] The rates of disease diagnoses and medication use were compared across the seven cardiometabolic and the seven microbiome richness clusters. A Fisher’s exact test was performed with the fisher.test command in R (using a Monte Carlo simulated p-value with 1E6 replicates) to test for statistical significance after correcting for multiple tests.
[0052] The individuals in each cluster were also compared to all individuals not in that cluster for each of the 1,354 features using a logistic regression with the glm command in R. There was thus a separate analysis performed for cluster 1 vs. everyone else, cluster 2 vs. everyone else, etc. Significant associations were those that survived correction for multiple tests.
Validation Cohort
[0053] For validation of our findings, we utilized 1,083 individuals from a study cohort (referred to here as “TwinsUK”) of largely European ancestry female twins enrolled in the TwinsUK registry, a British national register of adult twins. The cohort included data from WGS, metabolome, microbiome, DEXA, clinical blood labs, and personal history of disease and medication. The data from the modalities was collected from three longitudinal visits over the course of a median of 13 years. To capture a population with adequate sample sizes for the overlapping modalities used in the present study, the analysis was restricted to data from visit 2 (referred to here as “baseline”) and visit 3 (referred to here as “follow-up”). Microbiome samples were only collected at visit 3. To be included in the analysis, phenotyping measurements were required to be collected within 90 days of the metabolome draw for each visit, or within 6 months for microbiome. For the validation of metabolome and microbiome correlations, only one of the twins was used to avoid bias from relatedness, totaling 538 individuals. For the cardiometabolic module analysis, liver fat, Gamma-Glutamyl Transferase (GGT), IGT, IR and glucose were imputed using regularized linear regression with L1 penalty (R glmnet package).
Results
[0054] Data was collected from 1,253 self-assessed healthy adults (median age 53; 63% male) across several modalities (FIG. 1A), including whole-genome sequencing (WGS), microbiome, global metabolome, laboratory-developed tests for insulin resistance and prediabetes (Quantose), magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history. Not all individuals were measured for all modalities (Table 1). For each of the modalities, several data features were measured, totaling 1,385 features across all modalities (see Methods). A majority of the cohort were of European ancestry (71.6%). The remainder were of East Asian (6.4%), Central/South Asian (3.4%), Middle Eastern (0.4%), African (0.3%), and admixture ancestries (18.0%).
[0055] Four main analyses were performed using the collected multimodal data and are summarized in FIG. 1. First, statistically significant associations were identified across the data modalities to find novel correlations (FIG. 1B). Second, the structure of the resulting correlation network was analyzed by forming “modules” (FIG. 1C). Third, an in-depth analysis of selected modules was performed using probabilistic graphical models to identify a “network” of key biomarkers that represents the module (FIG. 1D). Fourth, using the key biomarkers, clustering of individuals was performed to partition the study cohort into distinct health profiles with corresponding biomarker signatures (FIG. 1E). The clusters were further characterized and disease risk was assessed using individuals’ personal history and, when available, longitudinal disease diagnosis data. The main findings were verified using an independent validation dataset consisting of 1,083 females.
Multimodal Correlations and Modules
[0056] Correlations for each cross-modality pair of normalized features were calculated and a list of 11,537 statistically significant associations was selected out of 427,415 total cross-modality comparisons (see Methods). A breakdown of the selected associations per pair of modalities is shown in FIG. 2A. The largest number of significant associations (n=5,570) was observed between metabolome and clinical labs. This is mainly explained by separate measurements of the same or similar metabolites by the two modalities. The second largest number of significant associations (n=2,031) was between the metabolome and microbiome, given the large number of features measured in the two modalities (3% of the possible correlations between the two modalities were significant; FIG. 2A). The third largest number of significant associations (n=1,858; 17%) was between metabolome and body composition.
[0057] The most significant associations, apart from those between metabolome and labs, were expected correlations supporting well-established prior clinical research. Examples include associations between body mass index (BMI) and liver fat percentage (p = 1.35E-46) and between visceral adipose tissue (VAT) and insulin resistance (IR) score (p = 2.09E-44). These correlations highlight the importance of preventative medicine recommendations for reducing BMI and VAT, which are known risk factors for diabetes and other metabolic syndromes. Height and the polygenic risk score (PRS) for height were observed to be significantly correlated (p = 2.32E-44), highlighting the utility of genetics for trait prediction. Other significant genetic associations were observed between PRS of lipid levels (high density lipoprotein, low density lipoprotein, total cholesterol, and triglyceride) and their corresponding lab measurements. In addition, HLA type DRB1*04 was correlated with alpha-1 globulin levels (p = 4.87E-05), and a known short tandem repeat in gene ATN1 associated with Dentatorubro-pallidoluysian atrophy (DRPLA) was correlated with serum urate levels (p = 6.77E-05). Overall, less than one percent of the associations observed with genetic features were significant. Conversely, body composition features (e.g. BMI, VAT, android/gynoid ratio, fat mass and lean mass) had the highest percentages of significant associations with several modalities (FIG. 2A).
[0058] In addition, novel associations between the metabolite p-cresol sulfate (pCS) and microbiome genera were observed, including Intestinimonas and an unclassified genus in the Erysipelotrichaceae family (p = 2.92E-24 and p = 2.98E-20 respectively; FIG. 2B). pCS is a microbial metabolite and potential uremic toxin associated with accelerated cardiovascular disease and renal disease progression. It is a sulfated phenolic compound generated in the colon by bacterial fermentation of tyrosine. pCS was also associated with species diversity (p = 6.54E-19), and several genera (Pseudoflavonifractor, Anaerotruncus, an unclassified genus, Subdoligranulum, and Ruminiclostridium) in the Ruminococcaceae family (p = 9.52E-32, p = 1.39E-23, p = 1.95E-38, p = 9.48E-19, and p = 3.26E-11 respectively). The associations of pCS with species diversity and the Ruminococcaceae family have previously been observed, but not those with Intestinimonas and a genus in the Erysipelotrichaceae family. The associations were validated in an independent TwinsUK cohort (see Methods; Table 2).
[0059] Table 2: The table shows microbiome genera that are associated with the metabolite p-cresol sulfate in both the discovery cohort and the replication cohort.
[0060] The significant associations from all modalities were then used to construct a network, which was used to find highly connected sets of variables that we refer to as modules (see Methods). To avoid having the structure of the correlation network be heavily determined by metabolome and clinical lab associations, we removed all the corresponding edges. The method found two modules with by far the largest number of connections (n>100 each) and numerous smaller modules.
[0061] Several of these modules permit biological interpretations. The largest of them was a cardiometabolic module containing many markers associated with cardiac disease and metabolic syndrome. The second largest module was predominantly made up of microbiome taxa abundance, and metabolites that are known to be biomarkers for diversity in the gut microbiome. This is referred to as the microbiome richness module. Here, a detailed analysis on these two largest modules is presented.
Cardiometabolic Module
[0062] The cardiometabolic module in the association network contained 355 nodes from clinical labs, metabolome, Quantose, CT, microbiome, vitals, genetics, MRI-body and body composition data modalities. The features in this module were ranked by their relative centrality in the module using eigenvector centrality score (see Methods), and several markers associated with obesity, heart disease, and metabolic syndrome were verified. Thus, the module was assigned its name - cardiometabolic module. The most central features for the module were VAT, BMI, liver fat percentage, lean mass percentile, glucose levels, blood pressure, triglycerides levels, IR score, several lipid metabolites, and several microbiome genera, including butyrate-producing bacterial genera such as Pseudoflavonifractor, Butyrivibrio, Intestinimonas, and Faecalibacterium.
Network Analysis for Key Biomarker Selection
[0063] While the module provides a general overview of how these features are interconnected, its construction is based only on pairwise associations. As such, it contains a significant amount of redundancy (e.g., two metabolites from the same pathway are likely to be connected to the same features from other modalities) and transitive edges (i.e., if A and B are associated, and B and C are associated, an association between A and C is likely to be observed). In order to obtain a more meaningful representation of the interaction between the features in the module, the 50 most central features were picked and the inverse covariance matrix was computed. This matrix defines a new network (called the Markov network) on these 50 features with the property that features A and B are only connected if they are correlated conditioned on all other features. The resulting network is shown in FIG. 3A.
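The way an inverse covariance matrix removes transitive edges can be illustrated with a three-variable chain in which A influences B and B influences C. The sketch below is a toy example: it inverts an exact covariance matrix with plain Gauss-Jordan elimination and does not apply the lasso penalty that the GraphLasso method adds for sparsity. The function name and the covariance values are assumptions chosen for illustration.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    # Augment the matrix with the identity; the right half becomes the inverse.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Covariance of a chain A -> B -> C with unit-variance noise added at each step
# (A = e1, B = A + e2, C = B + e3); rows and columns are ordered A, B, C.
covariance = [[1, 1, 1],
              [1, 2, 2],
              [1, 2, 3]]
precision = invert(covariance)
a_c_marginal = covariance[0][2]      # non-zero: A and C are correlated pairwise
a_c_conditional = precision[0][2]    # zero: no A-C edge once B is accounted for
```

In the pairwise view A and C appear associated, but the corresponding entry of the inverse covariance matrix is zero, so the Markov network keeps only the A-B and B-C edges.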
[0064] The Markov network emphasizes the most direct connections in the module. It suggests that (a) microbiome genera Butyrivibrio and Pseudoflavonifractor are “closest” to the remainder of the cardiometabolic module via a lipid metabolite 1-(1-enyl-palmitoyl)-2-oleoyl-GPC (P-16:0/18:1) and serum triglyceride, (b) systolic and diastolic blood pressure are mostly redundant from the point of view of the central variables in the module, as demonstrated by the thickness of the edges, and (c) liver iron and gamma-tocopherol/beta-tocopherol are only associated with the rest of the module through other variables in their respective modalities. These observations allow a determination of a pruned set of 22 key cardiometabolic features (referred to as key biomarkers).
[0065] The key biomarkers included known and expected features for cardiac and metabolic conditions (such as BMI, blood pressure, glucose levels and HDL) but also novel biomarkers (such as several metabolites and microbiome genera) that distinguish susceptibility to disease morbidity (FIG. 3A). High abundance of the microbiome genera Butyrivibrio and Pseudoflavonifractor was well correlated with good cardiometabolic health. Several metabolites stood out as markers for the healthy profiles, such as 1-(1-enyl-palmitoyl)-2-oleoyl-glycero-3-phosphocholine (GPC) and 1-eicosenoyl-GPC, and for the unhealthy profiles, such as glutamate, butyrylcarnitine, lactate, 1-stearoyl-2-dihomo-linolenoyl-GPC, and 1-palmitoleoyl-2-oleoyl-glycerol.
Clustering of Individuals and Characterization
[0066] To assess the relationship between the health status of individuals and these 22 key biomarkers, individuals were stratified using hierarchical clustering. This clustering resulted in seven groups, each with a unique biomarker profile (FIG. 3B). To better characterize the clusters, each cluster was compared to the full set of 1,385 features. 106 features were identified beyond the 22 used to calculate the cardiometabolic clusters that were significantly (logistic regression p < 5.1E-06) enriched in at least one cluster compared to the others (selected features shown in FIG. 3B; all 106 features shown in FIG. 8). Of these, 78 features were also measured in our validation cohort of 1,083 individuals (TwinsUK baseline), corresponding to 135 total associations between different clusters and features; 77.8% of the associations replicated in our validation cohort (p < 3.9E-04). Of the remaining 30 associations, 90.0% at least had directions of effect that were consistent between the cohorts and may show statistical significance in a larger sample size.
[0067] Cluster 1 can be characterized as containing the individuals perceived to be healthiest, with a markedly higher lean mass percentile and low IR score. This cluster is notable for its lower blood pressure, lower butyrylcarnitine levels, and higher HDL. The IR score and lean mass percentile for clusters 2 and 3 were not as healthy as those of cluster 1. In addition, cluster 2 displays the lowest glutamate values, while cluster 3 is characterized by the lowest blood pressure and the highest levels of 3-hydroxybutyrate. Cluster 4 is distinguished by an Impaired Glucose Tolerance (IGT) score that is higher than in the other clusters with healthy individuals and high levels of Apolipoprotein-A (Apo-A) and HDL cholesterol. Cluster 5 contains largely overweight individuals who nonetheless have low IR scores and low IGT. Cluster 6 contains mostly overweight and obese individuals with high android/gynoid ratios and IR scores who were specifically characterized by the highest Apo-B, cholesterol in very low-density lipoprotein and triglycerides of any cluster. Cluster 7 contains the least healthy individuals with respect to the markers in consideration, with a high prevalence of obesity, body fat and insulin resistance.
[0068] In addition to associations with features, rates of cardio-metabolic conditions (i.e., diabetes, hypertension, hypercholesterolemia, heart disease, and stroke) were compared between the clusters. Significant differences between clusters in their rates of diabetes and hypertension diagnoses (Fisher’s exact p = 1.0E-04 and 2.3E-04, respectively) were found. The findings were confirmed in the validation cohort (Fisher’s exact p = 4.3E-04 and < 1.0E-06, respectively) (FIG. 4). Specifically, cluster 7 had significantly higher rates of diabetes, while cluster 1 had significantly lower rates of diabetes and hypertension. There were no significant differences between the clusters in heart disease or strokes, though hypercholesterolemia showed a trend toward group differences in both cohorts that requires further validation (p = 0.03 in the discovery cohort; p = 0.09 in the validation cohort).
[0069] Interestingly, cluster membership was a better predictor of diabetes diagnoses than were the traditional clinical features used to determine diabetes status: glucose, IGT score, and IR score, as well as BMI. Even after accounting for these traditionally predictive features, individuals in cluster 7 were significantly more likely to have diabetes than were members of the other clusters (logistic regression p = 0.01). The enrichment of hypertension diagnosis in cluster 7, however, was explained by blood pressure measurement, as expected.
[0070] The cardiometabolic key biomarkers that were the largest drivers of this association between diabetes and cluster 7 were the IR score, percent lean body mass, and the metabolites 1-stearoyl-2-dihomo-linolenoyl-GPC (18:0/20:3n3 or 6) and 1-(1-enyl-palmitoyl)-2-oleoyl-GPC (P-16:0/18:1). In a multivariable logistic regression containing the 22 cardiometabolic key biomarkers, the above mentioned were the four features that were significantly associated with diabetes status. They were also significant predictors of cluster 7, in addition to liver fat, HDL cholesterol, Pseudoflavonifractor, and the metabolites lactate and 1-eicosenoyl-GPC (20:1).
[0071] Next, features were identified that distinguished those in cluster 7 who did and did not have diabetes. These two groups were compared within cluster 7 for all 1,385 features, and the metabolite citrulline emerged as by far the best predictor of diabetes status, with decreased levels of citrulline being found in diabetes patients (logistic regression p = 5.5E-06). In a multivariable logistic regression model across clusters that incorporated citrulline, IR score, IGT score, glucose, and BMI, cluster 7 retained its significant difference from the other clusters (p = 0.01), but citrulline had by far the best explanatory power for diabetes status (p = 9.0E-09). Indeed, searching throughout the 1,385 features for associations with diabetes in all individuals, beyond cluster 7, citrulline remained by far the best predictor of this disease. This association was validated in the validation cohort (p = 0.002). Citrulline was associated with diabetic individuals taking metformin.
Discussion
[0072] In this study, a total of 1,385 multi-modal features collected from 1,253 individuals were analyzed using several machine learning and statistical approaches to find signatures of health and disease. Association analysis of cross-modality features was performed and 11,537 statistically significant cross-modal feature associations (FDR < 0.05) were found. Using the significant associations, community detection analysis was performed to find modules of densely connected features. The analysis identified a cardiometabolic module and a microbiome richness module. Using Markov network analysis, a set of key biomarkers for each module was identified. These biomarker signatures stratified individuals into clusters, each representing a unique health profile. The prevalence of diseases for each of the profiles was studied, and clusters of individuals who either had very low prevalence of, or were enriched for, cardiac and metabolic conditions were identified. The main findings were replicated in an independent validation cohort of 1,083 females (TwinsUK). Additionally, a longitudinal disease incidence analysis was performed using the TwinsUK cohort to find early disease signatures.
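By way of non-limiting illustration, the selection of significant associations at FDR < 0.05 can be sketched as follows. The sketch assumes the Benjamini-Hochberg step-up procedure, which this paragraph does not name; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Assumption: Benjamini-Hochberg step-up procedure (not stated in the text).
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below max_k.
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten cross-modal feature pairs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, alpha=0.05)
```

In this toy input only the two smallest p-values fall under their rank-scaled thresholds, so only those two pairs would be retained as edges.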
[0073] Novel significant associations between p-cresol sulfate (pCS) and the microbiome genera Intestinimonas and an unclassified genus in the Erysipelotrichaceae family were identified. pCS is a microbial metabolite and is often considered to be a uremic toxin. It is produced by bacteria fermenting undigested dietary proteins that escape absorption in the small bowel. It appears to be elevated in the sera of chronic kidney disease (CKD) patients, and it is associated with increased mortality in patients with CKD and an increased risk of cardiovascular events. Intestinimonas is known for species that produce butyrate by digesting lysine and fructoselysine in the human gut, but is otherwise not well described. Members of the Erysipelotrichaceae family might be immunogenic and can potentially flourish after treatment with broad-spectrum antibiotics. An increased abundance of Erysipelotrichaceae has been observed in obese individuals, and several other lines of evidence suggest its role in lipid metabolism. These novel associations were validated in TwinsUK and could be analyzed further for therapeutic targets to decrease pCS levels and toxicity.
[0074] A cardiometabolic module with key biomarkers consisting of novel features in addition to the traditional clinical features from several modalities was identified. The potentially novel biomarkers included abundance of the microbiome genera Butyrivibrio and Pseudoflavonifractor and several metabolites, such as 1-(1-enyl-palmitoyl)-2-oleoyl-GPC, 1-eicosenoyl-GPC, glutamate, and 1-stearoyl-2-dihomo-linolenoyl-GPC. Clustering of individuals using the key biomarkers revealed signatures of disease states. In particular, a profile for healthy individuals was identified, which is consistent with very low prevalence of diabetes, hypertension, and obesity (Cluster 1), as well as a profile for individuals displaying comorbidity for diabetes (Cluster 7). The cluster membership for individuals was a better predictor of diabetes than the traditional clinical biomarkers such as glucose, BMI and insulin resistance. The novel biomarkers in the diabetes signature included 1-stearoyl-2-dihomo-linolenoyl-GPC and 1-(1-enyl-palmitoyl)-2-oleoyl-GPC. Longitudinal disease outcome analysis using follow-up TwinsUK data found an early disease signature for hypertension (Cluster 6). It was observed that more than half of the individuals from cluster 6 transitioned to cluster 7 (the unhealthiest cluster) by the follow-up visit, indicative of an early precursor to a poor health outcome. These signatures can be used to prioritize individuals for intervention. Analysis of the microbiome richness module revealed a xenobiotics metabolite, cinnamoylglycine, as a potential biomarker for health associated with microbiome species richness and lean mass percentage. Cinnamoylglycine was observed to be abundant in individuals in cluster 1, which represents healthy individuals.
[0075] Overall, a substantial number of significant findings were not observed for genetic features. This could be explained by a combination of reasons: the analyses focus on the main findings from unsupervised pattern detection, in which an overwhelming signal from other functional measurements dampens signals from genetics; few rare diseases were observed in this comparatively small cohort, which particularly limits associations with short tandem repeats and rare variants; and the PRS for several traits explain only a small fraction of the variance, leaving insufficient power to detect associations for all traits.
[0076] In recent years, several organizations have begun gathering cohorts with high throughput data from multiple modalities. The collection of such datasets on large cohorts is necessary in systems medicine to gain comprehensive insights into an individual’s health status and to understand disease mechanisms. A systematic approach to analyzing an individual’s deep phenotype data is important for precision medicine screening. However, it is also crucial to perform unsupervised multimodal data analyses, as described here, to sift through this wealth of information for novel findings of signatures of health and disease, thus transitioning towards personalized, preventative health risk assessments.
[0077] Analysis of 1,385 data features from diverse modalities, including metabolome, microbiome, genetics and advanced imaging, from 1,253 individuals and from a longitudinal validation cohort of 1,083 individuals was conducted using a rigorous, multimodal approach. Unsupervised machine learning analyses were performed to find biomarker signatures of health and disease risk. Novel associations of a potential uremic toxin, p-cresol sulfate, with the microbiome genera Intestinimonas and an unclassified genus in the Erysipelotrichaceae family were found. A network analysis and clustering of individuals revealed signatures of disease states, particularly for diabetes, hypertension and obesity. Cluster membership for individuals was a better predictor of diabetes than the traditional clinical biomarkers such as glucose, BMI and insulin resistance. The novel biomarkers in the diabetes signature included 1-stearoyl-2-dihomo-linolenoyl-GPC and 1-(1-enyl-palmitoyl)-2-oleoyl-GPC. An early disease signature for hypertension was identified, as were individuals at risk for a poor health outcome. A xenobiotics metabolite, cinnamoylglycine, was found as a potential biomarker for health associated with microbiome species richness and lean mass percentage. Through multimodal data integration, novel associations were identified, along with biomarker signatures that stratify individuals into distinct disease subtypes, including early disease states; this is an essential step towards personalized, preventative health risk assessment.
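By way of non-limiting illustration, the movement of individuals between clusters from the baseline visit to the follow-up visit can be tabulated as shown in the following sketch. The cluster labels for the eight subjects are hypothetical and are not data from the study.

```python
from collections import Counter


def transition_fractions(baseline, follow_up):
    """Map each baseline cluster to {follow-up cluster: fraction of its members}."""
    counts = Counter(zip(baseline, follow_up))  # (from, to) -> number of subjects
    totals = Counter(baseline)                  # from -> number of subjects
    fractions = {}
    for (src, dst), n in counts.items():
        fractions.setdefault(src, {})[dst] = n / totals[src]
    return fractions


# Hypothetical cluster labels for eight subjects at two visits.
baseline = [6, 6, 6, 6, 1, 1, 7, 7]
follow_up = [7, 7, 7, 6, 1, 1, 7, 7]
fractions = transition_fractions(baseline, follow_up)
```

Here `fractions[6][7]` gives the share of baseline cluster 6 members found in cluster 7 at follow-up, which is the quantity described above as exceeding one half.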
[0078] FIG. 9 is an exemplary flowchart showing a method for diagnosing metabolic syndrome in a subject, in accordance with various embodiments. As depicted herein, method 900 details an exemplary method for determining metabolic syndrome risk for an individual, in accordance with various embodiments. In step 902, heterogeneous data features are derived from a plurality of modalities, wherein each modality comprises a plurality of data features. In various embodiments, the modalities include, but are not limited to: whole-genome sequencing (WGS), microbiome, global metabolome, laboratory analysis, magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history.
[0079] In step 904, statistically significant associations are identified across the data features across each modality to identify correlations between the modalities and form a correlations network.
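By way of non-limiting illustration, step 904 can be sketched as a search for strongly correlated feature pairs whose members come from different modalities. The sketch assumes Spearman rank correlation and a fixed cutoff on its absolute value as a stand-in for a significance test; the feature names, values, modality labels and the cutoff `min_abs_rho` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cross_modal_edges(features, modality_of, min_abs_rho=0.8):
    """Return (feature_a, feature_b, rho) for strongly correlated cross-modal pairs."""
    edges = []
    for a, b in combinations(sorted(features), 2):
        if modality_of[a] == modality_of[b]:
            continue  # only pairs spanning two modalities form network edges
        rho = spearman(features[a], features[b])
        if abs(rho) >= min_abs_rho:
            edges.append((a, b, rho))
    return edges


# Hypothetical measurements for six subjects.
features = {
    "glucose": [5.1, 5.6, 6.9, 4.8, 7.4, 5.0],
    "hdl": [1.6, 1.4, 1.0, 1.7, 0.9, 1.3],
    "liver_fat": [3.0, 4.5, 9.0, 2.5, 11.0, 2.8],
    "genus_abundance": [0.30, 0.22, 0.05, 0.35, 0.02, 0.31],
}
modality_of = {
    "glucose": "lab",
    "hdl": "lab",
    "liver_fat": "imaging",
    "genus_abundance": "microbiome",
}
edges = cross_modal_edges(features, modality_of)
```

The pair of glucose and HDL is skipped because both belong to the same hypothetical modality, so the resulting network contains cross-modal edges only.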
[0080] In step 906, structures of the correlations network are analyzed by forming modules.
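By way of non-limiting illustration, the grouping of network features into modules in step 906 can be sketched in its simplest form as follows. The sketch uses connected components of the correlations network, which is a deliberately cruder grouping than a community detection analysis of densely connected features; the edge list is hypothetical.

```python
from collections import defaultdict


def connected_modules(edges):
    """Group features into modules, one module per connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    modules = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, module = [start], set()
        while stack:
            node = stack.pop()
            if node in module:
                continue
            module.add(node)
            stack.extend(adjacency[node] - module)
        seen |= module
        modules.append(sorted(module))
    return modules


# Hypothetical significant associations between features.
edges = [
    ("glucose", "liver_fat"),
    ("liver_fat", "ir_score"),
    ("ir_score", "glucose"),
    ("species_richness", "cinnamoylglycine"),
]
modules = connected_modules(edges)
```

A community detection method would additionally split a single large component into several densely connected modules, which this simplification cannot do.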
[0081] In step 908, an in-depth analysis of selected modules using probabilistic graphical models is performed to identify a network of key biomarkers that represent the module.
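By way of non-limiting illustration, the idea behind a Markov network of biomarkers can be sketched with partial correlation: in a Gaussian Markov network, two features are joined by an edge only if they remain correlated after conditioning on the other features. The sketch handles the three-variable case only, and the correlation values are hypothetical.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical case 1: x and y are correlated only through z,
# so the x-y edge would be absent from the Markov network.
indirect = partial_correlation(r_xy=0.48, r_xz=0.8, r_yz=0.6)

# Hypothetical case 2: x and y stay correlated given z, so the edge is kept.
direct = partial_correlation(r_xy=0.7, r_xz=0.2, r_yz=0.1)
```

Features that retain many such direct edges within a module are natural candidates for key biomarkers, since they are not explained away by the other features.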
[0082] In step 910, a cohort of subjects is partitioned into distinct health profiles with corresponding biomarker signatures. In various embodiments, the cohorts are clustered in order to partition them into distinct health profiles.
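By way of non-limiting illustration, the partitioning of subjects in step 910 can be sketched with k-means clustering on the key biomarker values. The choice of k-means, the two biomarkers, the standardized values and the starting centers are all assumptions made for the sketch and are not stated in this paragraph.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; `centroids` supplies the starting centers."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each subject joins its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
        # Update step: each center moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical standardized values of two key biomarkers
# (IR score, percent lean body mass) for six subjects.
points = [
    (-1.0, 1.1), (-0.8, 0.9), (-1.2, 1.0),
    (1.1, -0.9), (0.9, -1.2), (1.0, -1.0),
]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

Each resulting label corresponds to one health profile, and the center of each cluster summarizes the biomarker signature of that profile.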
[0083] In step 912, personal history and longitudinal disease diagnosis data is optionally integrated for each subject in the cohort of subjects to strengthen the health profile of each subject.
[0084] In step 914, a metabolic syndrome risk is determined for each subject based on the health profile associated with each subject.
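By way of non-limiting illustration, one simple way to express the risk associated with a health profile in step 914 is the observed prevalence of a condition within each cluster. This is an assumption made for the sketch rather than a stated risk model, and the labels and diagnoses are hypothetical.

```python
def cluster_prevalence(labels, has_condition):
    """Return {cluster label: fraction of its members with the condition}."""
    totals, cases = {}, {}
    for label, flag in zip(labels, has_condition):
        totals[label] = totals.get(label, 0) + 1
        cases[label] = cases.get(label, 0) + (1 if flag else 0)
    return {label: cases[label] / totals[label] for label in totals}


# Hypothetical cluster assignments and diagnoses for eight subjects.
labels = [1, 1, 1, 1, 7, 7, 7, 7]
has_condition = [False, False, False, False, True, True, True, False]
prevalence = cluster_prevalence(labels, has_condition)

# A subject newly assigned to cluster 7 inherits that cluster's prevalence
# as a rough risk estimate.
risk_for_new_subject = prevalence[7]
```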
[0085] FIG. 10 is an illustration of a system for diagnosing metabolic syndrome in subjects, in accordance with various embodiments. As shown herein, system 1000 is comprised of a computing device/server 1004 that is in communications with a plurality of different modalities of data sources 1002.
[0086] The computing device/server 1004 can be configured to host a normalizer 1006, a concatenate engine 1008, a structure analyzer 1010, a graphical analyzer 1012, a clustering module 1014, an integrator 1016 and a risk assessor 1018.
[0087] The normalizer 1006 can be configured to normalize heterogeneous data features derived from the plurality of different modalities. In various embodiments, each modality comprises a plurality of data features. In various embodiments, the modalities include, but are not limited to: whole-genome sequencing (WGS), microbiome, global metabolome, laboratory analysis, magnetic resonance imaging (MRI), computed tomography (CT) scan, routine lab work, vitals and personal/family medical history.
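By way of non-limiting illustration, the normalizer can be sketched as a per-feature standardization that places features measured in different units on a common scale. The use of z-scores and the handling of missing values as `None` are assumptions for the sketch; the glucose values are hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize one feature; None marks a missing measurement and is kept."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


# Hypothetical fasting glucose values (mg/dL) with one missing measurement.
glucose_mg_dl = [90.0, 100.0, None, 110.0]
z = z_score(glucose_mg_dl)
```

After this step a metabolite abundance, an imaging measurement and a laboratory value can be compared directly, since each is expressed in standard deviations from its own mean.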
[0088] The concatenate engine 1008 can be configured to identify statistically significant associations across the data features across each modality to identify correlations between the modalities to form a correlations network. The structure analyzer 1010 can be configured to analyze structures of the correlations network by forming modules. The graphical analyzer 1012 can be configured to perform in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module. The clustering module 1014 can be configured to partition a cohort of subjects into clusters of subjects with distinct health profiles with corresponding biomarker signatures. In various embodiments, the cohorts are clustered in order to partition them into distinct health profiles.
[0089] The integrator 1016 can be configured to optionally integrate each subject’s personal history and, further optionally, integrate each subject’s longitudinal disease diagnosis data to strengthen the health profile of each subject. The risk assessor 1018 can be configured to assess a metabolic syndrome risk for each subject based on the health profile of each subject and send that assessment to a display 1020 that is communicatively connected with the computing device/server 1004.
Computer Implemented System
[0090] FIG. 11 is a block diagram that illustrates a computer system 1100, upon which embodiments of the present teachings may be implemented. In various embodiments of the present teachings, computer system 1100 can include a bus 1102 or other communication mechanism for communicating information, and a processor 1104 coupled with bus 1102 for processing information. In various embodiments, computer system 1100 can also include a memory, which can be a random access memory (RAM) 1106 or other dynamic storage device, coupled to bus 1102 for determining instructions to be executed by processor 1104. Memory also can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1104. In various embodiments, computer system 1100 can further include a read-only memory (ROM) 1108 or other static storage device coupled to bus 1102 for storing static information and instructions for processor 1104. A storage device 1110, such as a magnetic disk or optical disk, can be provided and coupled to bus 1102 for storing information and instructions.
[0091] In various embodiments, computer system 1100 can be coupled via bus 1102 to a display 1112, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 1114, including alphanumeric and other keys, can be coupled to bus 1102 for communicating information and command selections to processor 1104. Another type of user input device is a cursor control 1116, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1104 and for controlling cursor movement on display 1112. This input device 1114 typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), that allows the device to specify positions in a plane. However, it should be understood that input devices 1114 allowing for three-dimensional (x, y, and z) cursor movement are also contemplated herein.
[0092] Consistent with certain implementations of the present teachings, results can be provided by computer system 1100 in response to processor 1104 executing one or more sequences of one or more instructions contained in memory 1106. Such instructions can be read into memory 1106 from another computer-readable medium or computer-readable storage medium, such as storage device 1110. Execution of the sequences of instructions contained in memory 1106 can cause processor 1104 to perform the processes described herein. Alternatively, hard-wired circuitry can be used in place of or in combination with software instructions to implement the present teachings. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.
[0093] The term “computer-readable medium” (e.g., data store, data storage, etc.) or “computer-readable storage medium” as used herein refers to any medium that participates in providing instructions to processor 1104 for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media can include, but are not limited to, optical, solid state, and magnetic disks, such as storage device 1110. Examples of volatile media can include, but are not limited to, dynamic memory, such as memory 1106. Examples of transmission media can include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 1102.
[0094] Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
[0095] In addition to a computer-readable medium, instructions or data can be provided as signals on transmission media included in a communications apparatus or system to provide sequences of one or more instructions to processor 1104 of computer system 1100 for execution. For example, a communication apparatus may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the disclosure herein. Representative examples of data communications transmission connections can include, but are not limited to, telephone modem connections, wide area networks (WAN), local area networks (LAN), infrared data connections, NFC connections, etc.
[0096] It should be appreciated that the methodologies described herein such as flow charts, diagrams, and the accompanying disclosure can be implemented using computer system 1100 as a standalone device or on a distributed network of shared computer processing resources such as a cloud computing network.
[0097] The methodologies described herein may be implemented by various means depending upon the application. For example, these methodologies may be implemented in hardware, firmware, software, or any combination thereof. For a hardware implementation, the processing unit may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro controllers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
[0098] In various embodiments, the methods of the present teachings may be implemented as firmware and/or a software program and applications written in conventional programming languages such as C, C++, Python, etc. If implemented as firmware and/or software, the embodiments described herein can be implemented on a non-transitory computer-readable medium in which a program is stored for causing a computer to perform the methods described above. It should be understood that the various engines described herein can be provided on a computer system, such as computer system 1100 of FIG. 11, whereby processor 1104 would execute the analyses and determinations provided by these engines, subject to instructions provided by any one of, or a combination of, memory components 1106/1108/1110 and user input provided via input device 1114.
[0099] While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those of skill in the art.
[00100] Further, in describing various embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments.
[00101] The embodiments described herein can be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a network.
[00102] It should also be understood that the embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing.
[00103] Any of the operations that form part of the embodiments described herein are useful machine operations. The embodiments described herein also relate to a device or an apparatus for performing these operations. The systems and methods described herein can be specially constructed for the required purposes, or they may be implemented on a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
[00104] Certain embodiments can also be embodied as computer-readable code on a computer-readable medium. The computer-readable medium is any data storage device that can store data, which can thereafter be read by a computer system. Examples of the computer-readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical, FLASH memory and non-optical data storage devices. The computer-readable medium can also be distributed over a network coupled to computer systems so that the computer-readable code is stored and executed in a distributed fashion.
[00105] From the foregoing description, one skilled in the art can easily ascertain the essential characteristics of the systems and methods and, without departing from the spirit and scope thereof, can make various changes and modifications to adapt them to various usages and conditions.
[00106] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, suitable methods and materials are described in the foregoing paragraphs. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting. In case of conflict, the present specification, including definitions, will control.
[00107] All United States patents and published or unpublished United States patent applications cited herein are incorporated by reference. All published foreign patents and patent applications cited herein are hereby incorporated by reference. All published references, documents, manuscripts, and scientific literature cited herein are hereby incorporated by reference. All identifiers and accession numbers pertaining to scientific databases referenced herein (e.g., PubMed, NCBI) are hereby incorporated by reference.

Claims

What is Claimed:
1. A method for diagnosing metabolic syndrome in a subject, comprising,
a) normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features;
b) identifying statistically significant associations across the data features across each modality to identify correlations between the modalities and form a correlations network;
c) analyzing structures of the correlations network by forming modules;
d) performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module;
e) partitioning a cohort of subjects into distinct health profiles with corresponding biomarker signatures;
f) optionally integrating personal history and, further optionally integrating the longitudinal disease diagnosis data for each subject in the cohort of subjects to strengthen the health profile of each subject; and
g) determining a metabolic syndrome risk for each subject based on the health profile associated with each subject.
2. The method of claim 1, wherein the modalities include whole-genome sequencing (WGS).
3. The method of claim 2, wherein the modalities include microbiome characterization.
4. The method of claim 3, wherein the modalities include global metabolome characterization.
5. The method of claim 4, wherein the modalities include routine laboratory analysis.
6. The method of claim 5, wherein the modalities include magnetic resonance imaging (MRI) or computed tomography (CT) scans.
7. The method of claim 6, wherein the modalities include vitals and personal/family medical history.
8. The method of claim 1, wherein the modality includes laboratory analysis comprising lab-developed tests for insulin resistance and prediabetes.
9. The method of claim 1, further comprising an amalgamation of machine learning assisted analyses including:
identifying significant cross-modality associations;
forming modules of densely connected features to identify key biomarkers;
clustering each subject into distinct health risk groups with corresponding biomarker signatures; and
enriching longitudinal outcomes of each subject within each risk group.
10. The method of claim 1, wherein the cohort of subjects are clustered in order to partition them into distinct health profiles.
11. A computer readable medium comprising computer-executable instructions, which, when executed by a processor, cause the processor to carry out a method or a set of steps for diagnosing a metabolic syndrome in a subject, the method or the set of steps comprising,
a) normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features;
b) identifying statistically significant associations across the data features across each modality to identify correlations between the modalities and form a correlations network;
c) analyzing structures of the correlations network by forming modules;
d) performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module;
e) partitioning a cohort of subjects into distinct health profiles with corresponding biomarker signatures;
f) optionally integrating personal history and, further optionally integrating the longitudinal disease diagnosis data for each subject in the cohort of subjects to strengthen the health profile of each subject; and
g) determining a metabolic syndrome risk for each subject based on the health profile associated with each subject.
12. The computer readable medium of claim 11, wherein the modalities include whole-genome sequencing (WGS).
13. The computer readable medium of claim 12, wherein the modalities include microbiome characterization.
14. The computer readable medium of claim 13, wherein the modalities include global metabolome characterization.
15. The computer readable medium of claim 14, wherein the modalities include routine laboratory analysis.
16. The computer readable medium of claim 15, wherein the modalities include magnetic resonance imaging (MRI) or computed tomography (CT) scans.
17. The computer readable medium of claim 16, wherein the modalities include vitals and personal/family medical history.
18. The computer readable medium of claim 11, wherein the modality includes laboratory analysis comprising lab-developed tests for insulin resistance and prediabetes.
19. The method of claim 11, further comprising an amalgamation of machine learning assisted analyses including:
identifying significant cross-modality associations;
forming modules of densely connected features to identify key biomarkers;
clustering each subject into distinct health risk groups with corresponding biomarker signatures; and
enriching longitudinal outcomes of each subject within each risk group.
20. The method of claim 11, wherein the cohort of subjects are clustered in order to partition them into distinct health profiles.
21. A system for diagnosis of a metabolic syndrome, comprising:
a) a normalizer for normalizing heterogeneous data features derived from a plurality of modalities, wherein each modality comprises a plurality of data features;
b) a concatenation engine for identifying statistically significant associations across the data features across each modality to identify correlations between the modalities to form a correlations network;
c) a structure analyzer for analyzing structures of the correlations network by forming modules;
d) a graphical analyzer for performing in-depth analysis of selected modules using probabilistic graphical models to identify a network of key biomarkers that represents the module;
e) a clustering module for partitioning a cohort of subjects into clusters of subjects each with distinct health profiles with corresponding biomarker signatures;
f) an integrator for optionally integrating each subject’s personal history and, further optionally integrating each subject’s longitudinal disease diagnosis data to strengthen the health profile of each subject; and
g) a risk assessor for determining a metabolic syndrome risk for each subject based on the health profile of each subject.
22. The system of claim 21, wherein the modalities include whole-genome sequencing (WGS).
23. The system of claim 22, wherein the modalities include microbiome characterization.
24. The system of claim 23, wherein the modalities include global metabolome characterization.
25. The system of claim 24, wherein the modalities include routine laboratory analysis.
26. The system of claim 25, wherein the modalities include magnetic resonance imaging (MRI) or computed tomography (CT) scans.
27. The system of claim 26, wherein the modalities include vitals and personal/family medical history.
28. The system of claim 21, wherein the modality includes laboratory analysis comprising lab-developed tests for insulin resistance and prediabetes.
29. The system of claim 21, further comprising an amalgamation of machine learning assisted analyses including:
identifying significant cross-modality associations;
forming modules of densely connected features to identify key biomarkers;
clustering each subject into distinct health risk groups with corresponding biomarker signatures; and
enriching longitudinal outcomes of each subject within each risk group.
PCT/US2019/051193 2018-09-13 2019-09-13 Multimodal signatures and use thereof in the diagnosis and prognosis of diseases WO2020056389A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862731043P 2018-09-13 2018-09-13
US62/731,043 2018-09-13

Publications (1)

Publication Number Publication Date
WO2020056389A1 (en) 2020-03-19

Family

ID=68073214

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/051193 WO2020056389A1 (en) 2018-09-13 2019-09-13 Multimodal signatures and use thereof in the diagnosis and prognosis of diseases

Country Status (1)

Country Link
WO (1) WO2020056389A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006084132A2 (en) 2005-02-01 2006-08-10 Agencourt Bioscience Corp. Reagents, methods, and libraries for bead-based sequencing
WO2017214068A1 (en) * 2016-06-05 2017-12-14 Berg Llc Systems and methods for patient stratification and identification of potential biomarkers

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HAO DING: "VISUALIZATION AND INTEGRATIVE ANALYSIS OF CANCER MULTI-OMICS DATA", 1 January 2016 (2016-01-01), XP055506046, Retrieved from the Internet <URL:https://etd.ohiolink.edu/|etd.send_file?accession=osu1467843712&disposition=inline> [retrieved on 20180911] *
KILEY SCHMIDT GRAIM: "Learning from new perspectives: Using sparse data and multiple views to predict cancer progression and treatment Publication Date", 1 January 2016 (2016-01-01), XP055647147, Retrieved from the Internet <URL:https://escholarship.org/content/qt8fg3r15b/qt8fg3r15b.pdf> [retrieved on 20191127] *
MARINKA ZITNIK ET AL: "Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 June 2018 (2018-06-30), XP081239217 *
SAMBROOK ET AL.: "Molecular Cloning: A Laboratory Manual", 2000, COLD SPRING HARBOR LABORATORY PRESS
SOKOLOVSKA NATALIYA ET AL: "Deep Self-Organising Maps for efficient heterogeneous biomedical signatures extraction", 2016 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), IEEE, 24 July 2016 (2016-07-24), pages 5079 - 5086, XP032992835, DOI: 10.1109/IJCNN.2016.7727869 *
VLADIMIR GLIGORIJEVIĆ: "Methods for Analysis and Integration of Heterogeneous Network Data", 1 June 2017 (2017-06-01), XP055647144, Retrieved from the Internet <URL:https://spiral.imperial.ac.uk/handle/10044/1/65802> [retrieved on 20191127] *

Similar Documents

Publication Publication Date Title
Jamshidi et al. Evaluation of cell-free DNA approaches for multi-cancer early detection
Schüssler-Fiorenza Rose et al. A longitudinal big data approach for precision health
US20200027557A1 (en) Multimodal modeling systems and methods for predicting and managing dementia risk for individuals
Beesley et al. The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities
DK2183693T5 (en) Diagnosis of fetal chromosomal aneuploidy using genome sequencing
Ritchie et al. A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets
WO2020198068A1 (en) Systems and methods for deriving and optimizing classifiers from multiple datasets
JP2013505730A (en) System and method for classifying patients
Konwar et al. Considerations when processing and interpreting genomics data of the placenta
Baron et al. Utilization of lymphoblastoid cell lines as a system for the molecular modeling of autism
JP2003021630A (en) Method of providing clinical diagnosing service
Verma et al. Current scope and challenges in phenome-wide association studies
WO2013063139A1 (en) Selection of preferred sample handling and processing protocol for identification of disease biomarkers and sample quality assessment
US20210102262A1 (en) Systems and methods for diagnosing a disease condition using on-target and off-target sequencing data
EP3126846A1 (en) Biomarkers and methods for measuring and monitoring juvenile idiopathic arthritis activity
WO2015191613A1 (en) Biomarkers and methods for measuring and monitoring axial spondyloarthritis disease activity
Evans et al. Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets
De Grandi et al. Highly Elevated Plasma γ‐Glutamyltransferase Elevations: A Trait Caused by γ‐Glutamyltransferase 1 Transmembrane Mutations
WO2020056389A1 (en) Multimodal signatures and use thereof in the diagnosis and prognosis of diseases
Schniering et al. Resolving phenotypic and prognostic differences in interstitial lung disease related to systemic sclerosis by computed tomography-based radiomics
Lauria Rank-based miRNA signatures for early cancer detection
Rosati et al. Differential gene expression analysis pipelines and bioinformatic tools for the identification of specific biomarkers: A Review
Shomorony et al. Unsupervised integration of multimodal dataset identifies novel signatures of health and disease
US20230005569A1 (en) Chromosomal and Sub-Chromosomal Copy Number Variation Detection
Yang et al. A machine learning model to characterize chronic kidney disease with metabolomics data

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19779290; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19779290; Country of ref document: EP; Kind code of ref document: A1)