WO2020242976A1 - Methods for diagnosis of polygenic diseases and phenotypes from genetic variation - Google Patents

Methods for diagnosis of polygenic diseases and phenotypes from genetic variation

Info

Publication number
WO2020242976A1
WO2020242976A1 PCT/US2020/034303 US2020034303W WO2020242976A1 WO 2020242976 A1 WO2020242976 A1 WO 2020242976A1 US 2020034303 W US2020034303 W US 2020034303W WO 2020242976 A1 WO2020242976 A1 WO 2020242976A1
Authority
WO
WIPO (PCT)
Prior art keywords
disease
individual
correlates
ptv
risk
Prior art date
Application number
PCT/US2020/034303
Other languages
French (fr)
Inventor
Jonathan PRITCHARD
Manuel A. RIVAS
Nasa SINNOTT
Yosuke TANIGAWA
Original Assignee
The Board Of Trustees Of The Leland Stanford Junior University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Board Of Trustees Of The Leland Stanford Junior University
Publication of WO2020242976A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Definitions

  • Methods, systems, and devices, including computer programs encoded on a computer storage medium, are provided for predicting the risk of an individual developing a polygenic disease or medically relevant trait. Most common diseases are caused by dysregulation of multiple genes.
  • a predictive model is provided that estimates the risk of developing a disease or medically relevant condition by analyzing polygenic contributions to the disease and underlying changes in physical traits and clinically measured biomarkers.
  • a method of predicting the risk of an individual developing a polygenic disease or medically relevant trait comprising: a) providing a database comprising correlation data for associations between genetic variants and the disease or medically relevant trait based on genome-wide testing of a population for genetic variants associated with the disease or the medically relevant trait; b) genotyping the individual to determine if the individual has one or more of the genetic variants associated with the disease or the medically relevant phenotypic trait; c) calculating at least one polygenic risk score based on the genetic variants detected in the individual by genotyping, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait.
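Step (c) can be illustrated with the conventional additive form of a polygenic risk score, in which each detected variant contributes its effect weight multiplied by the number of effect alleles the individual carries. The following is a minimal sketch under that assumption, not the applicants' implementation; the function name `polygenic_risk_score`, the variant identifiers, and the weights are hypothetical.

```python
def polygenic_risk_score(weights, dosages):
    """Additive score: sum over variants of (effect weight x allele dosage).

    weights: dict variant_id -> per-allele effect size (hypothetical values)
    dosages: dict variant_id -> copies of the effect allele (0, 1, or 2)
    Variants absent from the genotype are treated as dosage 0 (an assumption).
    """
    return sum(beta * dosages.get(variant, 0) for variant, beta in weights.items())


# Hypothetical effect weights, as might be read from a correlation database,
# and one individual's genotype calls.
example_weights = {"variant_a": 0.12, "variant_b": -0.30, "variant_c": 0.05}
example_genotype = {"variant_a": 2, "variant_b": 1}

score = polygenic_risk_score(example_weights, example_genotype)
print(round(score, 2))
```

In practice the weights would come from the genome-wide association data described in step (a), and a score is usually interpreted relative to the distribution of scores in a reference population rather than in isolation.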
  • PRS polygenic risk score
  • the genetic variants are selected from the group consisting of protein-truncating variants (PTVs), protein-altering variants, non-coding variants, human leukocyte antigen (HLA) allelotypes, and copy number variations (CNVs).
  • PTVs protein-truncating variants
  • HLA human leukocyte antigen allelotypes
  • CNVs copy number variations
  • the individual has at least one protein truncating variant (PTV), copy number variation (CNV), or human leukocyte antigen (HLA) allele that correlates with a size-effect change in a measurement of at least one clinical biomarker in the individual compared to that of the clinical biomarker in a control subject having a wild-type allele.
  • PTV protein truncating variant
  • CNV copy number variation
  • HLA human leukocyte antigen
  • the individual has a plurality of variant alleles selected from Tables 5-10 and 13.
  • the individual has at least one HLA allele selected from Tables 8a and 8b.
  • the individual has at least one CNV selected from Tables 9 and 10.
  • the individual has at least one PTV selected from the group consisting of:
  • a PTV in APOB that correlates with decreased levels of LDL, apolipoprotein B or triglycerides
  • a PTV in GPT that correlates with decreased levels of alanine aminotransferase
  • a PTV in IQGAP2 and ALB that correlates with decreased levels of albumin
  • a PTV in GPLD1 and ALPL that correlates with decreased levels of alkaline phosphatase
  • a PTV in ZNF229 that correlates with decreased levels of apolipoprotein B
  • a PTV in PDE3B that correlates with decreased levels of apolipoprotein B or triglycerides
  • a PTV in TNFRSF13B that correlates with decreased levels of non-albumin protein
  • a PTV in ANGPTL8 and LPL that correlates with decreased levels of triglycerides
  • a PTV in DRD5, PDZK1, or SLC22A12 that correlates with decreased levels of urate
  • a PTV in LIPC, PDE3B, and LPL that correlates with increased levels of apolipoprotein A or HDL
  • a PTV in FUT2 or RAP1GAP that correlates with increased levels of alkaline phosphatase
  • a PTV in RNF186 or SLC22A2 that correlates with increased levels of creatinine
  • a PTV in SLCO1B1 or UGT1A10 that correlates with increased levels of bilirubin
  • a PTV in RORC, SIGLEC1, or UPB1 that correlates with increased levels of gamma glutamyltransferase
  • At least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement.
  • the clinical biomarker is a serum or urine biomarker.
  • the clinical biomarker is selected from the group consisting of alanine aminotransferase, albumin, alkaline phosphatase, apolipoprotein A, apolipoprotein B, aspartate aminotransferase, calcium, cholesterol, c-reactive protein, creatinine, cystatin-C, direct bilirubin, gamma glutamyltransferase, glucose, glycated hemoglobin (HbA1c), HDL cholesterol, insulin-like growth factor 1 (IGF-1), low-density lipoprotein (LDL) direct, lipoprotein-A, phosphate, sex hormone binding globulin (SHBG), testosterone, total bilirubin, total protein, triglycerides, urate, urea, vitamin D, creatinine in urine, estimated glomerular filtration rate (eGFR), microalbumin in urine, potassium in urine, sodium in urine, non-albumin protein, urine albumin to creat
  • the method further comprises measuring the clinical biomarker in the individual.
  • At least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and the disease or the medically relevant trait including, for example, without limitation, type 2 diabetes, primary biliary cirrhosis, rheumatoid arthritis, schizophrenia, lupus, ulcerative colitis, sunburn, Crohn’s disease, allergy/eczema, hypothyroidism, age of menarche, age of menopause, systolic blood pressure, basophil percentage, eosinophil percentage, hematocrit, hemoglobin concentration, reticulocyte count, reticulocyte percentage, immature reticulocyte fraction, lymphocyte count, lymphocyte percentage, mean corpuscular hemoglobin (MCH), MCH concentration, mean corpuscular volume (MCV), mean platelet thrombocyte volume (MPV), mean reticulocyte volume, mean sphered cell volume, monocyte count, monocyte percentage, neutrophil count, neutrophil percentage, platelet count,
  • the method further comprises adjusting at least one PRS for covariates including, for example, without limitation, age, sex, socioeconomic status, ethnicity, and anthropometric measurements.
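One common way to adjust a score for covariates is to regress the score on the covariates and keep the residuals. The sketch below assumes that approach (the text does not specify the adjustment procedure) and uses ordinary least squares solved through the normal equations; the function name `residualize` and the example values are hypothetical, and the covariate matrix is assumed to be full rank.

```python
def residualize(score, covariates):
    """Return residuals of `score` after ordinary least squares on covariates.

    score: list of PRS values, one per individual
    covariates: list of rows, each a list of numeric covariates (e.g. age, sex)
    Assumes the covariates are not collinear; no singularity check is made.
    """
    n = len(score)
    X = [[1.0] + list(row) for row in covariates]  # prepend an intercept column
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * score[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return [y - f for y, f in zip(score, fitted)]


# Hypothetical scores with age (years) and sex (0/1) as covariates.
prs = [0.10, 0.40, 0.35, 0.80, 0.60]
covs = [[40, 0], [50, 1], [45, 0], [60, 1], [55, 0]]
adjusted = residualize(prs, covs)
```

The residuals are uncorrelated with each covariate by construction, so any remaining association with the outcome is not attributable to a linear effect of those covariates.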
  • the disease is myocardial infarction
  • the method comprises calculating at least one polygenic risk score for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement selected from tyrosine, glycoprotein acetyls, CH2 in fatty acids, arachidonic acid, pulse, sleep, vitamin D, urate, triglycerides, total protein, sodium in urine, phosphate, lipoprotein A, high density lipoprotein cholesterol, low density lipoprotein cholesterol, total cholesterol, ApoA, ApoB, Albumin, HbA1c, hemoglobin, diastolic blood pressure, CysC, proinsulin, glycoprotein, omega 6 fatty acid, macrophage colony stimulating factor, cutaneous T-cell-attracting chemokine, waist to hip ratio, fat mass, total protein, sleep hours, urate, sodium in urine, gamma glutamyltransferase, lymphocyte count, hand grip strength,
  • the disease is diabetes
  • the method comprises calculating at least one polygenic risk score for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement selected from waist to hip ratio, fat mass, waist circumference, pulse, sex hormone binding globulin, IGF1, high density lipoprotein cholesterol, lipoprotein A, ApoA, alanine aminotransferase, hip circumference, HbA1c, glucose, diastolic blood pressure, BMI, platelet derived growth factor, VEGF (vascular endothelial growth factor), total 20:0 long chain fatty acids, albumin, water intake, vitamin D, total bilirubin, testosterone, direct bilirubin, lymphocyte count, C-reactive protein, left hand grip strength, forced vital capacity, forced expiratory volume in 1 second, and total body fat, and various diabetes polygenic scores with and without adjustment for BMI.
  • the method further comprises measuring the clinical biomarker
  • a Spearman correlation is used to generate the correlation data.
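A Spearman correlation is the Pearson correlation computed on ranks, with tied values sharing the mean of their rank positions. The following sketch shows that calculation in plain Python; the function names and the example dosage and biomarker values are illustrative assumptions, not data from the application.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical allele dosages and biomarker measurements for six individuals.
dosage = [0, 1, 2, 1, 0, 2]
biomarker = [5.1, 4.8, 3.9, 4.5, 5.3, 4.0]
rho = spearman(dosage, biomarker)
```

Because it depends only on ranks, the statistic is insensitive to monotone transformations of the biomarker measurement, which can be convenient for skewed laboratory values.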
  • the correlation data is selected from Tables 4-10 and 13.
  • At least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement, and at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and the disease or the medically relevant trait.
  • the method further comprises: a) generating a predictive model using one or more algorithms, wherein said predictive model is based on at least one PRS for the genetic association with a size effect on a clinical biomarker measurement and at least one PRS for the genetic association with the disease or the medically relevant trait; and b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS.
  • one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm.
  • a machine learning algorithm may be used including without limitation a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
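As one illustration of a regression algorithm in this role, a logistic regression can take a biomarker PRS and a disease PRS as inputs and return a single combined risk between 0 and 1. The sketch below fits such a model by plain gradient descent; it only shows how separate scores could be combined and does not demonstrate that the combination outperforms each score. The function names, learning rate, epoch count, and training data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit logistic regression weights by batch gradient descent.

    features: list of rows, e.g. [biomarker_prs, disease_prs] (standardized)
    labels: list of 0/1 disease outcomes
    Returns weights where w[0] is the intercept.
    """
    n, p = len(features), len(features[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for x, y in zip(features, labels):
            err = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def combined_risk(w, x):
    """Combined risk score: predicted probability from the fitted model."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Hypothetical training data: [biomarker PRS, disease PRS] and outcomes.
train_x = [[-1.2, -0.8], [-0.5, 0.1], [0.0, -0.3], [0.3, 0.9],
           [0.8, 0.4], [1.5, 1.1], [-0.9, 0.6], [0.6, -0.7]]
train_y = [0, 0, 0, 1, 1, 1, 0, 1]

model = fit_logistic(train_x, train_y)
high = combined_risk(model, [1.5, 1.1])
low = combined_risk(model, [-1.2, -0.8])
```

A random forest, deep neural network, or Bayesian model averaging approach would replace `fit_logistic` while keeping the same inputs and the same kind of output.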
  • the method further comprises treating the individual for the disease if the polygenic risk score indicates that the individual has the disease.
  • genotyping comprises sequencing at least part of a genome of one or more cells from the individual. In some embodiments, genotyping comprises sequencing the whole genome of the individual.
  • a database comprising correlation data between genetic variants and clinical biomarkers, diseases, and medically relevant traits, wherein the correlation data is selected from Tables 4-10 and 13.
  • a computer implemented method for predicting the risk of an individual developing a disease or medically relevant phenotypic trait comprising: a) receiving genome sequencing data for an individual; b) identifying variant alleles present in the individual from the genome sequencing data, wherein the individual has a plurality of variant alleles selected from Tables 5-10 and 13; c) calculating at least one polygenic risk score using a database, as described herein, based on the variant alleles present in the individual, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait; and d) displaying information regarding the risk of the individual developing the disease or the medically relevant trait.
  • the computer implemented method further comprises: a) generating a predictive model using one or more algorithms, wherein the predictive model is based on at least one PRS for a genetic association with a size effect on a clinical biomarker measurement and at least one PRS for a genetic association with the disease or the medically relevant trait; and b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS.
  • one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm.
  • a machine learning algorithm may be used including without limitation a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
  • the computer implemented method further comprises storing the information regarding the risk of the individual developing the disease or the medically relevant phenotypic trait in a database.
  • a system for predicting the risk of an individual developing a disease or medically relevant trait using a computer implemented method described herein comprising: a) a storage component for storing data, wherein the storage component has instructions for predicting the risk of an individual developing a disease or medically relevant trait based on analysis of the genome sequencing data stored therein; b) a computer processor for processing the genome sequencing data using one or more algorithms, wherein the computer processor is coupled to the storage component and configured to execute the instructions stored in the storage component in order to receive the inputted genome sequencing data and analyze the data according to the computer implemented method described herein; and c) a display component for displaying the information regarding the risk of the individual developing the disease or the medically relevant trait.
  • a non-transitory computer-readable medium comprising program instructions that, when executed by a processor in a computer, causes the processor to perform a computer implemented method for predicting the risk of an individual developing a disease or medically relevant phenotypic trait, as described herein.
  • a kit comprising the non-transitory computer-readable medium and instructions for predicting the risk of an individual developing a disease or medically relevant trait is provided.
  • FIG. 1 shows a schematic overview of the study.
  • FIGS. 2A-2E show the genetics of lab phenotypes.
  • FIG. 2B Fraction of heritability per Chromosome across the 38 studied phenotypes. We obtained the chromosomal heritability by summing local heritability at loci within the chromosome. For each chromosome, we plot the boxplots of estimates at the 38 considered phenotypes.
  • FIG. 2D (x-axis) Polygenic heritability estimate for 38 lab phenotypes (y-axis) using LD-score regression. Estimate and standard error intervals shown.
  • FIG. 2E Enrichment of traits in different cell types. Definitions of tissue type groups are taken from Finucane et al. (Nat Genet. (2016) 50(4):621-629). Enrichments for all traits in each tissue are shown; the vast majority of enrichment across traits is in the liver and kidney, and the exceptions are highlighted.
  • FIGS. 3A-3B show a correlation of genetic effects and causal inference.
  • FIG. 3B MR-Egger and LCV predict causal links between lab measurements (blue nodes) and selected complex traits (red nodes). Association arrows are drawn based on MR-Egger (red), LCV (blue), or both (black), and multiple arrows indicate support from multiple studies. MR-Egger and LCV were jointly adjusted for FDR 10% cutoff across all tests. Triangles are used for binary and circles for continuous summary statistics. Edge width is proportional to the absolute causal effect size, estimated by MR-Egger. A complete listing of discovered associations is provided as a table (Table 15).
  • FIGS. 4A-4D show lab phenotype prediction from genetic data within and across populations.
  • FIG. 4A Increments in predictive performance with genetic data (change in correlation, R, or ROC-AUC) for White British (x-axis) and other ethnic groups (y-axis) are shown across the 38 lab phenotypes.
  • FIGS. 5A-5D show a Polygenic Risk Score Phenome-Wide Association Study (PRS-PheWAS).
  • FIG. 5A (x-axis) Biomarker polygenic risk scores at top 0.1% (top01) and top 1% (top1) and their association to different diseases in UK Biobank, represented as the odds ratio of the disease in this group relative to the middle 40-60% of individuals.
  • FIGS. 5B-5C (x axis) quantiles of polygenic risk score, spaced to linearly represent the mean of the corresponding bin of scores (y axis) Prevalence of disease (binary outcome) or average measurement (continuous outcome) within each quantile bin of the polygenic risk score. Error bars represent the standard error around each measurement.
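The odds-ratio comparison described for FIG. 5A can be sketched as follows: individuals are ranked by score, the top fraction is taken as the high-risk group, the 40th to 60th percentiles form the reference group, and the odds of disease in the two groups are compared. This is an illustrative reading of the caption, not the study's analysis code; the function name `tail_odds_ratio` and the synthetic data are assumptions, and both groups are assumed to contain cases and non-cases.

```python
def odds(cases, total):
    """Odds of disease in a group; assumes 0 < cases < total."""
    return cases / (total - cases)


def tail_odds_ratio(scores, has_disease, top_fraction=0.01):
    """Odds ratio of disease in the top score fraction vs. the middle 40-60%."""
    ranked = sorted(zip(scores, has_disease), key=lambda pair: pair[0])
    n = len(ranked)
    middle = ranked[n * 2 // 5: n * 3 // 5]        # 40th to 60th percentile
    top = ranked[n - max(1, round(top_fraction * n)):]
    top_cases = sum(1 for _, d in top if d)
    mid_cases = sum(1 for _, d in middle if d)
    return odds(top_cases, len(top)) / odds(mid_cases, len(middle))


# Synthetic cohort of 1000 individuals: 10% prevalence below the top 1%,
# 50% prevalence within the top 1% of scores.
cohort_scores = list(range(1000))
cohort_disease = [(i % 2 == 0) if i >= 990 else (i % 10 == 0) for i in range(1000)]

ratio = tail_odds_ratio(cohort_scores, cohort_disease, top_fraction=0.01)
```

Setting `top_fraction=0.001` would give the top 0.1% comparison; with small groups the estimate becomes unstable, so a real analysis would report a confidence interval alongside it.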
  • FIG. 6 shows the proportion of variance explained by all covariates across the 37 raw laboratory phenotypes. (x-axis) Regression estimate of the proportion of variance explained by all 127 covariates in a linear model for 37 raw laboratory phenotypes including Fasting glucose defined if fasting time between 8 and 24 hours according to Data Field 74 in UK Biobank Data Showcase (y-axis). Blue bar plots indicate estimate before medication adjustment and red bar plots indicate estimate after medication adjustment.
  • FIG. 7A shows normalized regression coefficients for the 37 raw laboratory phenotypes across the covariates. (x-axis) Normalized regression coefficient for 23 covariates in a linear model for the 37 raw laboratory phenotypes including Fasting glucose defined as fasting time between 8 and 24 hours according to Data Field 74 in UK Biobank Data Showcase (y-axis). Bar plots outlined in dark gray indicate estimate before medication adjustment and bar plots outlined in light gray indicate estimate after medication adjustment.
  • FIG. 7B shows phenotype distributions of all biomarkers by age and sex.
  • Age of individuals within a pentacontile were averaged
  • y-axis The corresponding average value +/- 1 SD of each biomarker measurement for all individuals with available data in the study.
  • FIG. 7C shows residual distributions of all biomarkers by age and sex.
  • Age of individuals within a pentacontile were averaged
  • y-axis The corresponding average value +/- 1 SD of each biomarker residual for all individuals with available data in the study, after adjusting for the 127 covariates and intercept.
  • FIG. 8 shows the phenotype correlation among the 38 lab phenotypes. -1 (red) to 1 (blue) correlation of phenotypes (cell size indicates correlation). Only cells with p < 0.001 are shown. Results are consistent with previous work and capture known associations between both testosterone and SHBG with uric acid (urate) levels 2.
  • FIG. 9 shows a correlogram of different diabetes and diabetes-related traits.
  • type 2 diabetes following Eastwood et al.
  • high confidence diabetes examining all available timepoints for an individual and using self-report and ICD codes
  • prescription of metformin or any oral antidiabetic are compared to the biomarker measurements of HbA1c and glucose.
  • HbA1c was adjusted for statins (see Methods) and residualized (see Methods), while glucose was subset to individuals with a fasting time between 8 and 24 hours (see Methods) to ensure effects were not driven by fasting.
  • Diagnosed diabetes was defined by the UK Biobank during the nurse interview, and family history was defined as having at least one self-reported mother, father, or sibling (non-adopted) with diabetes. Table of correlations presented below (Table 3).
  • FIGS. 10A-10B show comparisons of estimated effect sizes between UK Biobank and previous GWAS. (x-axis) UK Biobank estimated effect size (y-axis) Comparative study estimated effect size. All variants associated p < 1e-6 in either study are shown.
  • FIG. 10A shows plots for LDL vs. GLGC, HbA1c vs. MAGIC, and triglycerides vs. GLGC.
  • FIG. 10B shows plots for urate vs. GUGC and alanine aminotransferase vs. Biobank Japan.
  • FIGS. 11A-11B show cascade plots for predicted protein-truncating variants across lab phenotypes. (x-axis) Minor allele frequency of genetic variant associated to phenotype (p < 1e-7) and (y-axis) BETA univariate regression coefficient estimate.
  • Orange and labelled data points include genes with PTVs whose estimated effect size (BETA) is greater than or equal to 0.1 or less than or equal to -0.1 standard deviation (SD).
  • BETA estimated effect size
  • SD standard deviation
  • Two phenotypes (Creatinine in urine and estradiol) did not have PTV associations with p < 1e-7 and were excluded from the plot.
  • FIGS. 12A-12B show cascade plots for predicted protein-altering variants across lab phenotypes. (x-axis) Minor allele frequency of genetic variant associated to phenotype (p < 1e-7) and (y-axis) BETA univariate regression coefficient estimate.
  • Light gray and labelled data points include genes with protein-altering variants whose estimated effect size (BETA) is greater than or equal to 0.1 or less than or equal to -0.1.
  • FIGS. 13A-13C show cascade plots for non-coding variants across lab phenotypes. (x-axis) Minor allele frequency of non-coding variants characterized on the imputed 1000 Genomes Phase I variant associated to phenotype (p < 5e-8) and (y-axis) BETA univariate regression coefficient estimate.
  • Orange and labelled data points include non-coding variants whose estimated effect size (BETA) is an outlier, i.e., absolute value of estimated effect size deviates from the standard deviation range estimated from linear fit between log minor allele frequency and absolute value of estimated effect size (outlier, see methods for more details).
  • the gene symbols are shown for splicing variants.
  • FIG. 14 shows posterior effect sizes, probabilities of Bayesian Model Averaging model inclusion, and linkage disequilibrium for HLA alleles on 29 different biomarker phenotypes. The y-axis indicates phenotype, and the x-axis indicates allele. Above - the size of each dot corresponds to the posterior probability that the HLA allele is included as a variable across all plausible models as deemed by BIC measures from BMA, and the color of each dot corresponds to the size and direction of the effect of the allele on the phenotype as found by PLINK. Only the top 10 significant PLINK hits per phenotype were considered for the analysis. Below - LD measures (as determined and visualized by the gaston package) across HLA allelotypes; the measures displayed are R² values.
  • FIG. 15A shows CNV association analysis across the 38 biomarkers. X-axis Genomic coordinate and -log10(P) for single CNV association. CNV and biomarker association are highlighted when p < .05/10000 with cytogenetic band labelled.
  • FIG. 15B shows PheWAS of rare CNVs affecting HNF1B. X-axis log-odds ratio and -log10(P) for each trait having association with HNF1B CNVs at p < 1e-4. Associations for all traits run as in previous analysis 3.
  • FIG. 16 shows cumulative heritability.
  • x-axis SNP ranked by heritability per SNP (millions) and its corresponding cumulative heritability (y-axis) across the 38 lab phenotypes.
  • Lab phenotype label shown in the title of the subplots.
  • FIG. 17A shows enrichment of traits in different cell types. Definitions of tissue type groups are taken from Finucane et al (Nat. Genet. (2016) 50:621-629). Enrichments for all traits in each tissue are shown; the vast majority of enrichment across traits is in the liver and kidney.
  • FIG. 17B shows grouped cell type heritability enrichments across ten tissues. (x-axis, top) Fold enrichment with SE for each lab phenotype across 10 tissues (y-axis). (x-axis, bottom) -log10(P) value of enrichment for each lab phenotype across 10 tissues (y-axis).
  • FIG. 18 shows individual annotations for pancreas, liver, and kidney ChIP-seq experiments.
  • -log10(P) x-axis
  • y-axis for cell type heritability enrichment across pancreas, liver, and kidney ChIP-seq experiments.
  • FIG. 19 shows phenome-wide associations across 25 protein-truncating variants and laboratory measurements and 24 disease outcomes in the UK Biobank.
  • Targeted phenome-wide association analysis was performed for PTVs outside of the human MHC region that showed significant genome-wide associations (p < 1e-7) with at least one of the laboratory measurement traits.
  • the log odds ratio of the significant PheWAS associations (p < 1e-5) are shown across phenotypes (x-axis) and PTVs (y-axis).
  • the 46 significant (p < 1e-5) associations across 25 variants and 24 disease outcomes are shown as well as the associations with laboratory measurements.
  • the color of phenotype names indicates binary disease outcomes or family history (red) or laboratory measurements (purple).
  • FIG. 20 shows phenome-wide associations across 35 LD-independent protein-altering variants and 28 disease outcomes in the UK Biobank.
  • Targeted phenome-wide association analysis was performed for protein-altering variants outside of the human MHC region that showed significant genome-wide associations (p < 1e-7) with at least one of the laboratory measurement traits.
  • the log odds ratio of the significant PheWAS associations (p < 1e-5) are shown across phenotypes (x-axis) and protein-altering variants (y-axis).
  • Out of 172 significant (p < 1e-5) associations across 80 LD-independent protein-altering variants and 75 disease outcomes, 35 variants and 28 disease outcomes with maximal number of significant associations are chosen for visualization.
  • the associations for those variant-phenotype pairs are shown as well as the associations across laboratory measurement phenotypes.
  • the color of phenotype names indicates binary disease outcomes or family history (red) and laboratory measurements (purple).
  • the color for log odds ratio or beta 0.2 is used for the associations with > 0.2 log odds ratio or beta.
  • FIG. 21 shows correlation of genetic effects between biomarkers. -1 (red) to 1 (blue) scale of correlation of genetic effects estimated using LD-score regression.
  • FIG. 22 shows correlation of genetic effects between biomarkers with normalization (“INT”), and with lipid-lowering therapy adjustment (“adjstatins”) and without. -1 (red) to 1 (blue) scale of correlation of genetic effects estimated using LD-score regression.
  • FIG. 23 shows correlation of genetic effects between normalized (“INT”) lab phenotypes with lipid-lowering therapy adjustment (“adjstatins”) and without. -1 (red) to 1 (blue) scale of correlation of genetic effects estimated using LD-score regression.
  • FIG. 24A shows “Lake” plots of GWAS p-value and the magnitude of effect size estimates from snpnet for Lipoprotein A.
  • (x-axis) Genomic coordinates for (top panel) -log10(P) from GWAS and (bottom panel) absolute value of estimated effect size using snpnet (abs(BETA) from snpnet).
  • FIG. 24B shows “Lake” plots of GWAS p-value and the magnitude of effect size estimates from snpnet for LDL.
  • (x-axis) Genomic coordinates for (top panel) -log10(P) from GWAS and (bottom panel) absolute value of estimated effect size using snpnet (abs(BETA) from snpnet).
  • FIG. 24C shows “Lake” plots of GWAS p-value and the magnitude of effect size estimates from snpnet for Alanine Aminotransferase. (x-axis) Genomic coordinates for (top panel) -log10(P) from GWAS and (bottom panel) absolute value of estimated effect size using snpnet (abs(BETA) from snpnet).
  • FIG. 25 shows lab phenotype prediction from genetic data within and across populations. The predictive performance with both genetic data and covariates (correlation, R) for White British (x-axis) and other ethnic groups (y-axis) are shown across the 38 lab phenotypes.
  • FIG. 26 shows an evaluation of the prevalence of type 2 diabetes based on precision polygenic risk scores for clinical laboratory tests of serum and urine, including lipids, hormones, and measures of kidney function.
  • Methods, systems, and devices, including computer programs encoded on a computer storage medium, are provided for predicting the risk of an individual developing a polygenic disease or medically relevant trait.
  • methods are provided for using genetic information based on the detection of multiple genetic variants in an individual for diagnosing polygenic diseases, correlating phenotypic characteristics with genetic data, and predicting the risk of developing a disease or medically relevant condition by analyzing polygenic contributions to the disease and underlying changes in physical traits and clinically measured biomarkers.
  • sample with respect to an individual encompasses blood, urine, and other liquid samples of biological origin, solid tissue samples such as a biopsy specimen or tissue cultures or cells derived or isolated therefrom and the progeny thereof.
  • sample also includes samples that have been manipulated in any way after their procurement, such as by treatment with reagents; washed; or enrichment for certain cell populations, such as cancer cells.
  • samples that have been enriched for particular types of molecules, e.g., nucleic acids, polypeptides, etc.
  • DNA samples e.g. samples useful in genotyping, are readily obtained from any nucleated cells of an individual, e.g. hair follicles, cheek swabs, white blood cells, etc., as known in the art.
  • biological sample encompasses a clinical sample.
  • the types of “biological samples” include, but are not limited to: biological fluids, tissue samples, tissue obtained by surgical resection, tissue obtained by biopsy, cells in culture, cell supernatants, cell lysates, organs, bone marrow, blood, plasma, serum, saliva, urine, fine needle aspirate, lymph node aspirate, cystic aspirate, a paracentesis sample, a thoracentesis sample, and the like.
  • the term “assaying” is used herein to include the physical steps of manipulating a biological sample to generate data related to the sample.
  • a biological sample must be “obtained” prior to assaying the sample.
  • the term “assaying” implies that the sample has been obtained.
  • the terms “obtained” or “obtaining” as used herein encompass the act of receiving an extracted or isolated biological sample. For example, a testing facility can “obtain” a biological sample in the mail (or via delivery, etc.) prior to assaying the sample.
  • the biological sample was “extracted” or “isolated” from an individual by another party prior to mailing (i.e., delivery, transfer, etc.), and then “obtained” by the testing facility upon arrival of the sample.
  • a testing facility can obtain the sample and then assay the sample, thereby producing data related to the sample.
  • the terms “obtained” or “obtaining” as used herein can also include the physical extraction or isolation of a biological sample from a subject. Accordingly, a biological sample can be isolated from a subject (and thus “obtained”) by the same person or same entity that subsequently assays the sample. When a biological sample is “extracted” or “isolated” from a first party or entity and then transferred (e.g., delivered, mailed, etc.) to a second party, the sample was “obtained” by the first party (and also “isolated” by the first party), and then subsequently “obtained” (but not “isolated”) by the second party.
  • the step of obtaining does not comprise the step of isolating a biological sample.
  • the step of obtaining comprises the step of isolating a biological sample (e.g., a pre-treatment biological sample, a post-treatment biological sample, etc.).
  • Methods and protocols for isolating various biological samples e.g., a blood sample, a serum sample, a plasma sample, a urine sample, a biopsy sample, an aspirate, etc.
  • any convenient method may be used to isolate a biological sample.
  • determining means determining whether the level of a clinical biomarker is less than or “greater than or equal to” a particular threshold (the threshold can be pre-determined or can be determined by assaying a control sample).
  • “assaying to determine the level” can mean determining a quantitative value (using any convenient metric) that represents the level of a clinical biomarker.
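  • As an illustration of the threshold comparison described above, the following is a minimal sketch. The function names are hypothetical, and using the mean of control-sample levels as the threshold is an assumption made only for this example; the text says merely that the threshold can be pre-determined or determined by assaying a control sample.

```python
from statistics import mean


def threshold_from_controls(control_levels):
    """Derive a threshold from control-sample measurements.

    Assumption for illustration: the threshold is the mean control level.
    """
    return mean(control_levels)


def determine_level(level, threshold):
    """Report whether a biomarker level is below, or at/above, the threshold."""
    return "below" if level < threshold else "at_or_above"


# Example with made-up numbers: a pre-determined threshold of 5.0
print(determine_level(4.9, 5.0))
print(determine_level(5.0, 5.0))
```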
  • the term “treatment” is used herein to refer generally to obtaining a desired pharmacologic and/or physiologic effect.
  • the effect can be prophylactic in terms of completely or partially preventing a disease or symptom(s) thereof and/or may be therapeutic in terms of a partial or complete stabilization or cure for a disease and/or adverse effect attributable to the disease.
  • treatment encompasses any treatment of a disease in a mammal, particularly a human, and includes: (a) preventing the disease and/or symptom(s) from occurring in a subject who may be predisposed to the disease or symptom but has not yet been diagnosed as having it; (b) inhibiting the disease and/or symptom(s), i.e., arresting their development; or (c) relieving the disease symptom(s), i.e., causing regression of the disease and/or symptom(s).
  • Those in need of treatment include those already afflicted (e.g., those with cancer, those with an infection, etc.) as well as those in which prevention is desired (e.g., those with increased susceptibility to cancer, those suspected of having cancer, etc.).
  • a therapeutic treatment is one in which the subject is afflicted prior to administration and a prophylactic treatment is one in which the subject is not afflicted prior to administration.
  • the subject has an increased likelihood of becoming afflicted or is suspected of being afflicted prior to treatment.
  • the subject is suspected of having an increased likelihood of becoming afflicted.
  • substantially purified generally refers to isolation of a substance (e.g., compound, molecule, agent) such that the substance comprises the majority percent of the sample in which it resides.
  • a substantially purified component comprises 50%, preferably 80%-85%, more preferably 90-95% of the sample.
  • by “isolated” is meant that an indicated cell, population of cells, or molecule is separate and discrete from a whole organism or is present in the substantial absence of other cells or biological macromolecules of the same type.
  • by “vertebrate” is meant any member of the subphylum chordata, including, without limitation, humans and other primates, including non-human primates such as chimpanzees and other apes and monkey species; farm animals such as cattle, sheep, pigs, goats and horses; domestic mammals such as dogs and cats; laboratory animals including rodents such as mice, rats and guinea pigs; birds, including domestic, wild and game birds such as chickens, turkeys and other gallinaceous birds, ducks, geese, and the like.
  • the term does not denote a particular age. Thus, both adult and newborn individuals are intended to be covered.
  • probe refers to a polynucleotide that contains a nucleic acid sequence complementary to a nucleic acid sequence present in the target nucleic acid analyte (e.g., at location of a mutation).
  • the polynucleotide regions of probes may be composed of DNA, and/or RNA, and/or synthetic nucleotide analogs.
  • Probes may be labeled in order to detect the target sequence. Such a label may be present at the 5’ end, at the 3’ end, at both the 5’ and 3’ ends, and/or internally.
  • An "allele-specific probe” hybridizes to only one of the possible alleles of a gene (e.g., hybridizes at the location of a mutation) under suitably stringent hybridization conditions.
  • primer refers to an oligonucleotide that hybridizes to the template strand of a nucleic acid and initiates synthesis of a nucleic acid strand complementary to the template strand when placed under conditions in which synthesis of a primer extension product is induced, i.e., in the presence of nucleotides and a polymerization-inducing agent such as a DNA or RNA polymerase and at suitable temperature, pH, metal concentration, and salt concentration.
  • the primer is preferably single-stranded for maximum efficiency in amplification, but may alternatively be double-stranded. If double-stranded, the primer can first be treated to separate its strands before being used to prepare extension products.
  • a "primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3' end complementary to the template in the process of DNA or RNA synthesis.
  • nucleic acids are amplified using at least one set of oligonucleotide primers comprising at least one forward primer and at least one reverse primer capable of hybridizing to regions of a nucleic acid flanking the portion of the nucleic acid to be amplified.
  • An "allele-specific primer” matches the sequence exactly of only one of the possible alleles of a gene (e.g., hybridizes at the location of a mutation), and amplifies only one specific allele if it is present in a nucleic acid amplification reaction.
  • common genetic variant or “common variant” refers to a genetic variant having a minor allele frequency (MAF) of greater than 5%.
  • rare genetic variant or “rare variant” refers to a genetic variant having a minor allele frequency (MAF) of less than or equal to 5%.
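  • The two definitions above can be sketched in code as follows. The 5% cut-off comes from the definitions; the function names and the genotype-count inputs are hypothetical and chosen only for illustration.

```python
def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Compute MAF for a biallelic variant from genotype counts.

    n_ref_hom: individuals with 0 copies of the alternate allele
    n_het:     individuals with 1 copy
    n_alt_hom: individuals with 2 copies
    """
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals")
    alt_freq = (n_het + 2 * n_alt_hom) / total_alleles
    # The minor allele is whichever allele is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def classify_variant(maf, threshold=0.05):
    """'common' if MAF > 5%, 'rare' if MAF <= 5%, per the definitions above."""
    return "common" if maf > threshold else "rare"


maf = minor_allele_frequency(n_ref_hom=98, n_het=2, n_alt_hom=0)
print(maf, classify_variant(maf))
```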
  • Methods are provided for determining whether an individual is likely to develop a polygenic disease or medically relevant trait. Most common diseases are caused by dysregulation of multiple genes.
  • a predictive model is provided that estimates the risk of developing a disease or medically relevant condition by analyzing polygenic contributions to the disease and underlying changes in physical traits and clinically measured biomarkers.
  • the method typically involves genotyping an individual to identify genetic variants present in the genome that may be associated with a polygenic disease or medically relevant phenotypic trait, and using a database to calculate a polygenic risk score, wherein the database comprises correlation data for associations between genetic variants and diseases or medically relevant traits based on genome-wide testing of a population for genetic variants associated with the disease or the medically relevant trait.
  • the risk of an individual developing a disease or medically relevant trait is assessed from calculation of polygenic risk scores based on the genetic variants detected in the individual, as described further below (see Examples).
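  • A minimal sketch of a polygenic risk score calculation follows. It assumes the commonly used additive form, in which the score is the sum over variants of an effect size multiplied by the individual's allele dosage (0, 1, or 2); the text above does not fix a formula, so this form, the variant identifiers, the dictionary standing in for the database, and the percentile helper are all assumptions for illustration.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive score: sum of (effect size x allele dosage) over variants.

    dosages:      dict of variant id -> copies of the effect allele (0, 1, 2)
    effect_sizes: dict of variant id -> per-allele effect size (stands in
                  for the association database)
    Variants absent from the database are skipped.
    """
    score = 0.0
    for variant_id, dosage in dosages.items():
        beta = effect_sizes.get(variant_id)
        if beta is None:
            continue
        score += beta * dosage
    return score


def risk_percentile(score, reference_scores):
    """Percentage of reference-population scores strictly below this score."""
    below = sum(1 for s in reference_scores if s < score)
    return 100.0 * below / len(reference_scores)


# Hypothetical variant ids and effect sizes, for illustration only.
effect_sizes = {"var_a": 0.2, "var_b": -0.1, "var_c": 0.5}
dosages = {"var_a": 2, "var_b": 1, "var_d": 1}
score = polygenic_risk_score(dosages, effect_sizes)
print(score, risk_percentile(score, [0.0, 0.1, 0.2, 0.4]))
```

  • Placing the score on a reference distribution, as in the hypothetical `risk_percentile` helper, is one way a high-risk group could be flagged for closer monitoring.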
  • the methods described herein are useful for identifying individuals in need of close monitoring and treatment for a polygenic disease or medically relevant condition.
  • High risk individuals may be monitored more frequently for the development of symptoms of a polygenic disease, for example, by testing for disease relevant clinical biomarkers and changes in health status with prompt attention to any disease-relevant changes in health.
  • the methods are also of use for determining a therapeutic regimen or determining if a subject will benefit from treatment with a therapeutic regimen.
  • a subject identified as having a genetic predisposition to developing a polygenic disease or medically relevant condition may be treated in advance of developing symptoms of the disease to prevent physical damage that would be caused in the absence of treatment.
  • Such treatment may include, for example, without limitation, prescribing drugs that delay or minimize the risk of development of a disease, adjusting diet and/or levels of physical exercise, or administering gene therapy (e.g., modulating expression or activity of a gene or introducing a functional gene to compensate for the presence of a mutant allele having deficient or abnormal activity).
  • the methods described herein may be useful for confirming the diagnosis of a subject already showing symptoms of disease, who should be administered treatment for the disease.
  • the genetic variants detected may include common or rare genetic variants, such as mutations (e.g., nucleotide replacements, insertions, or deletions) and alterations of copy number.
  • the genetic variants are protein-truncating variants (PTVs), protein-altering variants, non-coding variants, single nucleotide variants, or human leukocyte antigen (HLA) allelotypes.
  • the genetic variants are associated with a known phenotype of interest (e.g., disease or condition).
  • a biological sample containing nucleic acids is collected from an individual.
  • the biological sample is typically saliva or cells from buccal swabbing, but can be any sample from bodily fluids, tissue or cells that contains genomic DNA or RNA of the individual.
  • nucleic acids from the biological sample are isolated, purified, and/or amplified prior to analysis using methods well-known in the art. See, e.g., Green and Sambrook Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press; 4th edition, 2012); and Current Protocols in Molecular Biology (Ausubel ed., John Wiley & Sons, 1995); herein incorporated by reference in their entireties.
  • Detection of a mutation can be direct or indirect.
  • the mutated gene itself can be detected directly.
  • the mutation can be detected indirectly from cDNAs, amplified RNAs or DNAs, or proteins expressed by a mutated allele. Any method that detects a base change in a nucleic acid sample or an amino acid change in a protein can be used.
  • allele-specific probes that specifically hybridize to a nucleic acid containing the mutated sequence can be used to detect the mutation.
  • a variety of nucleic acid hybridization formats are known to those skilled in the art. For example, common formats include sandwich assays and competition or displacement assays.
  • Hybridization techniques are generally described in Hames and Higgins, “Nucleic Acid Hybridization, A Practical Approach,” IRL Press (1985); Gall and Pardue, Proc. Natl. Acad. Sci. U.S.A., 63:378-383 (1969); and John et al., Nature, 223:582-587 (1969).
  • Sandwich assays are commercially useful hybridization assays for detecting or isolating nucleic acids. Such assays utilize a "capture" nucleic acid covalently immobilized to a solid support and a labeled "signal" nucleic acid in solution. The clinical sample will provide the target nucleic acid. The "capture” nucleic acid and “signal” nucleic acid probe hybridize with the target nucleic acid to form a "sandwich” hybridization complex.
  • the allele-specific probe is a molecular beacon.
  • Molecular beacons are hairpin shaped oligonucleotides with an internally quenched fluorophore.
  • Molecular beacons typically comprise four parts: a loop of about 18-30 nucleotides, which is complementary to the target nucleic acid sequence; a stem formed by two oligonucleotide regions that are complementary to each other, each about 5 to 7 nucleotide residues in length, on either side of the loop; a fluorophore covalently attached to the 5' end of the molecular beacon, and a quencher covalently attached to the 3' end of the molecular beacon.
  • When the beacon is in its closed hairpin conformation, the quencher resides in proximity to the fluorophore, which results in quenching of the fluorescent emission from the fluorophore.
  • hybridization occurs resulting in the formation of a duplex between the target nucleic acid and the molecular beacon.
  • Hybridization disrupts intramolecular interactions in the stem of the molecular beacon and causes the fluorophore and the quencher of the molecular beacon to separate resulting in a fluorescent signal from the fluorophore that indicates the presence of the target nucleic acid sequence.
  • the molecular beacon is designed to only emit fluorescence when bound to a specific allele of a gene.
  • If the molecular beacon probe encounters a target sequence with as little as one non-complementary nucleotide, the molecular beacon will preferentially stay in its natural hairpin state and no fluorescence is observed because the fluorophore remains quenched.
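  • The design constraints and the idealized on/off behavior of a molecular beacon described above can be sketched as follows. This is a simplified model, not a thermodynamic one: the “about 18-30” and “about 5 to 7” ranges are treated as hard bounds, any single mismatch is assumed to keep the beacon closed, and all function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def is_valid_beacon(stem5, loop, stem3):
    """Check loop length (18-30 nt), arm length (5-7 nt), and that the two
    stem arms are complementary so the hairpin can close."""
    return (
        18 <= len(loop) <= 30
        and 5 <= len(stem5) <= 7
        and len(stem5) == len(stem3)
        and stem3 == reverse_complement(stem5)
    )


def beacon_fluoresces(loop, target_site):
    """Idealized model: the beacon opens (and fluoresces) only when the loop
    is perfectly complementary to the target site."""
    return loop == reverse_complement(target_site)


stem5 = "GCGAGC"
stem3 = reverse_complement(stem5)
loop = "ATGCCGTTAGCATCGGATCA"           # 20 nt, made-up sequence
target = reverse_complement(loop)       # perfectly matching allele
mismatch = "G" + target[1:]             # same site with one base changed
print(is_valid_beacon(stem5, loop, stem3))
print(beacon_fluoresces(loop, target), beacon_fluoresces(loop, mismatch))
```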
  • detection of the mutated sequence is performed using allele-specific amplification.
  • amplification primers can be designed to bind to a portion of one of the disclosed genes, and the terminal base at the 3’ end is used to discriminate between the major and minor alleles or mutant and wild-type forms of the genes. If the terminal base matches the major or minor allele, polymerase-dependent three prime extension can proceed. Amplification products can be detected with specific probes. This method for detecting point mutations or polymorphisms is described in detail by Sommer et al. in Mayo Clin. Proc. 64:1361-1372 (1989).
  • Tetra-primer ARMS-PCR uses two pairs of primers that can amplify two alleles of a gene in one PCR reaction. Allele-specific primers are used that hybridize at the location of the mutated sequence, but each matches perfectly to only one of the possible alleles. If a given allele is present in the PCR reaction, the primer pair specific to that allele will amplify that allele, but not the other allele of the gene.
  • the two primer pairs for the different alleles may be designed such that their PCR products are of significantly different length, which allows them to be distinguished readily by gel electrophoresis. See, e.g., Munoz et al. (2009) J. Microbiol. Methods. 78(2):245-246 and Chiapparino et al. (2004) Genome. 47(2):414-420; herein incorporated by reference.
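  • The logic of allele-specific amplification and of reading out a tetra-primer ARMS-PCR gel can be sketched as below. The sketch assumes that extension proceeds only when the primer's 3' terminal base matches the allele, and that each allele-specific product has a distinct expected length; the function names, band lengths, and the 5 bp tolerance are hypothetical.

```python
def three_prime_matches(primer, allele_base):
    """True if the primer's 3' terminal base matches the allele base,
    i.e. polymerase extension is expected to proceed."""
    return primer[-1].upper() == allele_base.upper()


def call_alleles(band_lengths, allele_product_lengths, tolerance=5):
    """Infer which alleles are present from observed gel band lengths (bp).

    allele_product_lengths: dict of allele -> expected product length.
    A band within `tolerance` bp of the expected length counts as that
    allele's product. Bands matching no allele (e.g. the outer control
    product) are ignored.
    """
    return sorted(
        allele
        for allele, expected in allele_product_lengths.items()
        if any(abs(band - expected) <= tolerance for band in band_lengths)
    )


expected = {"A": 150, "G": 250}                # made-up product sizes
print(three_prime_matches("ACGTTGCA", "A"))    # primer ending in A vs. A allele
print(call_alleles([402, 150, 251], expected)) # heterozygous sample
print(call_alleles([402, 150], expected))      # only the A-allele product
```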
  • Mutations in a gene may also be detected by ligase chain reaction (LCR) or ligase detection reaction (LDR).
  • the specificity of the ligation reaction is used to discriminate between the major and minor alleles of a gene.
  • Two probes are hybridized at the site of the mutation in a nucleic acid of interest, whereby ligation can only occur if the probes are identical to the target sequence. See e.g., Psifidi et al. (2011) PLoS One 6(1):e14560; Asari et al. (2010) Mol. Cell. Probes. 24(6):381-386; Lowe et al. (2010) Anal Chem. 82(13):5810-5814; herein incorporated by reference.
  • an array comprising probes for detecting mutant alleles can be used.
  • SNP arrays are commercially available from Affymetrix and Illumina, which use multiple sets of short oligonucleotide probes for detecting known SNPs.
  • the design of SNP arrays, such as manufactured by Affymetrix or Illumina, is described further in LaFramboise, "Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances," Nuc. Acids Res. 37(13):4181-4193 (2009).
  • DASH PCR-dynamic allele specific hybridization
  • a target sequence is amplified (e.g., by PCR) using one biotinylated primer.
  • the biotinylated product strand is bound to a streptavidin-coated microtiter plate well (or other suitable surface), and the non-biotinylated strand is rinsed away with alkali wash solution.
  • An oligonucleotide probe specific for one allele (e.g., the wild-type allele), is hybridized to the target at low temperature.
  • This probe forms a duplex DNA region that interacts with a double strand-specific intercalating dye.
  • When subsequently excited, the dye emits fluorescence proportional to the amount of double-stranded DNA (probe-target duplex) present.
  • the sample is then steadily heated while fluorescence is continually monitored. A rapid fall in fluorescence indicates the denaturing temperature of the probe-target duplex.
  • Tm melting temperature
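  • One simple way to locate the rapid fall in fluorescence on a melting curve is to find the temperature interval with the steepest negative slope. The sketch below assumes this finite-difference approach and reports the midpoint of that interval as the estimated Tm; the function name and the data are hypothetical.

```python
def melting_temperature(temps, fluorescence):
    """Estimate Tm as the midpoint of the interval where fluorescence
    falls most steeply with temperature. Returns None if it never falls."""
    if len(temps) != len(fluorescence) or len(temps) < 2:
        raise ValueError("need two equal-length series with >= 2 points")
    steepest = 0.0
    tm = None
    for i in range(len(temps) - 1):
        slope = (fluorescence[i + 1] - fluorescence[i]) / (temps[i + 1] - temps[i])
        if slope < steepest:
            steepest = slope
            tm = (temps[i] + temps[i + 1]) / 2
    return tm


# Made-up melting curve: signal collapses between 60 and 65 degrees C.
temps = [50, 55, 60, 65, 70, 75]
signal = [100, 98, 95, 40, 10, 8]
print(melting_temperature(temps, signal))
```

  • Comparing the estimated Tm against the value expected for a perfectly matched probe-target duplex is then what allows matched and mismatched alleles to be told apart.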
  • a variety of other techniques can be used to detect mutations, including but not limited to, the Invader assay with Flap endonuclease (FEN), the Serial Invasive Signal Amplification Reaction (SISAR), the oligonucleotide ligase assay, restriction fragment length polymorphism (RFLP), single-strand conformation polymorphism, temperature gradient gel electrophoresis (TGGE), and denaturing high performance liquid chromatography (DHPLC).
  • the mutation can be identified indirectly by detection of the variant protein produced by the mutant allele.
  • Variant proteins (i.e., containing an amino acid substitution encoded by the mutant allele)
  • immunoassays that can be used to detect variant proteins produced by mutant alleles include, but are not limited to, immunohistochemistry (IHC), western blotting, enzyme-linked immunosorbent assay (ELISA), radioimmunoassays (RIA), "sandwich” immunoassays, fluorescent immunoassays, and immunoprecipitation assays, the procedures of which are well known in the art (see, e.g., Schwarz et al.
  • a probe set is used, wherein the probe set comprises a plurality of allele-specific probes for detecting mutations in the subject's genome.
  • the probe set may comprise one or more allele-specific polynucleotide probes.
  • An allele-specific probe hybridizes to only one of the possible alleles of a gene under suitably stringent hybridization conditions.
  • Individual polynucleotide probes comprise a nucleotide sequence derived from the nucleotide sequence of the target mutated allele sequences or complementary sequences thereof.
  • the nucleotide sequence of the polynucleotide probe is designed such that it corresponds to, or is complementary to the target mutated allele sequences.
  • the allele-specific polynucleotide probe can specifically hybridize under either stringent or lowered stringency hybridization conditions to a region of the target mutated allele sequences, to the complement thereof, or to a nucleic acid sequence (such as a cDNA) derived therefrom.
  • the selection of the allele-specific polynucleotide probe sequences and determination of their uniqueness may be carried out in silico using techniques known in the art, for example, based on a BLASTN search of the polynucleotide sequence in question against gene sequence databases, such as the Human Genome Sequence, UniGene, dbEST or the non-redundant database at NCBI.
  • the allele-specific polynucleotide probe is complementary to the region of a single mutated allele target DNA or mRNA sequence.
  • Computer programs can also be employed to select allele-specific probe sequences that may not cross hybridize or may not hybridize non-specifically.
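  • As a toy stand-in for the in silico uniqueness check described above, the sketch below counts exact occurrences of a candidate probe on both strands of a reference sequence. It is not BLASTN and does not model partial matches or cross-hybridization; the function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def count_hits(probe, reference):
    """Count exact (possibly overlapping) matches of the probe on either
    strand of the reference."""

    def count(pattern):
        hits = 0
        start = reference.find(pattern)
        while start != -1:
            hits += 1
            start = reference.find(pattern, start + 1)
        return hits

    rc = reverse_complement(probe)
    total = count(probe)
    if rc != probe:  # avoid double-counting palindromic probes
        total += count(rc)
    return total


def is_unique(probe, reference):
    """A probe is treated as unique if it matches exactly one site."""
    return count_hits(probe, reference) == 1


reference = "TTGACCATGGCATTTGACC"      # made-up reference sequence
print(is_unique("CATGGC", reference))  # one site
print(is_unique("TTGACC", reference))  # two sites
```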
  • the allele-specific polynucleotide probes of the present invention may range in length from about 15 nucleotides to the full length of the coding target or non-coding target. In one embodiment of the invention, the polynucleotide probes are at least about 15 nucleotides in length. In another embodiment, the polynucleotide probes are at least about 20 nucleotides in length. In a further embodiment, the polynucleotide probes are at least about 25 nucleotides in length. In another embodiment, the polynucleotide probes are between about 15 nucleotides and about 500 nucleotides in length.
  • the polynucleotide probes are between about 15 nucleotides and about 450 nucleotides, about 15 nucleotides and about 400 nucleotides, about 15 nucleotides and about 350 nucleotides, about 15 nucleotides and about 300 nucleotides, about 15 nucleotides and about 250 nucleotides, about 15 nucleotides and about 200 nucleotides in length.
  • In some embodiments, the probes are at least 15 nucleotides in length.
  • the probes are at least 20 nucleotides, at least 25 nucleotides, at least 50 nucleotides, at least 75 nucleotides, at least 100 nucleotides, at least 125 nucleotides, at least 150 nucleotides, at least 200 nucleotides, at least 225 nucleotides, at least 250 nucleotides, at least 275 nucleotides, at least 300 nucleotides, at least 325 nucleotides, at least 350 nucleotides, at least 375 nucleotides in length.
  • the allele-specific polynucleotide probes of a probe set can comprise RNA, DNA, RNA or DNA mimetics, or combinations thereof, and can be single-stranded or double-stranded.
  • the polynucleotide probes can be composed of naturally-occurring nucleobases, sugars and covalent internucleoside (backbone) linkages as well as polynucleotide probes having non-naturally-occurring portions which function similarly.
  • Such modified or substituted polynucleotide probes may provide desirable properties such as, for example, enhanced affinity for a target gene and increased stability.
  • the probe set may comprise a coding target and/or a non-coding target.
  • the probe set comprises a combination of a coding target and non-coding target.
  • a set of allele-specific primers is used, wherein the set of allele-specific primers comprises a plurality of allele-specific primers for detecting mutations in the subject's genome.
  • An allele-specific primer matches the sequence exactly of only one of the possible mutated alleles, hybridizes at the location of the mutation, and amplifies only one specific mutated allele if it is present in a nucleic acid amplification reaction.
  • a pair of primers can be used for detection of a mutated allele sequence.
  • Each primer is designed to hybridize selectively to a single allele at the site of the mutation in the gene under stringent conditions, particularly under conditions of high stringency, as known in the art.
  • the pairs of allele-specific primers are usually chosen so as to generate an amplification product of at least about 50 nucleotides, more usually at least about 100 nucleotides.
  • Algorithms for the selection of primer sequences are generally known, and are available in commercial software packages. These primers may be used in standard quantitative or qualitative PCR-based assays for SNP genotyping of subjects. Alternatively, these primers may be used in combination with probes, such as molecular beacons in amplifications using real-time PCR.
  • a label can optionally be attached to or incorporated into an allele-specific probe or primer polynucleotide to allow detection and/or quantitation of a target mutated allele sequence.
  • the target mutated polynucleotide may be from genomic DNA, expressed RNA, a cDNA copy thereof, or an amplification product derived therefrom, and may be the positive or negative strand, so long as it can be specifically detected in the assay being used.
  • an antibody may be labeled that detects a polypeptide expression product of the mutated allele.
  • labels used for detecting different mutant alleles may be distinguishable.
  • the label can be attached directly (e.g., via covalent linkage) or indirectly, e.g., via a bridging molecule or series of molecules (e.g., a molecule or complex that can bind to an assay component, or via members of a binding pair that can be incorporated into assay components, e.g. biotin-avidin or streptavidin).
  • Many labels are commercially available in activated forms which can readily be used for such conjugation (for example through amine acylation), or labels may be attached through known or determinable conjugation schemes, many of which are known in the art.
  • Detectable labels useful in the practice of the invention may include any molecule or substance capable of detection, including, but not limited to, fluorescers, chemiluminescers, chromophores, bioluminescent proteins, enzymes, enzyme substrates, enzyme cofactors, enzyme inhibitors, isotopic labels, semiconductor nanoparticles, dyes, metal ions, metal sols, ligands (e.g., biotin, streptavidin or haptens) and the like.
  • fluorescer refers to a substance or a portion thereof which is capable of exhibiting fluorescence in the detectable range.
  • Enzyme tags are used with their cognate substrate.
  • the terms also include chemiluminescent labels such as luminol, isoluminol, acridinium esters, and peroxyoxalate and bioluminescent proteins such as firefly luciferase, bacterial luciferase, Renilla luciferase, and aequorin.
  • the terms also include isotopic labels, including radioactive and non-radioactive isotopes, such as ³H, ²H, ¹²⁰I, ¹²³I, ¹²⁴I, ¹²⁵I, ¹³¹I, ³⁵S, ¹¹C, ¹³C, ¹⁴C, ³²P, ¹⁵N, ¹³N, ¹¹⁰In, ¹¹¹In, ¹⁷⁷Lu, ¹⁸F, ⁵²Fe, ⁶²Cu, ⁶⁴Cu, ⁶⁷Cu, ⁶⁷Ga, ⁶⁸Ga, ⁸⁶Y, ⁹⁰Y, ⁸⁹Zr, ⁹⁴ᵐTc, ⁹⁴Tc, ⁹⁹ᵐTc, ¹⁵⁴Gd, ¹⁵⁵Gd, ¹⁵⁶Gd, ¹⁵⁷Gd, ¹⁵⁸Gd, ¹⁵O, ¹⁸⁶Re, ¹⁸⁸Re, ⁵¹Mn, ⁵²ᵐMn, ⁵⁵Co, ⁷²As, ⁷⁵Br, ⁷⁶Br, ⁸²ᵐRb, and ⁸³Sr.
  • microspheres with xMAP technology produced by Luminex (Austin, TX)
  • microspheres containing quantum dot nanocrystals, for example, containing different ratios and combinations of quantum dot colors e.g., Qdot nanocrystals produced by Life Technologies (Carlsbad, CA)
  • glass coated metal nanoparticles see e.g., SERS nanotags produced by Nanoplex Technologies, Inc.
  • SonoVue microbubbles comprising sulfur hexafluoride
  • Optison microbubbles comprising an albumin shell and octafluoropropane gas core
  • Levovist microbubbles comprising a lipid/galactose shell and an air core
  • Perflexane lipid microspheres comprising perfluorocarbon microbubbles
  • Perflutren lipid microspheres comprising octafluoropropane encapsulated in an outer lipid shell
  • magnetic resonance imaging (MRI) contrast agents e.g., gadodiamide, gadobenic acid, gadopentetic acid, gadoteridol, gadofosveset, gadoversetamide, gadoxetic acid
  • radiocontrast agents such as for computed tomography (CT), radiography, or fluoroscopy (e.g., diatrizoic acid, metrizoic acid, iodamide, iotalamic acid,
  • Genotyping may also comprise sequencing nucleic acids from a sample collected from an individual using any convenient sequencing protocol.
  • Sequencing platforms that can be used include but are not limited to: pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, second-generation sequencing, nanopore sequencing, sequencing by ligation, or sequencing by hybridization.
  • Preferred sequencing platforms are those commercially available from Illumina (RNA-Seq) and Helicos (Digital Gene Expression or “DGE”).
  • “Next generation” sequencing methods include, but are not limited to those commercialized by: 1) 454/Roche Lifesciences including but not limited to the methods and apparatus described in Margulies et al., Nature (2005) 437:376-380; and US Patent Nos.
  • Massively parallel sequencing is described e.g. in US 5,695,934, entitled “Massively parallel sequencing of sorted polynucleotides,” and US 2010/0113283 A1, entitled “Massively multiplexed sequencing.” Massively parallel sequencing typically involves obtaining DNA representing an entire genome, fragmenting it, and obtaining millions of random short sequences, which are assembled by mapping them to a reference genome sequence. Commercial services are available that are capable of genotyping approximately 1 million sequences for a fixed fee.
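  • The step of mapping short reads to a reference can be illustrated with a toy seed-and-extend mapper. This is only a sketch under simplifying assumptions (forward strand only, no insertions or deletions, seed taken from the start of the read); production aligners are far more elaborate, and the function names and sequences here are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) of the best placement of the read,
    or None if no placement has at most `max_mismatches` mismatches.
    The first k bases of the read are used as an exact-match seed."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


reference = "ACGTTAGCCGATAGGCTTACG"   # made-up reference
index = build_index(reference, k=4)
read = reference[5:12] + "T"          # 8-base read with one altered base
print(map_read(read, reference, index, k=4))
```

  • A position where many mapped reads consistently disagree with the reference is the kind of signal from which a variant genotype would then be called.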
  • MassARRAY matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS)
  • The Illumina GoldenGate assay generates mutation-specific PCR products that are subsequently hybridized to beads either on a solid matrix or in solution.
  • Three oligonucleotides are synthesized for each mutant: two allele specific oligonucleotides (ASOs) that distinguish the mutated sequence, and a locus specific sequence (LSO) just downstream of the mutation site.
  • ASOs allele specific oligonucleotides
  • LSO locus specific sequence
  • the ASO and LSO sequences also contain target sequences for a set of universal primers, while each LSO also contains a particular address sequence (the "illumicode") complementary to sequences attached to beads.
  • gene duplication or genomic copy number variation is detected.
  • 1, 2, 3, 4, 5, or 6 or more copies of a polynucleotide sequence may be present in the genome of a subject.
  • Copy number variation can be calculated based on "relative copy number" so that apparent differences in gene copy numbers in different samples are not distorted by differences in sample amounts.
  • the relative copy number of a gene (per genome) can be expressed as the ratio of the copy number of a target gene to the copy number of a reference polynucleotide sequence in a DNA sample.
  • the reference polynucleotide sequence can be a sequence having a known genomic copy number. Typically the reference sequence will have a single genomic copy and is a sequence that is not likely to be amplified or deleted in the genome. It is not necessary to empirically determine the copy number of a reference sequence in each assay. Rather, the copy number may be assumed based on the normal copy number in the organism of interest.
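The relative copy number calculation described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular assay: the function name, the signal values, and the default of one reference copy are assumptions made for the example.

```python
def relative_copy_number(target_signal, reference_signal, reference_copies=1):
    """Estimate target copies per genome as the ratio of the target signal to
    the reference signal, scaled by the assumed copy number of the reference."""
    if reference_signal <= 0:
        raise ValueError("reference signal must be positive")
    return reference_copies * target_signal / reference_signal


# Hypothetical measurements from two samples that differ tenfold in DNA input
# but carry the same underlying number of target copies.
sample_a = relative_copy_number(target_signal=300.0, reference_signal=100.0)
sample_b = relative_copy_number(target_signal=30.0, reference_signal=10.0)
```

Because both samples are normalized to their own reference signal, the tenfold difference in input amount does not change the estimate.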
  • one or more pattern recognition methods can be used in automating analysis of genetic data and generating a predictive model.
  • the predictive models and/or algorithms can be provided in a machine readable format and may be used to correlate genetic variants identified in a patient with a disease state, medically relevant trait, or a change in a clinical biomarker measurement.
  • Generating the predictive model may comprise, for example, the use of an algorithm or classifier.
  • a machine learning algorithm is used in generating the predictive model.
  • the machine learning algorithm may comprise a supervised learning algorithm.
  • supervised learning algorithms may include Average One-Dependence Estimators (AODE), Artificial neural network (e.g., Backpropagation), Bayesian statistics (e.g., Naive Bayes classifier, Bayesian network, Bayesian knowledge base), Case-based reasoning, Decision trees, Inductive logic programming, Gaussian process regression, Group method of data handling (GMDH), Learning Automata, Learning Vector Quantization, Minimum message length (decision trees, decision graphs, etc.), Lazy learning, Instance-based learning, Nearest Neighbor Algorithm, Analogical modeling, Probably approximately correct (PAC) learning, Ripple down rules, a knowledge acquisition methodology, Symbolic machine learning algorithms, Subsymbolic machine learning algorithms, Support vector machines, Random Forests, Ensembles of classifiers, Bootstrap aggregating (bagging), and Boosting.
  • AODE Average One-Dependence Estimators
  • Supervised learning may comprise ordinal classification such as regression analysis and Information fuzzy networks (IFN).
  • supervised learning methods may comprise statistical classification, such as AODE, Linear classifiers (e.g., Fisher's linear discriminant, Logistic regression, Naive Bayes classifier, Perceptron, and Support vector machine), quadratic classifiers, k-nearest neighbor, Boosting, Decision trees (e.g., C4.5, Random forests), Bayesian networks, and Hidden Markov models.
  • the machine learning algorithm may also comprise an unsupervised learning algorithm.
  • unsupervised learning algorithms may include artificial neural network, Data clustering, Expectation-maximization algorithm, Self-organizing map, Radial basis function network, Vector Quantization, Generative topographic map, Information bottleneck method, and IBSEAD.
  • Unsupervised learning may also comprise association rule learning algorithms such as Apriori algorithm, Eclat algorithm and FP-growth algorithm.
  • Hierarchical clustering such as Single-linkage clustering and Conceptual clustering, may also be used.
  • unsupervised learning may comprise partitional clustering such as K-means algorithm and Fuzzy clustering.
  • the machine learning algorithms comprise a reinforcement learning algorithm.
  • reinforcement learning algorithms include, but are not limited to, temporal difference learning, Q-learning and Learning Automata.
  • the machine learning algorithm may comprise Data Pre-processing.
  • the machine learning algorithms include, but are not limited to, Average One-Dependence Estimators (AODE), Fisher's linear discriminant, Logistic regression, Perceptron, Multilayer Perceptron, Artificial Neural Networks, Support vector machines, Quadratic classifiers, Boosting, Decision trees, C4.5, Bayesian networks, Hidden Markov models, High-Dimensional Discriminant Analysis, and Gaussian Mixture Models.
  • the machine learning algorithm may comprise support vector machines, Naive Bayes classifier, k-nearest neighbor, high-dimensional discriminant analysis, or Gaussian mixture models.
  • the machine learning algorithm comprises Random Forests.
  • the predictive model is based on at least one polygenic risk score for a genetic association with a size effect on a clinical biomarker measurement and at least one polygenic risk score for a genetic association with a disease or medically relevant trait, wherein a combined risk score is calculated (see Examples).
  • Such combined polygenic risk scores generally better predict the risk of an individual developing the disease or the medically relevant trait than the separate polygenic risk scores.
  • the invention includes a computer implemented method for predicting the risk of an individual developing a polygenic disease or medically relevant trait.
  • the computer performs steps comprising a) receiving genome sequencing data for an individual; b) identifying variant alleles present in the genome of the individual from the genome sequencing data; c) calculating at least one polygenic risk score based on the variant alleles present in the individual using a database comprising correlation data for associations between genetic variants and diseases or medically relevant traits based on genome-wide testing of a population for genetic variants associated with the disease or the medically relevant trait, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait; and d) displaying information regarding the risk of the individual developing the disease or the medically relevant trait.
  • PRS polygenic risk score
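Step c) of the computer implemented method can be illustrated with the most common additive form of a polygenic risk score: a sum of effect-allele counts weighted by per-variant effect sizes. The sketch below assumes that form; the variant identifiers, weights, and dosages are hypothetical and do not come from the tables referenced in this document.

```python
def polygenic_risk_score(dosages, weights):
    """Sum effect-allele dosages (0, 1, or 2) weighted by per-variant effect
    sizes, skipping variants that have no entry in the weight database."""
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


# Hypothetical effect sizes (e.g. log odds ratios) from an association database.
weights = {"var1": 0.20, "var2": -0.10, "var3": 0.05}
# Hypothetical individual: number of effect alleles carried at each variant.
dosages = {"var1": 2, "var2": 1, "var3": 0, "var_unknown": 1}

score = polygenic_risk_score(dosages, weights)
```

Variants absent from the weight database contribute nothing, which is one simple way to handle variants that were not tested in the reference population.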
  • the individual has a plurality of variant alleles selected from Tables 5-10 and 13.
  • the database comprises correlation data between genetic variants and clinical biomarkers, diseases, and medically relevant traits, wherein the correlation data is selected from Tables 4-10 and 13.
  • the computer implemented method further comprises: a) generating a predictive model using one or more algorithms, wherein the predictive model is based on at least one PRS for a genetic association with a size effect on a clinical biomarker measurement and at least one PRS for a genetic association with a disease or a medically relevant trait; and b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS.
  • one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm.
  • a machine learning algorithm may be used including without limitation a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
  • the computer implemented method further comprises storing the information regarding the risk of the individual developing the disease or the medically relevant phenotypic trait in a database.
  • the method can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware.
  • the disclosed and other embodiments can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, a data processing apparatus.
  • the computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or any combination thereof.
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program does not necessarily correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • a system for performing the computer implemented method, as described includes a computer containing a processor, a storage component (i.e., memory), a display component, and other components typically present in general purpose computers.
  • the storage component stores information accessible by the processor, including instructions that may be executed by the processor and data that may be retrieved, manipulated or stored by the processor.
  • the storage component includes instructions.
  • the storage component includes instructions for predicting the risk of an individual developing a disease or medically relevant phenotypic trait based on analysis of genomic sequencing data stored therein.
  • the computer processor is coupled to the storage component and configured to execute the instructions stored in the storage component in order to receive genome sequencing data and analyze the data according to one or more algorithms, as described herein.
  • the display component displays information regarding the risk of the individual developing the disease or the medically relevant trait.
  • the storage component may be of any type capable of storing information accessible by the processor, such as a hard-drive, memory card, ROM, RAM, DVD, CD-ROM, USB Flash drive, write-capable, and read-only memories.
  • the processor may be any well-known processor, such as processors from Intel Corporation. Alternatively, the processor may be a dedicated controller such as an ASIC.
  • the instructions may be any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor.
  • the terms “instructions,” “steps” and “programs” may be used interchangeably herein.
  • the instructions may be stored in object code form for direct processing by the processor, or in any other computer language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance.
  • Data may be retrieved, stored or modified by the processor in accordance with the instructions.
  • the data may be stored in computer registers, in a relational database as a table having a plurality of different fields and records, XML documents, or flat files.
  • the data may also be formatted in any computer-readable format such as, but not limited to, binary values, ASCII or Unicode.
  • the data may comprise any information sufficient to identify the relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories (including other network locations) or information which is used by a function to calculate the relevant data.
  • the processor and storage component may comprise multiple processors and storage components that may or may not be stored within the same physical housing.
  • some of the instructions and data may be stored on removable CD-ROM and others within a read-only computer chip. Some or all of the instructions and data may be stored in a location physically remote from, yet still accessible by, the processor.
  • the processor may comprise a collection of processors which may or may not operate in parallel.
Kits
  • Kits are also provided for carrying out the methods described herein.
  • the kit comprises software for carrying out the computer implemented methods for predicting the risk of an individual developing a disease or medically relevant trait, as described herein.
  • the kit comprises a diagnostic system for predicting the risk of an individual developing a disease or medically relevant trait, as described herein.
  • the kit further comprises a container for collecting a DNA sample from an individual.
  • the kit may also include reagents for purifying and/or sequencing a DNA sample.
  • kits may further include (in certain embodiments) instructions for practicing the subject methods.
  • These instructions may be present in the subject kits in a variety of forms, one or more of which may be present in the kit.
  • instructions may be present as printed information on a suitable medium or substrate, e.g., a piece or pieces of paper on which the information is printed, in the packaging of the kit, in a package insert, and the like.
  • Another form of these instructions is a computer readable medium, e.g., diskette, compact disk (CD), flash drive, and the like, on which the information has been recorded.
  • Yet another form of these instructions that may be present is a website address which may be used via the internet to access the information at a removed site.
  • Type 2 diabetes is characterized by progressive loss of insulin sensitivity and is diagnosed through HbA1c, a modification to red blood cells induced by long term exposure to high serum glucose.
  • HbA1c a modification to red blood cells induced by long term exposure to high serum glucose.
  • PTVs protein truncating variants
  • PTVs outside MHC region have large estimated lowering effects (>0.1 sd) across at least one of the biomarkers, including: three PTVs in APOB with a range of strong effects on LDL (1.9-3.4 sd), Apolipoprotein B (2.2-2.8 sd), and triglycerides (1.3 sd); two PTVs in GPT with strong effects on alanine aminotransferase (>1.35 sd); a PTV in IQGAP2 and ALB with strong effects on albumin (>0.27 sd); three PTVs in GPLD1 (>0.85 sd) and a PTV in ALPL with effects on alkaline phosphatase (2.35 sd); PTVs in APOA5 (0.40 and 0.56 sd), CHFT8 (0.41 and 0.40 sd), and LCAT (1.34 and 1.48 sd) with effects on Apolipoprotein
  • PTVs outside MHC region have large estimated raising effects (>0.1 sd, Table 5) across at least one of the biomarkers, including: PTV in LIPC, PDE3B, and LPL with effects on Apolipoprotein A and HDL (>0.18 sd); PTVs in FUT2 and RAP1GAP with effects on alkaline phosphatase (0.12 sd); PTV in ABCG8 with effect on cholesterol (0.21 sd); PTVs in RNF186 and SLC22A2 with effect on creatinine (0.35 and 0.50 sd, respectively); PTVs in SLCO1B1, UGT1A10 with effects on direct and total Bilirubin (0.37, 0.34, 0.40 sd, respectively); PTVs in RORC, SIGLEC1, and UPB1 with effects on gamma glutamyltransferase (0.21, 0.11, and 0.32 sd, respectively
  • the human leukocyte antigen (HLA) region of the genome is one of the most polymorphic and gene-dense regions of the human genome, with on the order of thousands of alleles for any given gene in the region 16,17.
  • HLA human leukocyte antigen
  • CNV Copy number variations
  • genotype is an indicator variable for an individual having a rare CNV (AF < 0.1%) overlapping within 10kb of the gene region as defined by HGNC, for 23,598 genes.
  • HGNC HGNC
  • HNF1B is a membrane bound transcription factor part of the family of hepatocyte nuclear factors, believed to play a role in nephronal (renal) and pancreatic development. Previous studies have associated mutations in HNF1B with maturity onset diabetes of the young (MODY) and altered kidney function 22 .
  • GGT5 is key to glutathione homeostasis because it provides substrates for glutathione synthesis 25 .
  • CST3 encodes the Cystatin-C protein, which belongs to the type II cystatin gene family and is a potent inhibitor of lysosomal proteinases 26 .
  • To characterize the heritability of the 38 lab phenotypes we first applied LD-score regression 27. We further applied the Heritability Estimator from Summary Statistics (HESS), an approach for estimating the phenotype variance explained by all typed SNPs at a single locus in the genome while accounting for LD among the SNPs 28,29. Both LD-score regression and HESS indicate that common SNPs explain a large fraction of the heritability (0.38% to 18.49% across the studied phenotypes, FIG. 2D). We compare the polygenicity of all 38 lab phenotypes by computing the fraction of total SNP heritability attributable to loci by the top 1% of SNPs.
  • HESS Heritability Estimator from Summary Statistics
  • liver fat percentage a quantitative measure derived from costly MRI images of the liver. Liver fat is driven by a combination of alcohol use and metabolic disorder 40 . Only 4,617 individuals thus far have quantified LFP in UK Biobank.
  • Predictive models including polygenic risk scores for biomarkers in addition to trait PRS highlight the potential that exists in deriving joint predictive models based on training on multiple responses, which we anticipate will improve resolution in dissecting drivers of disease risk in an individual. Integration with independent population biobanks will help elucidate the extent to which these combined risk models can be transferred.
  • Estradiol higher than 212 pmol/L We treated individuals beyond the detection limit for those laboratory measurements as cases in those four binary phenotypes, and below the detection limit as controls, as reported by the corresponding reportability fields.
  • statins 1141146234, atorvastatin; 1141192414, crestor 10mg tablet; 1140910632, eptastatin; 1140888594, fluvastatin; 1140864592, lescol 20mg capsule; 1141146138, lipitor 10mg tablet; 1140861970, lipostat 10mg tablet; 1140888648, pravastatin; 1141192410, rosuvastatin; 1141188146, simvador 10mg tablet; 1140861958, simvastatin; 1140881748, zocor 10mg tablet; 1141200040, zocor heart-pro 10mg tablet).
  • statins were identified in the UK Biobank for the purposes of adjusting by the estimated factor: 1140861958, simvastatin; 1140888594, fluvastatin; 1140888648, pravastatin; 1141146234, atorvastatin; 1141192410, rosuvastatin; 1140861922, lipid lowering drug; 1141146138, lipitor 10mg tablet.
  • --glm cols=chrom,pos,ref,alt,altfreq,firth,test,nobs,orbeta,se,ci,t,p hide-covar --pgen <imputed PGEN> --remove <non-White British individuals> --keep <all individuals, males, or females> --geno 0.1 --hwe 1e-50 midp;
  • the HLA data from the UK Biobank contains all HLA loci (one line per person) in a specific order (A, B, C, DRB5, DRB4, DRB3, DRB1, DQB1, DQA1, DPB1, DPA1).
  • HLA:IMP*2 program Resource 182 - CITE
  • the Biobank reports one value per imputed allele, and only the best-guess alleles are reported.
  • Bayesian Model Averaging is a model selection method that trains a variety of models, one on each possible subset of alleles. The posterior probability of each model being the correct one given the data is determined, and subsequently, a BIC per model is calculated. The degree to which an allele is included across models (posterior probability) is then deemed a measure of confidence in the association between allele and phenotype.
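The Bayesian Model Averaging procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it fits a linear model on every subset of alleles, approximates each model's posterior probability from its BIC with equal prior weight on all models (a common approximation), and sums those probabilities to obtain a per-allele inclusion probability. The allele names, dosages, and phenotype values are hypothetical.

```python
import itertools
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_ss(columns, y):
    """Residual sum of squares of y regressed on the given columns, computed
    by Gram-Schmidt orthogonalization of the columns."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        if dot(v, v) > 1e-12:  # skip columns that add no new direction
            basis.append(v)
    resid = list(y)
    for b in basis:
        coef = dot(resid, b) / dot(b, b)
        resid = [ri - coef * bi for ri, bi in zip(resid, b)]
    return dot(resid, resid)


def bma_inclusion(alleles, y):
    """Return (model probabilities, per-allele inclusion probabilities)."""
    n = len(y)
    names = sorted(alleles)
    bics = {}
    for size in range(len(names) + 1):
        for subset in itertools.combinations(names, size):
            cols = [[1.0] * n] + [alleles[name] for name in subset]
            rss = residual_ss(cols, y)
            bics[subset] = n * math.log(rss / n) + len(cols) * math.log(n)
    best = min(bics.values())
    raw = {s: math.exp(-0.5 * (b - best)) for s, b in bics.items()}
    total = sum(raw.values())
    model_prob = {s: w / total for s, w in raw.items()}
    inclusion = {
        name: sum(p for s, p in model_prob.items() if name in s)
        for name in names
    }
    return model_prob, inclusion


# Hypothetical allele dosages; the phenotype tracks allele_A plus small noise.
alleles = {
    "allele_A": [0, 1, 2, 0, 1, 2, 0, 1],
    "allele_B": [1, 0, 0, 1, 1, 0, 0, 1],
}
phenotype = [0.1, 0.9, 2.05, -0.05, 1.1, 1.9, 0.05, 0.95]

model_prob, inclusion = bma_inclusion(alleles, phenotype)
```

Enumerating every subset is only feasible for a small number of alleles, since the number of models doubles with each allele added.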
  • CNVs were called by applying PennCNV v1.0.4 on raw signal intensity data from each array within each genotyping batch as previously described 20 , with the notable difference that here, all analyses are conducted within the white British unrelated cohort described above.
  • Data for phenome-wide associations were derived from UK Biobank data fields corresponding to body measurements, biomarkers, disease diagnoses, and medical procedures from medical records, as well as a questionnaire about lifestyle and medical history. Methods for CNV GWAS and burden testing are as previously described.
  • We performed standard stage 1 fitting 28, then removed all regions which contained no SNPs with MAF > 5% (5/~1700 bins genome wide) and generated stage 2 estimates from the resulting matrices. We used the same munged sumstats described above. We confirmed heritability estimates of select associations using GCTA-GREML and genotyped array variants on a subset of individuals (data not shown) to ensure estimates were comparable to this model.
  • Protein-truncating variants with at least one significant association (P < 1e-7) with the
  • GBE Global Biobank Engine
  • Mendelian randomization methods enable estimation of causal effects between an exposure X and an outcome Y.
  • Given a set of genetic instruments of X (i.e., direct causes of X that are not affected by confounders),
  • the causal effect of X on Y can be extracted by analyzing their associations with both X and Y.
  • Most methods are based on linear models and start with a 2D plot of the association summary statistics.
  • a meta-analysis is then used to estimate if there is a significant correlation between the effects, which then translates into a line whose slope reveals the causal effect.
  • MR-Egger is a powerful method that uses Egger regression for the meta-analysis 5 .
  • Egger regression was developed originally for correcting publication bias in meta-analyses, but the problem is analogous to adjusting bias from pleiotropy in the MR setting.
  • Egger-regression provides a way to both estimate and adjust for biases in the 2D plot that originate from pleiotropic effects (under the assumption that the association of each genetic instrument with the exposure is independent of the pleiotropic effect of the variant).
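The regression at the core of MR-Egger can be sketched as a weighted linear fit of the instrument-outcome effects on the instrument-exposure effects, where the slope is read as the causal effect and the intercept as the average pleiotropic bias. The sketch below is a simplified illustration: it uses inverse-variance weights from the outcome standard errors and omits standard errors and significance tests for the estimates. The function name and summary statistics are hypothetical.

```python
def mr_egger(beta_exposure, beta_outcome, se_outcome):
    """Weighted regression of outcome effects on exposure effects.
    Returns (slope, intercept): causal estimate and pleiotropy term."""
    # Orient every instrument so that its effect on the exposure is positive.
    bx, by = [], []
    for x, y in zip(beta_exposure, beta_outcome):
        sign = -1.0 if x < 0 else 1.0
        bx.append(sign * x)
        by.append(sign * y)
    # Inverse-variance weights from the outcome standard errors.
    w = [1.0 / s ** 2 for s in se_outcome]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, bx)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, by)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, bx))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, bx, by))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical summary statistics built so that, after orientation,
# outcome effect = 0.02 + 0.5 * exposure effect for every instrument.
beta_exposure = [0.1, -0.2, 0.3, 0.4]
beta_outcome = [0.07, -0.12, 0.17, 0.22]
se_outcome = [0.01, 0.02, 0.01, 0.02]

causal_effect, pleiotropy = mr_egger(beta_exposure, beta_outcome, se_outcome)
```

A nonzero intercept signals that the instruments affect the outcome through paths other than the exposure, which is the bias the Egger adjustment is meant to absorb.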
  • LCV is a recent method that makes use of the MR graphical model to evaluate if an observed genetic correlation can be attributed to a causal relationship 11 .
  • LCV is based on a 2D analysis of summary statistics as in MR methods, with two notable differences. First, it uses a latent variable to model the mediation of genetic correlation between two traits. This allows for the estimation of the full or partial proportion of genetic causal relation between two traits. Second, it takes as input all summary statistics and does not require a set of independent instruments. On the other hand, unlike MR methods, LCV does not address reverse causality, and it does not estimate causal effect sizes.
  • AnyAntidiabetic is defined as any non-insulin drug from the oral antidiabetics and metformin codes presented in Eastwood et al; T2D is the definition of type 2 diabetes presented in Eastwood et al; fasting glucose is the glucose measurement for the individuals with a self-reported fasting time between 8 and 24 hours; HighConfDiabetes is a combination of self-report and ICD codes presented in (DeBoever et al.
  • GenericMetformin is just using Metformin and its generic forms
  • FamilyHistoryDiabetes is defined as 0 or 1 depending on whether the individual has self-reported a father, mother, or sibling with diabetes
  • HbA1c.diabetic is defined as a binary indicator of the individual having a
  • the non-coding variants characterized on the imputed 1000 Genomes Phase I variants (ID, variant), their positions in centimorgans (CM) and its association to the lab phenotype (trait). Effect size allele (A1), estimated effect size (BETA), standard error (SE), p-value of association (P), minor allele frequency (MAF), whether the variant is outside of MHC region (is_outside_of_MHC), gene symbol (Gene Symbol), and absolute value of estimated effect size deviates from the standard deviation range estimated from linear fit between log minor allele frequency and absolute value of estimated effect size (outlier, see methods for more details).
  • Tables enumerate associations’ BETA, SE, T/Z STAT values (depending on the type of test), P values from PLINK, and the same P values that have been Benjamini-Yekutieli adjusted (BY_ADJ_P).
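For reference, the Benjamini-Yekutieli adjustment named above can be written out directly. This sketch follows the standard step-up definition, in which each sorted p-value is scaled by m times the harmonic sum c(m) divided by its rank, and monotonicity is then enforced from the largest p-value down. The function name and input p-values are hypothetical.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running
    # minimum so that adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for three associations.
by_adj_p = benjamini_yekutieli([0.04, 0.001, 0.5])
```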
  • Table 9 Copy number variation associated to the 38 lab phenotypes. Bonferroni p < 0.05/10000. Columns in the provided data file correspond to the phenotype, chromosome and centroid position of each CNV tested, CNV ID (formatted as chrom:bp1-bp2_del/dup, with del denoted by - and dup by +), reference copy number (always N), alternate CNV (always denoted by +), tested “allele” (usually +), genotype model (ADD is additive), N, estimated beta/log odds ratio, standard error of estimate, t/z-statistic, and p-value.
  • CNV ID formatted as chrom:bp1-bp2_del/dup (del denoted by - and dup by +)
  • reference copy number always N
  • alternate CNV always denoted by +
  • tested “allele” usually +
  • genotype model ADD is additive
  • N estimated beta/log odds ratio
  • standard error of estimate
  • Bonferroni p < 0.01/25000. Columns in the provided data file correspond to the phenotype, chromosome and centroid position of each gene tested, gene name, reference copy number (always N), burden of CNV (always denoted by +), tested “allele” (usually +), genotype model (ADD is additive), N, estimated beta/log odds ratio, standard error of estimate, t/z-statistic, and p-value.
  • Genomic region enrichment analysis tool applied to summary statistic data from 38 lab phenotypes and the mouse genome informatics (MGI) phenotype ontology.
  • the lab test phenotype (Trait), the enriched mouse phenotype ontology term (Ontoloty_term_ID, Ontoloty_term), its rank (Rank), -log10(GREAT binomial test P-value) (log10BPval), the fold change in the GREAT binomial test (BFold), and the link to the Mouse Genome Informatics website for the enriched ontology term (MGI_URL).
  • the variant and their ID (Variant, Variant_ID) and its association to disease outcomes (Phenotype) with the corresponding Global Biobank Engine phenotype ID (GBE_ID).
  • the -log10 p-value of association (log10P), estimated effect size (log odds ratio, LOR), standard error of effect size estimate (SE), Gene Symbol (Gene_symbol), predicted protein-truncating or protein-altering variant (Csq), predicted major consequence (Consequence), whether the variant is outside of MHC region (is_outside_of_MHC), whether the variant is LD independent based on LD pruning (ld_indep), and the URLs for the corresponding pages on Global Biobank Engine (GBE_variant_page and GBE_phenotype_page).
  • Table 15 Causal inference results using MR-Egger and LCV.
  • Each row represents a significant exposure-outcome pair by either MR-Egger or LCV (FDR 10%).
  • the edge type marks if the causal link was found by MR-Egger only, LCV only, or both. Estimated causal effects are presented for all pairs.
  • the laboratory phenotype (Phenotype), whether the phenotype is binary (bin) or quantitative (qt), evaluated population (population), the increments of predictive performance (AUC for binary traits and R for quantitative traits) from covariate-only model to the model with both covariates and genotypes (delta_R_or_AUC), predictive performance measures of the model with genotype and covariates (Genotype_and_covariates), the model with covariates (Covariates_only), and the model with genotypes (Genotype_only), and their trans-populational comparison with respect to White-British population shown in percent (Relative_to_WB_delta_R_or_AUC, Relative_to_WB_Genotype_and_covariates, Relative_to_WB_Covariates_only, and Relative_to_WB_Genotype_only).
  • Table 17 Population-specific bias in polygenic prediction of the 38 lab phenotypes.
  • the rank of the increments in predictive performance comparing the PRS model with both genotype and covariates and covariate alone across 5 population groups are summarized. The sum across population for a given rank varies due to the ties in the ranks.
  • Table 18 Predictive power of multiple regression of laboratory tests. Each trait is treated independently and a regression model (linear or logistic, determined by outcome) is used. McFadden’s adjusted R^2 (for binary outcomes) and adjusted R^2 (for continuous outcomes) are presented for models which contain just covariates or covariates with the traits of interest. All regressions were run with age, sex, genotyping array, 40 principal components of the genotyping matrix, age squared, Townsend deprivation index, and age-sex interaction. Type 2 diabetes additionally had covariates of BMI and Waist to Hip ratio and interactions of each with age and sex, and liver fat percentage has covariates of alcohol and interactions with age and sex.
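As a reminder of what McFadden's adjusted R^2 measures for the binary outcomes, the sketch below computes it from predicted probabilities using the usual definition, 1 - (lnL_model - K) / lnL_null, where the null model predicts the overall case fraction for everyone. Treating K as the number of predictors is an assumption of this example (conventions differ on whether the intercept is counted), and the outcomes and fitted probabilities are hypothetical.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y under probabilities p."""
    return sum(math.log(pi) if yi else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def mcfadden_adjusted_r2(y, p_model, n_predictors):
    """McFadden's adjusted pseudo R^2 relative to an intercept-only model."""
    p_null = sum(y) / len(y)
    ll_null = log_likelihood(y, [p_null] * len(y))
    ll_model = log_likelihood(y, p_model)
    return 1.0 - (ll_model - n_predictors) / ll_null


# Hypothetical binary outcomes and fitted probabilities from a one-predictor model.
outcomes = [1, 0, 1, 0]
fitted = [0.9, 0.1, 0.9, 0.1]

r2_model = mcfadden_adjusted_r2(outcomes, fitted, n_predictors=1)
r2_null = mcfadden_adjusted_r2(outcomes, [0.5] * 4, n_predictors=0)
```

The penalty term K lowers the statistic for every added predictor, so a model only scores higher than a smaller one if the gain in log-likelihood exceeds the number of terms added.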
  • Table 20 Regression coefficients for prediction of liver fat percentage. Regression coefficient terms and their standard errors estimated from individual liver fat percentage. All terms included in the full regression model are present in the table.
  • Biobank Biomarker Project Companion Document to Accompany Serum Biomarker Data. UK Biobank Document Showcase (2019).
  • NIDDK Quick Reference on UACR & GFR In Evaluating Patients with Diabetes for Kidney Disease. NIDDK (03/2012). Available at: niddk.nih.gov/health-information/professionals/clinical-tools-patient-education-outreach/quick-reference-uacr-gfr. (Accessed: 19th April 2019)
  • NCBI NCBI. Available at: ncbi.nlm.nih.gov/pubmed/19414839. (Accessed: 6th May 2019)
  • PRSs polygenic risk scores
  • Our polygenic risk score database includes both publicly available, published results of genetic association studies, as well as novel datasets generated specifically for this purpose. This includes quantitative measures of health, including lipid measurements, glucose and HbA1c measurements, creatinine in serum and urine, cystatin C, potassium, and other proteins, metabolites, and elements; physiological measures such as pulse rate, blood pressure, EKG test results, blood oxygen, and other quantitative and quantized measurements of overall body state; anthropometrics such as height, weight, BMI, fat mass, waist circumference, lung capacity, grip strength, gait, and related indicators of physical state and ability; direct tests of derived cell lines, extracted samples, or other biological materials for proliferation, quantity, gene expression, protein expression, telomere length, methylation state, mitochondrial DNA content, organelle morphology, cellular morphology, chromatin structure, chromatin state, response to perturbation, response to stimulation, or any other specific assay of interest in the given sample or for the given disease or any of its comorbidities, risk factors or associated
  • the predictive model provides a way of aggregating information from multiple PRSs to maximize performance on a single target trait.
  • the simplest version of this consists of predicting individual phenotypes using a regression model to weight information from the different polygenic scores. This can also be done using more advanced machine learning methods to aggregate information, including through random forest or deep neural network approaches.
  • Model fitting can consist of multiple stages, in which model selection is done in the presence of adjusting covariates, or meta-analyzed across multiple studies or in different temporal collections of the same study; within an individual, in which multiple measurements are used along with information about the individual’s state (e.g. drugs taken, major surgeries undergone, etc.); or in proxies of individuals, such as their relatives or geographically, socially, economically, and/or behaviorally similar individuals are aggregated to provide an estimate of effects of each polygenic score on the outcomes of a given person.
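The simplest aggregation described above, a regression that weights information from several polygenic scores, can be sketched as follows. This is a minimal illustration using ordinary least squares on two scores; the helper names and the training data are hypothetical, and a real model would typically also include adjusting covariates and use a logistic link for binary outcomes.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical training data: (trait PRS, biomarker PRS) for five individuals.
prs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
# Phenotype constructed as 1 + 2 * trait PRS + 3 * biomarker PRS.
phenotype = [1.0, 3.0, 4.0, 6.0, 8.0]

intercept, w_trait, w_biomarker = fit_least_squares(prs, phenotype)


def combined_score(trait_prs, biomarker_prs):
    """Combined risk score for a new individual from the fitted weights."""
    return intercept + w_trait * trait_prs + w_biomarker * biomarker_prs
```

The fitted weights express how much each polygenic score contributes to the target trait, and the same design extends to more scores by adding columns.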

Abstract

Methods, systems, and devices, including computer programs encoded on a computer storage medium are provided for predicting the risk of an individual developing a polygenic disease or medically relevant trait. Most common diseases are caused by dysregulation of multiple genes. In particular, methods are provided for using genetic information based on the detection of multiple genetic variants in an individual for diagnosing polygenic diseases, correlating phenotypic characteristics with genetic data, and predicting the risk of developing a disease or medically relevant condition by analyzing polygenic contributions to the disease and underlying changes in physical traits and clinically measured biomarkers.

Description

METHODS FOR DIAGNOSIS OF POLYGENIC DISEASES AND PHENOTYPES FROM GENETIC VARIATION
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0001] This invention was made with government support under contract HG008140 awarded by the National Institutes of Health. The government has certain rights in the invention.
BACKGROUND
[0002] Prediction of polygenic diseases and phenotypes from genetic variation is an important problem in many contexts, including clinical genetics, direct-to-consumer applications, and agriculture. However, this remains a challenging problem, in large part because genetic risk is determined by very large numbers of genetic variants, each with a small effect whose size is difficult to estimate.
SUMMARY
[0003] Methods, systems, and devices, including computer programs encoded on a computer storage medium are provided for predicting the risk of an individual developing a polygenic disease or medically relevant trait. Most common diseases are caused by dysregulation of multiple genes. A predictive model is provided that estimates the risk of developing a disease or medically relevant condition by analyzing polygenic contributions to the disease and underlying changes in physical traits and clinically measured biomarkers.
[0004] In one aspect, a method of predicting the risk of an individual developing a polygenic disease or medically relevant trait is provided, the method comprising: a) providing a database comprising correlation data for associations between genetic variants and the disease or medically relevant trait based on genome-wide testing of a population for genetic variants associated with the disease or the medically relevant trait; b) genotyping the individual to determine if the individual has one or more of the genetic variants associated with the disease or the medically relevant phenotypic trait; c) calculating at least one polygenic risk score based on the genetic variants detected in the individual by genotyping, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait.
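As a minimal sketch of step c), a polygenic risk score is commonly computed as a weighted sum of effect-allele dosages. The variant identifiers, weights, and the function name `polygenic_risk_score` below are hypothetical and chosen only for illustration; the handling of missing genotypes is an assumption, not something the method above specifies.

```python
# Hypothetical per-allele effect sizes taken from an association database.
# Variant IDs and values are invented for illustration only.
effect_sizes = {
    "variant_1": 0.12,
    "variant_2": -0.30,
    "variant_3": 0.05,
}


def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1, or 2 copies per variant).

    Variants that have no weight in the database are skipped. This is an
    assumed policy; a real pipeline would need an explicit rule for missing
    or unmatched genotypes.
    """
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


# Genotype of one hypothetical individual: copies of the effect allele.
individual = {"variant_1": 2, "variant_2": 1, "variant_3": 0}
score = polygenic_risk_score(individual, effect_sizes)
print(f"PRS = {score:.3f}")
```

A higher score indicates a larger aggregate genetic contribution toward the trait under this additive model; turning the score into a risk estimate requires calibration against a reference population.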
[0005] In certain embodiments, the genetic variants are selected from the group consisting of protein-truncating variants (PTVs), protein-altering variants, non-coding variants, human leukocyte antigen (HLA) allelotypes, and copy number variations (CNVs).
[0006] In certain embodiments, the individual has at least one protein truncating variant (PTV), copy number variation (CNV), or human leukocyte antigen (HLA) allele that correlates with a size-effect change in a measurement of at least one clinical biomarker in the individual compared to that of the clinical biomarker in a control subject having a wild-type allele.
[0007] In certain embodiments, the individual has a plurality of variant alleles selected from Tables 5-10 and 13.
[0008] In certain embodiments, the individual has at least one HLA allele selected from Tables 8a and 8b.
[0009] In certain embodiments, the individual has at least one CNV selected from Tables 9 and 10.
[0010] In certain embodiments, the individual has at least one PTV selected from the group consisting of:
a) a PTV in APOB that correlates with decreased levels of LDL, apolipoprotein B or triglycerides;
b) a PTV in GPT that correlates with decreased levels of alanine aminotransferase;
c) a PTV in IQGAP2 and ALB that correlates with decreased levels of albumin;
d) a PTV in GPLD1 and ALPL that correlates with decreased levels of alkaline phosphatase;
e) a PTV in APOA5, CHFT8, and LCAT that correlates with decreased levels of apolipoprotein A and HDL;
f) a PTV in ZNF229 that correlates with decreased levels of apolipoprotein B;
g) a PTV in PDE3B that correlates with decreased levels of apolipoprotein B or triglycerides;
h) a PTV in CST3 that correlates with decreased levels of cystatin C or triglycerides; a PTV in SAG that correlates with decreased levels of bilirubin;
i) a PTV in SLC22A2 and RNF186 that correlates with decreased levels of estimated glomerular filtration rate (eGFR);
j) a PTV in RHAG and G6PC2 that correlates with decreased levels of glucose or HbA1c;
k) a PTV in MSR1 that correlates with decreased levels of IGF1;
l) a PTV in LPA that correlates with decreased levels of lipoprotein A;
m) a PTV in TNFRSF13B that correlates with decreased levels of non-albumin protein;
n) a PTV in ANGPTL8 and LPL that correlates with decreased levels of triglycerides;
o) a PTV in DRD5, PDZK1, or SLC22A12 that correlates with decreased levels of urate;
p) a PTV in INSC that correlates with decreased levels of vitamin D;
q) a PTV in LIPC, PDE3B, and LPL that correlates with increased levels of apolipoprotein A or HDL;
r) a PTV in FUT2 or RAP1GAP that correlates with increased levels of alkaline phosphatase;
s) a PTV in ABCG8 that correlates with increased levels of cholesterol;
t) a PTV in RNF186 or SLC22A2 that correlates with increased levels of creatinine;
u) a PTV in SLCO1B1 or UGT1A10 that correlates with increased levels of bilirubin;
v) a PTV in RORC, SIGLEC1, or UPB1 that correlates with increased levels of gamma glutamyltransferase;
w) a PTV in ANGPTL8 that correlates with increased levels of HDL;
x) a PTV in SLC22A1 and SLC22A2 that correlates with increased levels of lipoprotein A;
y) a PTV in COL4A4 that correlates with increased levels of microalbumin in serum and urine and an increased urine albumin to creatinine ratio;
z) a PTV in HSPA6 that correlates with increased levels of total protein and non-albumin protein;
aa) a PTV in APOA5 that correlates with increased levels of triglycerides;
bb) a PTV in PYGM and SLC22A11 that correlates with increased levels of urate; and
cc) a PTV in APOB, DHCR7, FLG, and NPFFR2 that correlates with increased levels of vitamin D.
[0011] In certain embodiments, the individual has at least one HLA allele selected from the group consisting of: a) HLA-B*08:01, HLA-DRB1*03:01, or HLA-DRB1*07:01 (OR = 0.796) that correlates with abnormal rheumatoid factor levels above 16 U/mL; b) a HLA-DR3 haplotype that correlates with a predisposition for developing lupus, multiple sclerosis, or type 1 diabetes; and c) a HLA-DRB1*07:01 allelotype that correlates with a predisposition for developing an asparaginase allergy.
[0012] In certain embodiments, at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement.
[0013] In certain embodiments, the clinical biomarker is a serum or urine biomarker.
[0014] In certain embodiments, the clinical biomarker is selected from the group consisting of alanine aminotransferase, albumin, alkaline phosphatase, apolipoprotein A, apolipoprotein B, aspartate aminotransferase, calcium, cholesterol, c-reactive protein, creatinine, cystatin-C, direct bilirubin, gamma glutamyltransferase, glucose, glycated hemoglobin (HbA1c), HDL cholesterol, insulin-like growth factor 1 (IGF-1), low-density lipoprotein (LDL) direct, lipoprotein-A, phosphate, sex hormone binding globulin (SHBG), testosterone, total bilirubin, total protein, triglycerides, urate, urea, vitamin D, creatinine in urine, estimated glomerular filtration rate (eGFR), microalbumin in urine, potassium in urine, sodium in urine, non-albumin protein, urine albumin to creatinine ratio higher than 30 mg/g, microalbumin higher than 40 mg/L, rheumatoid factor higher than 16 IU/ml, and estradiol higher than 212 pmol/L.
[0015] In certain embodiments, the method further comprises measuring the clinical biomarker in the individual.
[0016] In certain embodiments, at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and the disease or the medically relevant trait including, for example, without limitation, type 2 diabetes, primary biliary cirrhosis, rheumatoid arthritis, schizophrenia, lupus, ulcerative colitis, sunburn, Crohn’s disease, allergy/eczema, hypothyroidism, age of menarche, age of menopause, systolic blood pressure, basophil percentage, eosinophil percentage, hematocrit, hemoglobin concentration, reticulocyte count, reticulocyte percentage, immature reticulocyte fraction, lymphocyte count, lymphocyte percentage, mean corpuscular hemoglobin (MCH), MCH concentration, mean corpuscular volume (MCV), mean platelet thrombocyte volume (MPV), mean reticulocyte volume, mean sphered cell volume, monocyte count, monocyte percentage, neutrophil count, neutrophil percentage, platelet count, platelet crit, platelet distribution width (PDW), red blood cell erythrocyte (RBC) count, red blood cell erythrocyte distribution width (RDW), reticulocyte count, reticulocyte percentage, white blood cell leukocyte (WBC) count, respiratory disease, amyotrophic lateral sclerosis (ALS), Alzheimer’s disease, age related macular degeneration (AMD), any stroke, any ischemic stroke, large artery stroke, cardioembolic stroke, small vessel stroke, age of menarche, prostate cancer, number of cancers, number of operations, average weekly beer/cider intake, average weekly spirits intake, body size at age 10, height size at age 10, father’s age at death, mother’s age at death, deep venous thrombosis (DVT), gastric reflux, gall stones, kidney stone, hyperthyroidism, osteoporosis, uterine fibroids, hay fever/allergic rhinitis, enlarged prostate, gout, hiatus hernia, sitting height, birth weight, mother’s Alzheimer’s disease, neuroticism, best measure of forced expiratory volume in 1 second FEV1 (FEV 1s best), best measure of forced vital capacity (FVC best), predicted percentage of forced expiratory volume in 1 second (FEV 1s predicted), nerves anxiety tension depression, body mass index (BMI), pulse wave arterial stiffness, whole body fat mass, whole body fat free mass, whole body water mass, age of first facial hair, hair balding pattern 2, hair balding pattern 3, hair balding pattern 4, diabetes, cancer, fracture bones, oral contraceptives, hormone replacement therapy, bilateral oophorectomy, forced vital capacity (FVC), forced expiratory volume in 1 second (FEV 1s), peak expiratory flow (PEF), hysterectomy, pregnancy terminations, age of primiparous, heel bone mineral density (BMD) left, heel BMD right, pulse rate, pulse wave peak to peak, hand grip strength left, leg pain on walking, hand grip strength right, tinnitus, waist circumference, hip circumference, standing height, maximum workload during fitness test, maximum heart rate during fitness test, qualifications a/as levels, diabetes related eye disease, cataract, painful gums, bleeding gums, dentures, vascular problems, angina, high blood pressure, fractured bones, blood clot in the leg, emphysema/bronchitis, asthma, hay fever allergic rhinitis eczema, cholesterol lowering medication, hormone replacement therapy, estrogen receptor (ER) negative breast cancer, ER positive breast cancer, all breast cancer, coronary artery disease, asthma, insomnia, sleep hours, anorexia, autism, celiac disease, eGFR decline, microalbuminuria, kidney disease.
[0017] In certain embodiments, the method further comprises adjusting at least one PRS for covariates including, for example, without limitation, age, sex, socioeconomic status, ethnicity, and anthropometric measurements.
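One simple way to adjust a PRS for a covariate is to regress the score on that covariate and keep the residual. The sketch below assumes a single numeric covariate (age) and ordinary least squares; the function name `residualize` and the toy values are hypothetical, and a real analysis would typically adjust for several covariates jointly.

```python
def residualize(values, covariate):
    """Residuals of `values` after simple linear regression on one covariate.

    Assumes the covariate is not constant (otherwise the slope is undefined).
    """
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_c
    return [v - (intercept + slope * c) for v, c in zip(values, covariate)]


# Toy data: raw scores and ages for five hypothetical individuals.
raw_prs = [0.10, 0.35, -0.20, 0.50, 0.05]
ages = [42, 55, 38, 61, 47]
adjusted_prs = residualize(raw_prs, ages)
```

The adjusted scores have mean zero and no remaining linear association with the chosen covariate.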
[0018] In certain embodiments, the disease is myocardial infarction, wherein the method comprises calculating at least one polygenic risk score for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement selected from tyrosine, glycoprotein acetyls, CH2 in fatty acids, arachidonic acid, pulse, sleep, vitamin D, urate, triglycerides, total protein, sodium in urine, phosphate, lipoprotein A, high density lipoprotein cholesterol, low density lipoprotein cholesterol, total cholesterol, ApoA, ApoB, Albumin, HbA1c, hemoglobin, diastolic blood pressure, CysC, proinsulin, glycoprotein, omega 6 fatty acid, macrophage colony stimulating factor, cutaneous T-cell-attracting chemokine, waist to hip ratio, fat mass, total protein, sleep hours, urate, sodium in urine, gamma glutamyltransferase, lymphocyte count, hand grip strength, forced vital capacity, fasting insulin (sex specific); and the disease, diabetes. In some embodiments, the method further comprises measuring the clinical biomarker in the individual.
[0019] In certain embodiments, the disease is diabetes, wherein the method comprises calculating at least one polygenic risk score for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement selected from waist to hip ratio, fat mass, waist circumference, pulse, sex hormone binding globulin, IGF1, high density lipoprotein cholesterol, lipoprotein A, ApoA, alanine aminotransferase, hip circumference, HbA1c, glucose, diastolic blood pressure, BMI, platelet derived growth factor, VEGF (vascular endothelial growth factor), total 20:0 long chain fatty acids, albumin, water intake, vitamin D, total bilirubin, testosterone, direct bilirubin, lymphocyte count, C-reactive protein, left hand grip strength, forced vital capacity, forced expiratory volume in 1 second, and total body fat, and various diabetes polygenic scores with and without adjustment for BMI. In some embodiments, the method further comprises measuring the clinical biomarker in the individual. In some embodiments, the method further comprises adjusting at least one PRS for Townsend deprivation index/socioeconomic status.
[0020] In certain embodiments, a Spearman correlation is used to generate the correlation data.
[0021] In certain embodiments, the correlation data is selected from Tables 4-10 and 13.
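For reference, a Spearman correlation such as the one mentioned in paragraph [0020] is the Pearson correlation computed on ranks, with tied values given the average of their ranks. The following is a small self-contained sketch; the helper names are hypothetical, and it assumes neither input is constant.

```python
def average_ranks(values):
    """Rank values from 1 to n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are zero-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because it depends only on ranks, the statistic is unchanged by any monotone transformation of either variable, which makes it robust to skewed biomarker distributions.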
[0022] In certain embodiments, at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement, and at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and the disease or the medically relevant trait.
[0023] In certain embodiments, the method further comprises: a) generating a predictive model using one or more algorithms, wherein said predictive model is based on at least one PRS for the genetic association with a size effect on a clinical biomarker measurement and at least one PRS for the genetic association with the disease or the medically relevant trait; and b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS. In certain embodiments, one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm. For example, a machine learning algorithm may be used including without limitation a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
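The aggregation step in paragraph [0023] can be illustrated with a logistic regression that weights a disease PRS and a biomarker PRS into one combined risk score. This is only a sketch under stated assumptions: the toy data are invented, the function names `fit_logistic` and `combined_risk` are hypothetical, and plain batch gradient descent stands in for whatever fitting procedure an actual implementation would use. It shows how the scores are combined, not that the combination outperforms each separate PRS.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit an intercept plus one weight per PRS by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)  # weights[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def combined_risk(weights, prs_values):
    """Predicted probability of disease from several PRS values."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], prs_values))
    return sigmoid(z)


# Invented, standardized (disease PRS, biomarker PRS) pairs and disease status.
training_prs = [(-1.2, -0.8), (-0.5, 0.3), (0.1, -0.4), (0.4, 0.9),
                (1.0, 0.2), (1.5, 1.1), (-0.9, 0.6), (0.7, -1.0)]
disease_status = [0, 0, 0, 1, 1, 1, 0, 0]

model = fit_logistic(training_prs, disease_status)
high_risk = combined_risk(model, (1.5, 1.0))
low_risk = combined_risk(model, (-1.5, -1.0))
```

The same interface could be backed by a random forest or neural network, as the paragraph above allows; the regression form is used here only because its weights are easy to inspect.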
[0024] In certain embodiments, the method further comprises treating the individual for the disease if the polygenic risk score indicates that the individual has the disease.
[0025] In certain embodiments, genotyping comprises sequencing at least part of a genome of one or more cells from the individual. In some embodiments, genotyping comprises sequencing the whole genome of the individual.
[0026] In another aspect, a database is provided, the database comprising correlation data between genetic variants and clinical biomarkers, diseases, and medically relevant traits, wherein the correlation data is selected from Tables 4-10 and 13.
[0027] In another aspect, a computer implemented method for predicting the risk of an individual developing a disease or medically relevant phenotypic trait is provided, the computer performing steps comprising: a) receiving genome sequencing data for an individual; b) identifying variant alleles present in the individual from the genome sequencing data, wherein the individual has a plurality of variant alleles selected from Tables 5-10 and 13; c) calculating at least one polygenic risk score using a database, as described herein, based on the variant alleles present in the individual, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait; and d) displaying information regarding the risk of the individual developing the disease or the medically relevant trait.
[0028] In certain embodiments, the computer implemented method further comprises: a) generating a predictive model using one or more algorithms, wherein the predictive model is based on at least one PRS for a genetic association with a size effect on a clinical biomarker measurement and at least one PRS for a genetic association with the disease or the medically relevant trait; and b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS. In certain embodiments, one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm. For example, a machine learning algorithm may be used including without limitation a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
[0029] In certain embodiments, the computer implemented method further comprises storing the information regarding the risk of the individual developing the disease or the medically relevant phenotypic trait in a database.
[0030] In another aspect, a system for predicting the risk of an individual developing a disease or medically relevant trait using a computer implemented method described herein is provided, the system comprising: a) a storage component for storing data, wherein the storage component has instructions for predicting the risk of an individual developing a disease or medically relevant trait based on analysis of the genome sequencing data stored therein; b) a computer processor for processing the genome sequencing data using one or more algorithms, wherein the computer processor is coupled to the storage component and configured to execute the instructions stored in the storage component in order to receive the inputted genome sequencing data and analyze the data according to the computer implemented method described herein; and c) a display component for displaying the information regarding the risk of the individual developing the disease or the medically relevant trait.
[0031] In another aspect, a non-transitory computer-readable medium is provided, the non-transitory computer-readable medium comprising program instructions that, when executed by a processor in a computer, cause the processor to perform a computer implemented method for predicting the risk of an individual developing a disease or medically relevant phenotypic trait, as described herein. In another embodiment, a kit comprising the non-transitory computer-readable medium and instructions for predicting the risk of an individual developing a disease or medically relevant trait is provided.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] The invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. It is emphasized that, according to common practice, the various features of the drawings are not to-scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures.
[0033] FIG. 1 shows a schematic overview of the study. We prepared a dataset of 38 serum and urine lab phenotypes from 358,072 individuals in the UK Biobank study. From these data, we analyzed their genetic basis, assessed their relationship to disease outcomes and medically relevant phenotypes, and generated predictive models from genome data.
[0034] FIGS. 2A-2E show the genetics of lab phenotypes. (FIG. 2A) Summary of large-effect protein-truncating (abs(Beta) >= 0.25) and protein-altering variants (abs(Beta) >= 0.75). All variants are directly genotyped on the genotyping array, and effect betas are the number of standard deviations changed in the phenotype per alternative allele. Each set of variants is separated by trait, but overlap of individual variants in the same gene between traits is present; a more detailed table of individual hits and cascade plots are in Tables 5-7 and FIGS. 11-13. (FIG. 2B) (x-axis) Chromosome and (y-axis) -log10(P-value) of association between a burden of rare CNVs overlapping a gene and the tested lab phenotype. Highlighted genes have maximum -log10(P-value) of the association between rare burden CNVs and tested phenotype. Only genes with P-value < 0.01/25000 are shown. (FIG. 2C) Fraction of heritability per chromosome across the 38 studied phenotypes. We obtained the chromosomal heritability by summing local heritability at loci within the chromosome. For each chromosome, we plot the boxplots of estimates at the 38 considered phenotypes. Outlier lab phenotypes on each chromosome for heritability per SNP are labelled. (FIG. 2D) (x-axis) Polygenic heritability estimate for 38 lab phenotypes (y-axis) using LD-score regression. Estimate and standard error intervals shown. (FIG. 2E) Enrichment of traits in different cell types. Definitions of tissue type groups are taken from Finucane et al. (Nat Genet. (2018) 50(4):621-629). Enrichments for all traits in each tissue are shown; the vast majority of enrichment across traits is in the liver and kidney, and the exceptions are highlighted.
[0035] FIGS. 3A-3B show a correlation of genetic effects and causal inference. (FIG. 3A) Correlation of genetic effects plot between the 38 lab phenotypes and 123 complex traits using LD-score regression. Cells with p < 0.001 are highlighted, and traits (n=26) with no associations are not shown. (FIG. 3B) MR-Egger and LCV predict causal links between lab measurements (blue nodes) and selected complex traits (red nodes). Association arrows are drawn based on MR-Egger (red), LCV (blue), or both (black), and multiple arrows indicate support from multiple studies. MR-Egger and LCV were jointly adjusted for FDR 10% cutoff across all tests. Triangles are used for binary and circles for continuous summary statistics. Edge width is proportional to the absolute causal effect size, estimated by MR-Egger. A complete listing of discovered associations is provided as a table (Table 15).
[0036] FIGS. 4A-4D show lab phenotype prediction from genetic data within and across populations. (FIG. 4A) Increments in predictive performance with genetic data (change in correlation, R, or ROC-AUC) for White British (x-axis) and other ethnic groups (y-axis) are shown across the 38 lab phenotypes. (FIGS. 4B-4D) Predicted vs. observed phenotypes comparison for individuals in the test sets for Lipoprotein A (FIG. 4B), LDL (FIG. 4C), and alanine aminotransferase (FIG. 4D). The diagonal line indicates x=y whereas the gray dashed line shows the linear regression fit between observed and predicted phenotypes.
[0037] FIGS. 5A-5D show a Polygenic Risk Score Phenome Wide Association Study (PRS-PheWAS). (FIG. 5A) (x-axis) Biomarker polygenic risk scores at top 0.1% (top01) and top 1% (top1) and their association to different diseases in UK Biobank, represented as the odds ratio of the disease in this group relative to the middle 40-60% of individuals. (FIGS. 5B-5C) (x-axis) Quantiles of polygenic risk score, spaced to linearly represent the mean of the corresponding bin of scores. (y-axis) Prevalence of disease (binary outcome) or average measurement (continuous outcome) within each quantile bin of the polygenic risk score. Error bars represent the standard error around each measurement. (FIG. 5D) Improvement in prediction accuracy of traits when including biomarker polygenic risk scores. Each trait was tested against a model with just the covariates (for liver fat, including alcohol and interactions; for type 2 diabetes, including BMI, WHR, and interactions; for all traits, including age, sex, Townsend Deprivation Index, principal components, and interactions); with the covariates and the polygenic score for the trait (trained using SNPnet, except for type 2 diabetes, from Mahajan et al. 2018), termed “with trait PRS”; with the covariates and all the polygenic scores for biomarkers, termed “with biomarker PRSs”; and with covariates and all PRSs, termed “with all PRSs.” In each case, the trait PRS only was outperformed by including all the biomarker PRSs as well. See F-test results and regression terms in Tables 18-21.
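The odds ratio described for FIG. 5A (disease odds in the top fraction of a PRS relative to the middle 40-60% of individuals) can be sketched as follows. The function names and the toy cohort are hypothetical; the sketch assumes both groups contain at least one case and one non-case, and it omits the confidence intervals and covariate adjustment a real analysis would include.

```python
def odds(cases, total):
    """Odds of disease: cases divided by non-cases (non-cases must be > 0)."""
    return cases / (total - cases)


def top_vs_middle_odds_ratio(scores, has_disease, top_fraction=0.01):
    """Odds ratio of disease in the top PRS fraction vs the 40-60% band."""
    ranked = sorted(zip(scores, has_disease), key=lambda pair: pair[0])
    n = len(ranked)
    middle = ranked[(4 * n) // 10:(6 * n) // 10]
    top_size = max(1, round(top_fraction * n))
    top = ranked[n - top_size:]
    top_cases = sum(d for _, d in top)
    middle_cases = sum(d for _, d in middle)
    return odds(top_cases, len(top)) / odds(middle_cases, len(middle))


# Toy cohort of 1000 individuals with scores 0..999 (invented values):
# 20 cases among the 200 middle individuals, 5 cases among the top 10.
toy_scores = list(range(1000))
toy_disease = [0] * 1000
for i in range(400, 420):
    toy_disease[i] = 1
for i in range(990, 995):
    toy_disease[i] = 1

toy_or = top_vs_middle_odds_ratio(toy_scores, toy_disease, top_fraction=0.01)
```

Using the middle band rather than the whole cohort as the reference keeps the comparison group close to average genetic risk.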
[0038] FIG. 6 shows the proportion of variance explained by all covariates across the 37 raw laboratory phenotypes. (x-axis) Regression estimate of the proportion of variance explained by all 127 covariates in a linear model for 37 raw laboratory phenotypes including Fasting glucose defined if fasting time between 8 and 24 hours according to Data Field 74 in UK Biobank Data Showcase (y-axis). Blue bar plots indicate estimate before medication adjustment and red bar plots indicate estimate after medication adjustment.
[0039] FIG. 7A shows normalized regression coefficients for the 37 raw laboratory phenotypes across the covariates. (x-axis) Normalized regression coefficient for 23 covariates in a linear model for the 37 raw laboratory phenotypes including Fasting glucose defined as fasting time between 8 and 24 hours according to Data Field 74 in UK Biobank Data Showcase (y-axis). Bar plots outlined in dark gray indicate estimate before medication adjustment and bar plots outlined in light gray indicate estimate after medication adjustment.
[0040] FIG. 7B shows phenotype distributions of all biomarkers by age and sex. (x-axis) Age of individuals within a pentacontile were averaged (y-axis) The corresponding average value +/- 1 SD of each biomarker measurement for all individuals with available data in the study. Color indicates the reported sex of the individuals (orange = male, turquoise = female).
[0041] FIG. 7C shows residual distributions of all biomarkers by age and sex. (x-axis) Age of individuals within a pentacontile were averaged (y-axis) The corresponding average value +/- 1 SD of each biomarker residual for all individuals with available data in the study, after adjusting for the 127 covariates and intercept. Color indicates the reported sex of the individuals (orange = male, turquoise = female).
[0042] FIG. 8 shows the phenotype correlation among the 38 lab phenotypes. -1 (red) to 1 (blue) correlation of phenotypes (cell size indicates correlation). Only cells with p < 0.001 are shown. Results are consistent with previous work and capture known associations between both testosterone and SHBG with uric acid (urate) levels 2.
[0043] FIG. 9 shows a correlogram of different diabetes- and diabetes-related traits. The similarity of type 2 diabetes (following Eastwood et al), high confidence diabetes (examining all available timepoints for an individual and using self-report and ICD codes), and prescription of metformin or any oral antidiabetic are compared to the biomarker measurements of HbA1c and glucose. HbA1c was adjusted for statins (see Methods) and residualized (see Methods), while glucose was subset to individuals with a fasting time between 8 and 24 hours (see Methods) to ensure effects were not driven by fasting. Diagnosed diabetes was defined by the UK Biobank during the nurse interview, and family history was defined as having at least one self-reported mother, father, or sibling (non-adopted) with diabetes. Table of correlations presented below (Table 3).
[0044] FIGS. 10A-10B show comparisons of estimated effect sizes between UK Biobank and previous GWAS. (x-axis) UK Biobank estimated effect size (y-axis) Comparative study estimated effect size. All variants associated p < 1e-6 in either study are shown. FIG. 10A shows plots for LDL vs. GLGC, HbA1c vs. MAGIC, and triglycerides vs. GLGC. FIG. 10B shows plots for urate vs. GUGC and alanine aminotransferase vs. Biobank Japan.
[0045] FIGS. 11A-11B show cascade plots for predicted protein-truncating variants across lab phenotypes. (x-axis) Minor allele frequency of genetic variant associated to phenotype (p < 1e-7) and (y-axis) BETA univariate regression coefficient estimate. Orange and labelled data points include genes with PTVs whose estimated effect size (BETA) is greater than or equal to 0.1 or less than or equal to -0.1 standard deviation (SD). Two phenotypes (creatinine in urine and estradiol) did not have PTV associations with p < 1e-7 and were excluded from the plot.
[0046] FIGS. 12A-12B show cascade plots for predicted protein-altering variants across lab phenotypes. (x-axis) Minor allele frequency of genetic variant associated to phenotype (p < 1e-7) and (y-axis) BETA univariate regression coefficient estimate. Light gray and labelled data points include genes with protein-altering variants whose estimated effect size (BETA) is greater than or equal to 0.1 or less than or equal to -0.1.
[0047] FIGS. 13A-13C show cascade plots for non-coding variants across lab phenotypes. (x-axis) Minor allele frequency of non-coding variants characterized on the imputed 1000 Genomes Phase I variant associated to phenotype (p < 5e-8) and (y-axis) BETA univariate regression coefficient estimate. Orange and labelled data points include non-coding variants whose estimated effect size (BETA) is an outlier, i.e., the absolute value of the estimated effect size deviates from the standard deviation range estimated from a linear fit between log minor allele frequency and absolute value of estimated effect size (outlier, see methods for more details). The gene symbols are shown for splicing variants.
[0048] FIG. 14 shows posterior effect sizes, probabilities of Bayesian Model Averaging model inclusion, and linkage disequilibrium for HLA alleles on 29 different biomarker phenotypes. The y-axis indicates phenotype, and the x-axis indicates allele. Above - the size of each dot corresponds to the posterior probability that the HLA allele is included as a variable across all plausible models as deemed by BIC measures from BMA, and the color of each dot corresponds to the size and direction of the effect of the allele on the phenotype as found by PLINK. Only the top 10 significant PLINK hits per phenotype were considered for the analysis. Below - LD measures (as determined and visualized by the gaston package) across HLA allelotypes; the measures displayed are R² values.
[0049] FIG. 15A shows CNV association analysis across the 38 biomarkers. X-axis Genomic coordinate and -log10(P) for single CNV association. CNV and biomarker association are highlighted when p < 0.05/10000 with cytogenetic band labelled. FIG. 15B shows PheWAS of rare CNVs affecting HNF1B. X-axis log-odds ratio and -log10(P) for each trait having association with HNF1B CNVs at p < 1e-4. Associations for all traits run as in previous analysis 3.
[0050] FIG. 16 shows cumulative heritability. x-axis SNP ranked by heritability per SNP (millions) and its corresponding cumulative heritability (y-axis) across the 38 lab phenotypes. Lab phenotype label shown in the title of the subplots.
[0051] FIG. 17A shows enrichment of traits in different cell types. Definitions of tissue type groups are taken from Finucane et al. (Nat. Genet. (2018) 50:621-629). Enrichments for all traits in each tissue are shown; the vast majority of enrichment across traits is in the liver and kidney. FIG. 17B shows grouped cell type heritability enrichments across ten tissues. (x-axis, top) Fold enrichment with SE for each lab phenotype across 10 tissues (y-axis). (x-axis, bottom) -log10(P) value of enrichment for each lab phenotype across 10 tissues (y-axis).
[0052] FIG. 18 shows individual annotations for pancreas, liver, and kidney ChIP-seq experiments. -log10(P) (x-axis) for cell type heritability enrichment across pancreas, liver, and kidney ChIP-seq experiments (y-axis).
[0053] FIG. 19 shows phenome-wide associations across 25 protein-truncating variants and laboratory measurements and 24 disease outcomes in the UK Biobank. Targeted phenome-wide association analysis was performed for PTVs outside of the human MHC region that showed significant genome-wide associations (p < 1e-7) with at least one of the laboratory measurement traits. The log odds ratios of the significant PheWAS associations (p < 1e-5) are shown across phenotypes (x-axis) and PTVs (y-axis). The 46 significant (p < 1e-5) associations across 25 variants and 24 disease outcomes are shown as well as the associations with laboratory measurements. The color of the phenotype names indicates binary disease outcomes or family history (red) or laboratory measurements (purple). The color for log odds ratio or beta = 0.2 is used for the associations with > 0.2 log odds ratio or beta.
[0054] FIG. 20 shows phenome-wide associations across 35 LD-independent protein-altering variants and 28 disease outcomes in the UK Biobank. Targeted phenome-wide association analysis was performed for protein-altering variants outside of the human MHC region that showed significant genome-wide associations (p < 1e-7) with at least one of the laboratory measurement traits. The log odds ratios of the significant PheWAS associations (p < 1e-5) are shown across phenotypes (x-axis) and protein-altering variants (y-axis). Out of 172 significant (p < 1e-5) associations across 80 LD-independent protein-altering variants and 75 disease outcomes, the 35 variants and 28 disease outcomes with the maximal number of significant associations are chosen for visualization. The associations for those variant-phenotype pairs are shown as well as the associations across laboratory measurement phenotypes. The color of the phenotype names indicates binary disease outcomes or family history (red) or laboratory measurements (purple). The color for log odds ratio or beta = 0.2 is used for the associations with > 0.2 log odds ratio or beta.
[0055] FIG. 21 shows correlation of genetic effects between biomarkers. -1 (red) to 1 (blue) scale of correlation of genetic effects estimated using LD-score regression.
[0056] FIG. 22 shows correlation of genetic effects between biomarkers with normalization (“INT”), and with lipid-lowering therapy adjustment (“adjstatins”) and without. -1 (red) to 1 (blue) scale of correlation of genetic effects estimated using LD-score regression.
[0057] FIG. 23 shows correlation of genetic effects between normalized (“INT”) lab phenotypes with lipid-lowering therapy adjustment (“adjstatins”) and without. -1 (red) to 1 (blue) scale of correlation of genetic effects estimated using LD-score regression.
[0058] FIG. 24A shows “Lake” plots of GWAS p-value and the magnitude of effect size estimates from snpnet for Lipoprotein A. (x-axis) Genomic coordinates for (top panel) -log10(P) from GWAS and (bottom panel) absolute value of estimated effect size using snpnet (abs(BETA) from snpnet). FIG. 24B shows “Lake” plots of GWAS p-value and the magnitude of effect size estimates from snpnet for LDL. (x-axis) Genomic coordinates for (top panel) -log10(P) from GWAS and (bottom panel) absolute value of estimated effect size using snpnet (abs(BETA) from snpnet). FIG. 24C shows “Lake” plots of GWAS p-value and the magnitude of effect size estimates from snpnet for Alanine Aminotransferase. (x-axis) Genomic coordinates for (top panel) -log10(P) from GWAS and (bottom panel) absolute value of estimated effect size using snpnet (abs(BETA) from snpnet).
[0059] FIG. 25 shows lab phenotype prediction from genetic data within and across populations. The predictive performance with both genetic data and covariates (correlation, R) for White British (x-axis) and other ethnic groups (y-axis) are shown across the 38 lab phenotypes.
[0060] FIG. 26 shows an evaluation of the prevalence of type 2 diabetes based on precision polygenic risk scores for clinical laboratory tests of serum and urine, including lipids, hormones, and measures of kidney function.
DETAILED DESCRIPTION OF EMBODIMENTS
[0061] Methods, systems, and devices, including computer programs encoded on a computer storage medium are provided for predicting the risk of an individual developing a polygenic disease or medically relevant trait. In particular, methods are provided for using genetic information based on the detection of multiple genetic variants in an individual for diagnosing polygenic diseases, correlating phenotypic characteristics with genetic data, and predicting the risk of developing a disease or medically relevant condition by analyzing polygenic contributions to the disease and underlying changes in physical traits and clinically measured biomarkers.
[0062] Before the present methods, systems, and devices are described, it is to be understood that this invention is not limited to particular methods or compositions described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
[0063] Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
[0064] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, some potential and preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. It is understood that the present disclosure supersedes any disclosure of an incorporated publication to the extent there is a contradiction.
[0065] As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present invention. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.
[0066] It must be noted that as used herein and in the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to "a cell" includes a plurality of such cells and reference to "the nucleic acid" includes reference to one or more nucleic acids and equivalents thereof, e.g. polynucleotides, known to those skilled in the art, and so forth.
[0067] The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
[0068] Biological sample. The term “sample” with respect to an individual encompasses blood, urine, and other liquid samples of biological origin, solid tissue samples such as a biopsy specimen or tissue cultures or cells derived or isolated therefrom and the progeny thereof. The definition also includes samples that have been manipulated in any way after their procurement, such as by treatment with reagents, washing, or enrichment for certain cell populations, such as cancer cells. The definition also includes samples that have been enriched for particular types of molecules, e.g., nucleic acids, polypeptides, etc.
[0069] DNA samples, e.g. samples useful in genotyping, are readily obtained from any nucleated cells of an individual, e.g. hair follicles, cheek swabs, white blood cells, etc., as known in the art.
[0070] The term “biological sample” encompasses a clinical sample. The types of “biological samples” include, but are not limited to: biological fluids, tissue samples, tissue obtained by surgical resection, tissue obtained by biopsy, cells in culture, cell supernatants, cell lysates, organs, bone marrow, blood, plasma, serum, saliva, urine, fine needle aspirate, lymph node aspirate, cystic aspirate, a paracentesis sample, a thoracentesis sample, and the like.
[0071] Obtaining and assaying a sample. The term “assaying” is used herein to include the physical steps of manipulating a biological sample to generate data related to the sample. As will be readily understood by one of ordinary skill in the art, a biological sample must be “obtained” prior to assaying the sample. Thus, the term “assaying” implies that the sample has been obtained. The terms “obtained” or “obtaining” as used herein encompass the act of receiving an extracted or isolated biological sample. For example, a testing facility can “obtain” a biological sample in the mail (or via delivery, etc.) prior to assaying the sample. In some such cases, the biological sample was “extracted” or “isolated” from an individual by another party prior to mailing (i.e., delivery, transfer, etc.), and then “obtained” by the testing facility upon arrival of the sample. Thus, a testing facility can obtain the sample and then assay the sample, thereby producing data related to the sample.
[0072] The terms “obtained” or “obtaining” as used herein can also include the physical extraction or isolation of a biological sample from a subject. Accordingly, a biological sample can be isolated from a subject (and thus “obtained”) by the same person or same entity that subsequently assays the sample. When a biological sample is “extracted” or “isolated” from a first party or entity and then transferred (e.g., delivered, mailed, etc.) to a second party, the sample was “obtained” by the first party (and also “isolated” by the first party), and then subsequently “obtained” (but not “isolated”) by the second party. Accordingly, in some embodiments, the step of obtaining does not comprise the step of isolating a biological sample.
[0073] In some embodiments, the step of obtaining comprises the step of isolating a biological sample (e.g., a pre-treatment biological sample, a post-treatment biological sample, etc.). Methods and protocols for isolating various biological samples (e.g., a blood sample, a serum sample, a plasma sample, a urine sample, a biopsy sample, an aspirate, etc.) will be known to one of ordinary skill in the art and any convenient method may be used to isolate a biological sample.
[0074] The terms “determining”, “measuring”, “evaluating”, “assessing,” “assaying,” and “analyzing” are used interchangeably herein to refer to any form of measurement, and include determining if an element is present or not. These terms include both quantitative and/or qualitative determinations. Assaying may be relative or absolute. For example, “assaying” can be determining whether the level of a clinical biomarker is “less than” or “greater than or equal to” a particular threshold (the threshold can be pre-determined or can be determined by assaying a control sample). On the other hand, “assaying to determine the level” can mean determining a quantitative value (using any convenient metric) that represents the level of a clinical biomarker.
[0075] The terms "treatment", "treating", "treat" and the like are used herein to generally refer to obtaining a desired pharmacologic and/or physiologic effect. The effect can be prophylactic in terms of completely or partially preventing a disease or symptom(s) thereof and/or may be therapeutic in terms of a partial or complete stabilization or cure for a disease and/or adverse effect attributable to the disease. The term "treatment" encompasses any treatment of a disease in a mammal, particularly a human, and includes: (a) preventing the disease and/or symptom(s) from occurring in a subject who may be predisposed to the disease or symptom but has not yet been diagnosed as having it; (b) inhibiting the disease and/or symptom(s), i.e., arresting their development; or (c) relieving the disease symptom(s), i.e., causing regression of the disease and/or symptom(s). Those in need of treatment include those already afflicted (e.g., those with cancer, those with an infection, etc.) as well as those in which prevention is desired (e.g., those with increased susceptibility to cancer, those suspected of having cancer, etc.).
[0076] A therapeutic treatment is one in which the subject is afflicted prior to administration and a prophylactic treatment is one in which the subject is not afflicted prior to administration. In some embodiments, the subject has an increased likelihood of becoming afflicted or is suspected of being afflicted prior to treatment. In some embodiments, the subject is suspected of having an increased likelihood of becoming afflicted.
[0077] "Substantially purified" generally refers to isolation of a substance (e.g., compound, molecule, agent) such that the substance comprises the majority percent of the sample in which it resides. Typically in a sample, a substantially purified component comprises 50%, preferably 80%-85%, more preferably 90-95% of the sample.
[0078] By "isolated" is meant an indicated cell, population of cells, or molecule is separate and discrete from a whole organism or is present in the substantial absence of other cells or biological macromolecules of the same type.
[0079] The terms "subject," "individual" or "patient" are used interchangeably herein and refer to a vertebrate, preferably a mammal. By "vertebrate" is meant any member of the subphylum chordata, including, without limitation, humans and other primates, including non-human primates such as chimpanzees and other apes and monkey species; farm animals such as cattle, sheep, pigs, goats and horses; domestic mammals such as dogs and cats; laboratory animals including rodents such as mice, rats and guinea pigs; birds, including domestic, wild and game birds such as chickens, turkeys and other gallinaceous birds, ducks, geese, and the like. The term does not denote a particular age. Thus, both adult and newborn individuals are intended to be covered.
[0080] As used herein, the term "probe" refers to a polynucleotide that contains a nucleic acid sequence complementary to a nucleic acid sequence present in the target nucleic acid analyte (e.g., at location of a mutation). The polynucleotide regions of probes may be composed of DNA, and/or RNA, and/or synthetic nucleotide analogs. Probes may be labeled in order to detect the target sequence. Such a label may be present at the 5’ end, at the 3’ end, at both the 5’ and 3’ ends, and/or internally.
[0081] An "allele-specific probe" hybridizes to only one of the possible alleles of a gene (e.g., hybridizes at the location of a mutation) under suitably stringent hybridization conditions.
[0082] The term "primer" as used herein, refers to an oligonucleotide that hybridizes to the template strand of a nucleic acid and initiates synthesis of a nucleic acid strand complementary to the template strand when placed under conditions in which synthesis of a primer extension product is induced, i.e., in the presence of nucleotides and a polymerization-inducing agent such as a DNA or RNA polymerase and at suitable temperature, pH, metal concentration, and salt concentration. The primer is preferably single-stranded for maximum efficiency in amplification, but may alternatively be double-stranded. If double-stranded, the primer can first be treated to separate its strands before being used to prepare extension products. This denaturation step is typically effected by heat, but may alternatively be carried out using alkali, followed by neutralization. Thus, a "primer" is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3' end complementary to the template in the process of DNA or RNA synthesis. Typically, nucleic acids are amplified using at least one set of oligonucleotide primers comprising at least one forward primer and at least one reverse primer capable of hybridizing to regions of a nucleic acid flanking the portion of the nucleic acid to be amplified.
[0083] An "allele-specific primer" matches the sequence exactly of only one of the possible alleles of a gene (e.g., hybridizes at the location of a mutation), and amplifies only one specific allele if it is present in a nucleic acid amplification reaction.
[0084] The term "common genetic variant" or "common variant" refers to a genetic variant having a minor allele frequency (MAF) of greater than 5%.
[0085] The term "rare genetic variant" or "rare variant" refers to a genetic variant having a minor allele frequency (MAF) of less than or equal to 5%.
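The two definitions above can be illustrated with a short sketch. The function names are hypothetical and are not part of the disclosure; the sketch assumes a biallelic variant in diploid individuals, and uses the 5% minor allele frequency boundary stated above.

```python
def minor_allele_frequency(alt_allele_count: int, n_individuals: int) -> float:
    """Minor allele frequency of a biallelic variant in diploid individuals.

    Assumption for illustration: each individual carries two alleles, so the
    total number of observed alleles is 2 * n_individuals.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    total_alleles = 2 * n_individuals
    if not 0 <= alt_allele_count <= total_alleles:
        raise ValueError("alt_allele_count is out of range")
    freq = alt_allele_count / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(freq, 1.0 - freq)


def classify_variant(maf: float, threshold: float = 0.05) -> str:
    """Return 'common' if MAF > 5%, otherwise 'rare' (MAF <= 5%)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    return "common" if maf > threshold else "rare"
```

For example, a variant whose alternate allele is observed 30 times among 100 individuals has a MAF of 0.15 and would be labelled common, whereas a variant with a MAF of exactly 0.05 falls in the rare category under the definitions above.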
Methods
[0086] Methods are provided for determining whether an individual is likely to develop a polygenic disease or medically relevant trait. Most common diseases are caused by dysregulation of multiple genes. A predictive model is provided that estimates the risk of developing a disease or medically relevant condition by analyzing polygenic contributions to the disease and underlying changes in physical traits and clinically measured biomarkers. The method typically involves genotyping an individual to identify genetic variants present in the genome that may be associated with a polygenic disease or medically relevant phenotypic trait, and using a database to calculate a polygenic risk score, wherein the database comprises correlation data for associations between genetic variants and diseases or medically relevant traits based on genome-wide testing of a population for genetic variants associated with the disease or the medically relevant trait. The risk of an individual developing a disease or medically relevant trait is assessed from calculation of polygenic risk scores based on the genetic variants detected in the individual, as described further below (see Examples).
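The scoring step described above can be sketched as follows. This is a minimal illustration only: it assumes the widely used additive form, in which the score is the sum over variants of a per-variant effect size multiplied by the individual's allele dosage (0 to 2 copies). The particular model used is described in the Examples; the function name, variant identifiers, and weights below are hypothetical.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Additive polygenic score: sum of effect size * allele dosage.

    dosages: variant id -> number of effect alleles carried (0.0 to 2.0).
    weights: variant id -> effect size taken from a reference database
             (hypothetical values in this sketch).
    Variants that were not genotyped in the individual are skipped.
    """
    score = 0.0
    for variant_id, beta in weights.items():
        dosage = dosages.get(variant_id)
        if dosage is None:
            continue  # variant missing from the individual's genotype data
        if not 0.0 <= dosage <= 2.0:
            raise ValueError(f"dosage for {variant_id} must be in [0, 2]")
        score += beta * dosage
    return score
```

In practice such a raw score is usually interpreted relative to the distribution of scores in a reference population rather than as an absolute value.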
[0087] The methods described herein are useful for identifying individuals in need of close monitoring and treatment for a polygenic disease or medically relevant condition. High risk individuals may be monitored more frequently for the development of symptoms of a polygenic disease, for example, by testing for disease relevant clinical biomarkers and changes in health status with prompt attention to any disease-relevant changes in health. The methods are also of use for determining a therapeutic regimen or determining if a subject will benefit from treatment with a therapeutic regimen. For example, a subject identified as having a genetic predisposition to developing a polygenic disease or medically relevant condition may be treated in advance of developing symptoms of the disease to prevent physical damage that would be caused in the absence of treatment. Such treatment may include, for example, without limitation, prescribing drugs that delay or minimize the risk of development of a disease, adjusting diet and/or levels of physical exercise, or administering gene therapy (e.g., modulating expression or activity of a gene or introducing a functional gene to compensate for the presence of a mutant allele having deficient or abnormal activity). In addition, the methods described herein may be useful for confirming the diagnosis of a subject already showing symptoms of disease, who should be administered treatment for the disease.
Genotyping
[0088] Individuals may be genotyped to detect genetic variants by any convenient method known in the art. The genetic variants detected may include common or rare genetic variants, such as mutations (e.g., nucleotide replacements, insertions, or deletions) and alterations of copy number. In certain embodiments, the genetic variants are protein-truncating variants (PTVs), protein-altering variants, non-coding variants, single nucleotide variants, or human leukocyte antigen (HLA) allelotypes. In some embodiments, the genetic variants are associated with a known phenotype of interest (e.g., disease or condition).
[0089] For genetic testing, a biological sample containing nucleic acids is collected from an individual. The biological sample is typically saliva or cells from buccal swabbing, but can be any sample from bodily fluids, tissue or cells that contains genomic DNA or RNA of the individual. In certain embodiments, nucleic acids from the biological sample are isolated, purified, and/or amplified prior to analysis using methods well-known in the art. See, e.g., Green and Sambrook, Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press; 4th edition, 2012); and Current Protocols in Molecular Biology (Ausubel ed., John Wiley & Sons, 1995); herein incorporated by reference in their entireties.
[0090] Detection of a mutation can be direct or indirect. For example, the mutated gene itself can be detected directly. Alternatively, the mutation can be detected indirectly from cDNAs, amplified RNAs or DNAs, or proteins expressed by a mutated allele. Any method that detects a base change in a nucleic acid sample or an amino acid change in a protein can be used. For example, allele-specific probes that specifically hybridize to a nucleic acid containing the mutated sequence can be used to detect the mutation. A variety of nucleic acid hybridization formats are known to those skilled in the art. For example, common formats include sandwich assays and competition or displacement assays. Hybridization techniques are generally described in Hames and Higgins, "Nucleic Acid Hybridization, A Practical Approach," IRL Press (1985); Gall and Pardue, Proc. Natl. Acad. Sci. U.S.A., 63:378-383 (1969); and John et al., Nature, 223:582-587 (1969).
[0091] Sandwich assays are commercially useful hybridization assays for detecting or isolating nucleic acids. Such assays utilize a "capture" nucleic acid covalently immobilized to a solid support and a labeled "signal" nucleic acid in solution. The clinical sample will provide the target nucleic acid. The "capture" nucleic acid and "signal" nucleic acid probe hybridize with the target nucleic acid to form a "sandwich" hybridization complex.
[0092] In one embodiment, the allele-specific probe is a molecular beacon. Molecular beacons are hairpin shaped oligonucleotides with an internally quenched fluorophore. Molecular beacons typically comprise four parts: a loop of about 18-30 nucleotides, which is complementary to the target nucleic acid sequence; a stem formed by two oligonucleotide regions that are complementary to each other, each about 5 to 7 nucleotide residues in length, on either side of the loop; a fluorophore covalently attached to the 5' end of the molecular beacon, and a quencher covalently attached to the 3' end of the molecular beacon. When the beacon is in its closed hairpin conformation, the quencher resides in proximity to the fluorophore, which results in quenching of the fluorescent emission from the fluorophore. In the presence of a target nucleic acid having a region that is complementary to the strand in the molecular beacon loop, hybridization occurs resulting in the formation of a duplex between the target nucleic acid and the molecular beacon. Hybridization disrupts intramolecular interactions in the stem of the molecular beacon and causes the fluorophore and the quencher of the molecular beacon to separate resulting in a fluorescent signal from the fluorophore that indicates the presence of the target nucleic acid sequence.
[0093] For detection, the molecular beacon is designed to only emit fluorescence when bound to a specific allele of a gene. When the molecular beacon probe encounters a target sequence with as little as one non-complementary nucleotide, the molecular beacon preferentially stays in its natural hairpin state and no fluorescence is observed because the fluorophore remains quenched. See, e.g., Nguyen et al. (2011) Chemistry 17(46):13052-13058; Sato et al. (2011) Chemistry 17(41):11650-11656; Li et al. (2011) Biosens Bioelectron. 26(5):2317-2322; Guo et al. (2012) Anal. Bioanal. Chem. 402(10):3115-3125; Wang et al. (2009) Angew. Chem. Int. Ed. Engl. 48(5):856-870; and Li et al. (2008) Biochem. Biophys. Res. Commun. 373(4):457-461; herein incorporated by reference in their entireties.
[0094] In another embodiment, detection of the mutated sequence is performed using allele-specific amplification. In the case of PCR, amplification primers can be designed to bind to a portion of one of the disclosed genes, and the terminal base at the 3’ end is used to discriminate between the major and minor alleles or mutant and wild-type forms of the genes. If the terminal base matches the major or minor allele, polymerase-dependent 3’ extension can proceed. Amplification products can be detected with specific probes. This method for detecting point mutations or polymorphisms is described in detail by Sommer et al. in Mayo Clin. Proc. 64:1361-1372 (1989).
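The 3'-terminal discrimination described above can be illustrated with a simplified sketch. It is not a model of polymerase kinetics: it assumes that the rest of the primer anneals, that extension proceeds only when the primer's 3'-terminal base matches the base at the variant site, and that each sequence is written 5' to 3' on the strand having the same sequence as the primer. The function names and example sequences are hypothetical.

```python
from typing import Dict, Iterable, Set


def three_prime_match(primer: str, strand: str, variant_index: int) -> bool:
    """True if the primer's 3'-terminal base equals the base at the variant site.

    Simplifying assumption: the primer ends exactly at variant_index and the
    remainder of the primer is taken to anneal without mismatches.
    """
    if not primer:
        raise ValueError("primer must not be empty")
    if not 0 <= variant_index < len(strand):
        raise ValueError("variant_index is outside the strand")
    return primer[-1].upper() == strand[variant_index].upper()


def detected_alleles(primers: Dict[str, str],
                     strands: Iterable[str],
                     variant_index: int) -> Set[str]:
    """Alleles whose allele-specific primer could extend on at least one strand."""
    strand_list = list(strands)
    return {
        allele
        for allele, primer in primers.items()
        if any(three_prime_match(primer, s, variant_index) for s in strand_list)
    }
```

With one primer per allele, a sample carrying a single allele yields a product for only that allele's primer, while a heterozygous sample yields products for both.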
[0095] Tetra-primer ARMS-PCR uses two pairs of primers that can amplify two alleles of a gene in one PCR reaction. Allele-specific primers are used that hybridize at the location of the mutated sequence, but each matches perfectly to only one of the possible alleles. If a given allele is present in the PCR reaction, the primer pair specific to that allele will amplify that allele, but not the other allele of the gene. The two primer pairs for the different alleles may be designed such that their PCR products are of significantly different length, which allows them to be distinguished readily by gel electrophoresis. See, e.g., Munoz et al. (2009) J. Microbiol. Methods. 78(2):245-246 and Chiapparino et al. (2004) Genome. 47(2):414-420; herein incorporated by reference.
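Because the two allele-specific products differ in length, a genotype can be read from the band sizes seen on the gel. The sketch below illustrates that read-out under stated assumptions: the expected product lengths, the size tolerance, and the function name are hypothetical, and any band that matches neither expected length (for example, a longer control product from the outer primers) is simply ignored.

```python
from typing import Dict, Iterable, Optional, Tuple


def call_genotype(observed_bands: Iterable[int],
                  allele_product_lengths: Dict[str, int],
                  tolerance: int = 10) -> Optional[Tuple[str, str]]:
    """Call a diploid genotype from gel band sizes (in base pairs).

    An allele is scored as present when some observed band lies within
    `tolerance` bp of that allele's expected product length.
    Returns (a, a) for a homozygote, (a, b) for a heterozygote, and None
    when no allele-specific band is seen or the pattern is ambiguous.
    """
    bands = list(observed_bands)
    present = sorted(
        allele
        for allele, length in allele_product_lengths.items()
        if any(abs(band - length) <= tolerance for band in bands)
    )
    if len(present) == 1:
        return (present[0], present[0])
    if len(present) == 2:
        return (present[0], present[1])
    return None
```

The tolerance should be smaller than half the difference between the two expected product lengths, so that a single band cannot be assigned to both alleles.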
[0096] Mutations in a gene may also be detected by ligase chain reaction (LCR) or ligase detection reaction (LDR). The specificity of the ligation reaction is used to discriminate between the major and minor alleles of a gene. Two probes are hybridized at the site of the mutation in a nucleic acid of interest, whereby ligation can only occur if the probes are perfectly complementary to the target sequence. See, e.g., Psifidi et al. (2011) PLoS One 6(1):e14560; Asari et al. (2010) Mol. Cell. Probes. 24(6):381-386; Lowe et al. (2010) Anal Chem. 82(13):5810-5814; herein incorporated by reference.
[0097] As another example, an array comprising probes for detecting mutant alleles can be used. For example, SNP arrays are commercially available from Affymetrix and Illumina, which use multiple sets of short oligonucleotide probes for detecting known SNPs. The design of SNP arrays, such as manufactured by Affymetrix or Illumina, is described further in LaFramboise, "Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances," Nuc. Acids Res. 37(13):4181-4193 (2009).
[0098] Another method that can be used for detection of mutant alleles is PCR-dynamic allele specific hybridization (DASH), which involves dynamic heating and coincident monitoring of DNA denaturation, as disclosed by Howell et al. (Nat. Biotech. 17:87-88, 1999). A target sequence is amplified (e.g., by PCR) using one biotinylated primer. The biotinylated product strand is bound to a streptavidin-coated microtiter plate well (or other suitable surface), and the non-biotinylated strand is rinsed away with alkali wash solution. An oligonucleotide probe, specific for one allele (e.g., the wild-type allele), is hybridized to the target at low temperature. This probe forms a duplex DNA region that interacts with a double strand-specific intercalating dye. When subsequently excited, the dye emits fluorescence proportional to the amount of double-stranded DNA (probe-target duplex) present. The sample is then steadily heated while fluorescence is continually monitored. A rapid fall in fluorescence indicates the denaturing temperature of the probe-target duplex. Using this technique, a single-base mismatch between the probe and target results in a significant lowering of melting temperature (Tm) that can be readily detected.
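The melting-curve read-out described above can be sketched as follows. The sketch takes the melting temperature to be the midpoint of the temperature interval over which fluorescence falls most steeply, which is one simple way to locate the "rapid fall"; the function name and the example readings are hypothetical and are not measured data.

```python
from typing import Optional, Sequence


def melting_temperature(temps: Sequence[float],
                        fluorescence: Sequence[float]) -> Optional[float]:
    """Estimate Tm as the midpoint of the steepest fluorescence drop.

    temps must be strictly increasing (the sample is heated steadily).
    Returns None if fluorescence never decreases.
    """
    if len(temps) != len(fluorescence) or len(temps) < 2:
        raise ValueError("need two or more paired readings")
    steepest_slope = 0.0
    tm = None
    for i in range(len(temps) - 1):
        dt = temps[i + 1] - temps[i]
        if dt <= 0:
            raise ValueError("temperatures must be strictly increasing")
        slope = (fluorescence[i + 1] - fluorescence[i]) / dt
        if slope < steepest_slope:  # more negative means a faster fall
            steepest_slope = slope
            tm = (temps[i] + temps[i + 1]) / 2.0
    return tm
```

Comparing the estimate for a sample against that of a known perfectly matched probe-target duplex indicates whether a mismatch is present, since a single-base mismatch lowers the melting temperature.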
[0099] A variety of other techniques can be used to detect mutations, including but not limited to, the Invader assay with Flap endonuclease (FEN), the Serial Invasive Signal Amplification Reaction (SISAR), the oligonucleotide ligase assay, restriction fragment length polymorphism (RFLP), single-strand conformation polymorphism, temperature gradient gel electrophoresis (TGGE), and denaturing high performance liquid chromatography (DHPLC). See, for example, Molecular Analysis and Genome Discovery (R. Rapley and S. Harbron eds., Wiley 1st edition, 2004); Jones et al. (2009) New Phytol. 183(4):935-966; Kwok et al. (2003) Curr Issues Mol. Biol. 5(2):43-60; Munoz et al. (2009) J. Microbiol. Methods. 78(2):245-246; Chiapparino et al. (2004) Genome. 47(2):414-420; Olivier (2005) Mutat Res. 573(1-2):103-110; Hsu et al. (2001) Clin. Chem. 47(8):1373-1377; Hall et al. (2000) Proc. Natl. Acad. Sci. U.S.A. 97(15):8272-8277; Li et al. (2011) J. Nanosci. Nanotechnol. 11(2):994-1003; Tang et al. (2009) Hum. Mutat. 30(10):1460-1468; Chuang et al. (2008) Anticancer Res. 28(4A):2001-2007; Chang et al. (2006) BMC Genomics 7:30; Galeano et al. (2009) BMC Genomics 10:629; Larsen et al. (2001) Pharmacogenomics 2(4):387-399; Yu et al. (2006) Curr. Protoc. Hum. Genet. Chapter 7: Unit 7.10; Lilleberg (2003) Curr. Opin. Drug Discov. Devel. 6(2):237-252; and U.S. Pat. Nos. 4,666,828; 4,801,531; 5,110,920; 5,268,267; 5,387,506; 5,691,153; 5,698,339; 5,736,330; 5,834,200; 5,922,542; and 5,998,137 for a description of such methods; herein incorporated by reference in their entireties.
[00100] If the mutation is located in the coding region, the mutation can be identified indirectly by detection of the variant protein produced by the mutant allele. Variant proteins (i.e. , containing an amino acid substitution encoded by the mutant allele) can be detected using antibodies specific for the variant protein. For example, immunoassays that can be used to detect variant proteins produced by mutant alleles include, but are not limited to, immunohistochemistry (IHC), western blotting, enzyme-linked immunosorbent assay (ELISA), radioimmunoassays (RIA), "sandwich" immunoassays, fluorescent immunoassays, and immunoprecipitation assays, the procedures of which are well known in the art (see, e.g., Schwarz et al. (2010) Clin. Chem. Lab. Med. 48(12): 1745-1749; The Immunoassay Handbook (D.G. Wild ed., Elsevier Science; 3rd edition, 2005); Ausubel et al, eds, 1994, Current Protocols in Molecular Biology, Vol. 1 (John Wiley & Sons, Inc., New York); Coligan Current Protocols in Immunology (1991); Harlow & Lane, Antibodies: A Laboratory Manual (1988); Handbook of Experimental Immunology, Vols. I-IV (D.M. Weir and C.C. Blackwell eds., Blackwell Scientific Publications); herein incorporated by reference herein in their entireties).
[00101] In certain embodiments, a probe set is used, wherein the probe set comprises a plurality of allele-specific probes for detecting mutations in the subject's genome. The probe set may comprise one or more allele-specific polynucleotide probes. An allele-specific probe hybridizes to only one of the possible alleles of a gene under suitably stringent hybridization conditions. Individual polynucleotide probes comprise a nucleotide sequence derived from the nucleotide sequence of the target mutated allele sequences or complementary sequences thereof. The nucleotide sequence of the polynucleotide probe is designed such that it corresponds to, or is complementary to, the target mutated allele sequences. The allele-specific polynucleotide probe can specifically hybridize under either stringent or lowered stringency hybridization conditions to a region of the target mutated allele sequences, to the complement thereof, or to a nucleic acid sequence (such as a cDNA) derived therefrom.
[00102] The selection of the allele-specific polynucleotide probe sequences and determination of their uniqueness may be carried out in silico using techniques known in the art, for example, based on a BLASTN search of the polynucleotide sequence in question against gene sequence databases, such as the Human Genome Sequence, UniGene, dbEST or the non-redundant database at NCBI. In one embodiment of the invention, the allele-specific polynucleotide probe is complementary to the region of a single mutated allele target DNA or mRNA sequence. Computer programs can also be employed to select allele-specific probe sequences that may not cross hybridize or may not hybridize non-specifically.
[00103] The allele-specific polynucleotide probes of the present invention may range in length from about 15 nucleotides to the full length of the coding target or non-coding target. In one embodiment of the invention, the polynucleotide probes are at least about 15 nucleotides in length. In another embodiment, the polynucleotide probes are at least about 20 nucleotides in length. In a further embodiment, the polynucleotide probes are at least about 25 nucleotides in length. In another embodiment, the polynucleotide probes are between about 15 nucleotides and about 500 nucleotides in length. In other embodiments, the polynucleotide probes are between about 15 nucleotides and about 450 nucleotides, about 15 nucleotides and about 400 nucleotides, about 15 nucleotides and about 350 nucleotides, about 15 nucleotides and about 300 nucleotides, about 15 nucleotides and about 250 nucleotides, or about 15 nucleotides and about 200 nucleotides in length. In some embodiments, the probes are at least 15 nucleotides in length. In some embodiments, the probes are at least 20 nucleotides, at least 25 nucleotides, at least 50 nucleotides, at least 75 nucleotides, at least 100 nucleotides, at least 125 nucleotides, at least 150 nucleotides, at least 200 nucleotides, at least 225 nucleotides, at least 250 nucleotides, at least 275 nucleotides, at least 300 nucleotides, at least 325 nucleotides, at least 350 nucleotides, or at least 375 nucleotides in length.
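By way of illustration only, the construction of an allele-specific probe sequence for a single-base substitution can be sketched as follows. This is a minimal sketch, not a validated design procedure: the function names, the 0-based position convention, the default length of 25 nucleotides, and the choice to center the probe on the variant are assumptions made for the example, and no hybridization thermodynamics or uniqueness search is modeled.

```python
# Illustrative sketch: build a probe that matches the mutated allele.
# All names and defaults here are hypothetical, not part of the disclosure.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def allele_specific_probe(reference: str, pos: int, alt: str, length: int = 25) -> str:
    """Return a probe of `length` nt spanning a single-base substitution.

    reference: reference sequence, 5'->3'
    pos:       0-based index of the substituted base
    alt:       the variant base carried by the mutated allele
    """
    if length < 15:
        raise ValueError("probe length below the approximately 15 nt minimum")
    if len(reference) < length:
        raise ValueError("reference is shorter than the requested probe")
    if not 0 <= pos < len(reference):
        raise ValueError("variant position lies outside the reference")
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base")

    # Apply the substitution, then cut a window roughly centered on it.
    mutated = reference[:pos] + alt + reference[pos + 1:]
    start = max(0, min(pos - length // 2, len(mutated) - length))
    return mutated[start:start + length]
```

A probe complementary to the mutated allele, rather than corresponding to it, would be obtained by passing the result through `reverse_complement`.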
[00104] The allele-specific polynucleotide probes of a probe set can comprise RNA, DNA, RNA or DNA mimetics, or combinations thereof, and can be single-stranded or double-stranded. Thus, the polynucleotide probes can be composed of naturally-occurring nucleobases, sugars and covalent internucleoside (backbone) linkages as well as polynucleotide probes having non-naturally-occurring portions which function similarly. Such modified or substituted polynucleotide probes may provide desirable properties such as, for example, enhanced affinity for a target gene and increased stability. The probe set may comprise a coding target and/or a non-coding target. Preferably, the probe set comprises a combination of a coding target and non-coding target.
[00105] In another embodiment, a set of allele-specific primers is used, wherein the set of allele-specific primers comprises a plurality of allele-specific primers for detecting mutations in the subject's genome. An allele-specific primer matches the sequence exactly of only one of the possible mutated alleles, hybridizes at the location of the mutation, and amplifies only one specific mutated allele if it is present in a nucleic acid amplification reaction. For use in amplification reactions such as PCR, a pair of primers can be used for detection of a mutated allele sequence. Each primer is designed to hybridize selectively to a single allele at the site of the mutation in the gene under stringent conditions, particularly under conditions of high stringency, as known in the art. The pairs of allele-specific primers are usually chosen so as to generate an amplification product of at least about 50 nucleotides, more usually at least about 100 nucleotides. Algorithms for the selection of primer sequences are generally known, and are available in commercial software packages. These primers may be used in standard quantitative or qualitative PCR-based assays for SNP genotyping of subjects. Alternatively, these primers may be used in combination with probes, such as molecular beacons in amplifications using real-time PCR.
[00106] A label can optionally be attached to or incorporated into an allele-specific probe or primer polynucleotide to allow detection and/or quantitation of a target mutated allele sequence. The target mutated polynucleotide may be from genomic DNA, expressed RNA, a cDNA copy thereof, or an amplification product derived therefrom, and may be the positive or negative strand, so long as it can be specifically detected in the assay being used. Similarly, an antibody may be labeled that detects a polypeptide expression product of the mutated allele.
[00107] In certain multiplex formats, labels used for detecting different mutant alleles may be distinguishable. The label can be attached directly (e.g., via covalent linkage) or indirectly, e.g., via a bridging molecule or series of molecules (e.g., a molecule or complex that can bind to an assay component, or via members of a binding pair that can be incorporated into assay components, e.g. biotin-avidin or streptavidin). Many labels are commercially available in activated forms which can readily be used for such conjugation (for example through amine acylation), or labels may be attached through known or determinable conjugation schemes, many of which are known in the art.
[00108] Detectable labels useful in the practice of the invention may include any molecule or substance capable of detection, including, but not limited to, fluorescers, chemiluminescers, chromophores, bioluminescent proteins, enzymes, enzyme substrates, enzyme cofactors, enzyme inhibitors, isotopic labels, semiconductor nanoparticles, dyes, metal ions, metal sols, ligands (e.g., biotin, streptavidin or haptens) and the like. The term "fluorescer" refers to a substance or a portion thereof which is capable of exhibiting fluorescence in the detectable range. Particular examples of labels which may be used in the practice of the invention include, but are not limited to, SYBR green, SYBR gold, a CAL Fluor dye such as CAL Fluor Gold 540, CAL Fluor Orange 560, CAL Fluor Red 590, CAL Fluor Red 610, and CAL Fluor Red 635, a Quasar dye such as Quasar 570, Quasar 670, and Quasar 705, an Alexa Fluor such as Alexa Fluor 350, Alexa Fluor 488, Alexa Fluor 546, Alexa Fluor 555, Alexa Fluor 594, Alexa Fluor 647, and Alexa Fluor 784, a cyanine dye such as Cy3, Cy3.5, Cy5, Cy5.5, and Cy7, fluorescein, 2', 4', 5', 7'-tetrachloro-4-7-dichlorofluorescein (TET), carboxyfluorescein (FAM), 6-carboxy-4',5'-dichloro-2',7'-dimethoxyfluorescein (JOE), hexachlorofluorescein (HEX), rhodamine, carboxy-X-rhodamine (ROX), tetramethyl rhodamine (TAMRA), FITC, dansyl, umbelliferone, dimethyl acridinium ester (DMAE), Texas red, luminol, and quantum dots, enzymes such as alkaline phosphatase (AP), beta-lactamase, chloramphenicol acetyltransferase (CAT), adenosine deaminase (ADA), aminoglycoside phosphotransferase (neo^r, G418^r), dihydrofolate reductase (DHFR), hygromycin-B-phosphotransferase (HPH), thymidine kinase (TK), β-galactosidase (lacZ), and xanthine guanine phosphoribosyltransferase (XGPRT), beta-glucuronidase (gus), placental alkaline phosphatase (PLAP), and secreted embryonic alkaline phosphatase (SEAP). Enzyme tags are used with their cognate substrate.
The terms also include chemiluminescent labels such as luminol, isoluminol, acridinium esters, and peroxyoxalate and bioluminescent proteins such as firefly luciferase, bacterial luciferase, Renilla luciferase, and aequorin. The terms also include isotopic labels, including radioactive and non-radioactive isotopes, such as 3H, 2H, 120I, 123I, 124I, 125I, 131I, 35S, 11C, 13C, 14C, 32P, 15N, 13N, 110In, 111In, 177Lu, 18F, 52Fe, 62Cu, 64Cu, 67Cu, 67Ga, 68Ga, 86Y, 90Y, 89Zr, 94mTc, 94Tc, 99mTc, 154Gd, 155Gd, 156Gd, 157Gd, 158Gd, 15O, 186Re, 188Re, 51Mn, 52mMn, 55Co, 72As, 75Br, 76Br, 82mRb, and 83Sr. The terms also include color-coded microspheres of known fluorescent light intensities (see e.g., microspheres with xMAP technology produced by Luminex (Austin, TX)); microspheres containing quantum dot nanocrystals, for example, containing different ratios and combinations of quantum dot colors (e.g., Qdot nanocrystals produced by Life Technologies (Carlsbad, CA)); glass coated metal nanoparticles (see e.g., SERS nanotags produced by Nanoplex Technologies, Inc. (Mountain View, CA)); barcode materials (see e.g., sub-micron sized striped metallic rods such as Nanobarcodes produced by Nanoplex Technologies, Inc.), encoded microparticles with colored bar codes (see e.g., CellCard produced by Vitra Bioscience, vitrabio.com), glass microparticles with digital holographic code images (see e.g., CyVera microbeads produced by Illumina (San Diego, CA)), near infrared (NIR) probes, and nanoshells. The terms also include contrast agents such as ultrasound contrast agents (e.g.,
SonoVue microbubbles comprising sulfur hexafluoride, Optison microbubbles comprising an albumin shell and octafluoropropane gas core, Levovist microbubbles comprising a lipid/galactose shell and an air core, Perflexane lipid microspheres comprising perfluorocarbon microbubbles, and Perflutren lipid microspheres comprising octafluoropropane encapsulated in an outer lipid shell), magnetic resonance imaging (MRI) contrast agents (e.g., gadodiamide, gadobenic acid, gadopentetic acid, gadoteridol, gadofosveset, gadoversetamide, gadoxetic acid), and radiocontrast agents, such as for computed tomography (CT), radiography, or fluoroscopy (e.g., diatrizoic acid, metrizoic acid, iodamide, iotalamic acid, ioxitalamic acid, ioglicic acid, acetrizoic acid, iocarmic acid, methiodal, diodone, metrizamide, iohexol, ioxaglic acid, iopamidol, iopromide, iotrolan, ioversol, iopentol, iodixanol, iomeprol, iobitridol, ioxilan, iodoxamic acid, iotroxic acid, ioglycamic acid, adipiodone, iobenzamic acid, iopanoic acid, iocetamic acid, sodium iopodate, tyropanoic acid, and calcium iopodate). As with many of the standard procedures associated with the practice of the invention, skilled artisans will be aware of additional labels that can be used.
[00109] Genotyping may also comprise sequencing nucleic acids from a sample collected from an individual using any convenient sequencing protocol. Sequencing platforms that can be used include but are not limited to: pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, second-generation sequencing, nanopore sequencing, sequencing by ligation, or sequencing by hybridization. Preferred sequencing platforms are those commercially available from Illumina (RNA-Seq) and Helicos (Digital Gene Expression or “DGE”). “Next generation” sequencing methods include, but are not limited to those commercialized by: 1) 454/Roche Lifesciences including but not limited to the methods and apparatus described in Margulies et al., Nature (2005) 437:376-380; and US Patent Nos. 7,244,559; 7,335,762; 7,211,390; 7,244,567; 7,264,929; 7,323,305; 2) Helicos BioSciences Corporation (Cambridge, MA) as described in U.S. application Ser. No. 11/167046, and US Patent Nos. 7501245; 7491498; 7,276,720; and in U.S. Patent Application Publication Nos. US20090061439; US20080087826; US20060286566; US20060024711; US20060024678; US20080213770; and US20080103058; 3) Applied Biosystems (e.g. SOLiD sequencing); 4) Dover Systems (e.g., Polonator G.007 sequencing); 5) Illumina as described in US Patent Nos. 5,750,341; 6,306,597; and 5,969,119; and 6) Pacific Biosciences as described in US Patent Nos. 7,462,452; 7,476,504; 7,405,281; 7,170,050; 7,462,468; 7,476,503; 7,315,019; 7,302,146; 7,313,308; and US Application Publication Nos. US20090029385; US20090068655; US20090024331; and US20080206764. All references are herein incorporated by reference. Such methods and apparatuses are provided here by way of example and are not intended to be limiting.
[00110] Genetic testing services exist, which provide full genome sequencing using massively parallel sequencing. Massively parallel sequencing is described, e.g., in US 5,695,934, entitled "Massively parallel sequencing of sorted polynucleotides," and US 2010/0113283 A1, entitled "Massively multiplexed sequencing." Massively parallel sequencing typically involves obtaining DNA representing an entire genome, fragmenting it, and obtaining millions of random short sequences, which are assembled by mapping them to a reference genome sequence. Commercial services are available that are capable of genotyping approximately 1 million sequences for a fixed fee.
[00111] Genetic analysis can be carried out with a variety of methods that do not involve massively parallel random sequencing. For example, a commercially available MassARRAY system can be used. This system uses matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) coupled with single-base extension PCR for high-throughput multiplex detection of mutations. Another commercial system, the Illumina Golden Gate assay, generates mutation-specific PCR products that are subsequently hybridized to beads either on a solid matrix or in solution. Three oligonucleotides are synthesized for each mutant: two allele specific oligonucleotides (ASOs) that distinguish the mutated sequence, and a locus specific sequence (LSO) just downstream of the mutation site. The ASO and LSO sequences also contain target sequences for a set of universal primers, while each LSO also contains a particular address sequence (the "illumicode") complementary to sequences attached to beads.
[00112] In some embodiments, gene duplication or genomic copy number variation is detected. For example, 1, 2, 3, 4, 5, or 6 or more copies of a polynucleotide sequence may be present in the genome of a subject. Copy number variation can be calculated based on "relative copy number" so that apparent differences in gene copy numbers in different samples are not distorted by differences in sample amounts. The relative copy number of a gene (per genome) can be expressed as the ratio of the copy number of a target gene to the copy number of a reference polynucleotide sequence in a DNA sample. The reference polynucleotide sequence can be a sequence having a known genomic copy number. Typically the reference sequence will have a single genomic copy and is a sequence that is not likely to be amplified or deleted in the genome. It is not necessary to empirically determine the copy number of a reference sequence in each assay. Rather, the copy number may be assumed based on the normal copy number in the organism of interest.
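By way of illustration only, the relative copy number described above can be sketched as a simple ratio. The sketch assumes that the target and reference measurements are signals that scale linearly with the amount of each sequence in the sample (for example, normalized read depth); the function name and the default of one reference copy per genome are assumptions of the example.

```python
# Illustrative sketch: relative copy number as a ratio to a reference sequence.
# The signal inputs and the function name are hypothetical.

def relative_copy_number(target_signal: float,
                         reference_signal: float,
                         reference_copies: float = 1.0) -> float:
    """Estimate the per-genome copy number of a target sequence.

    Dividing by the reference signal cancels the effect of sample amount,
    because both signals scale with the quantity of DNA in the assay.
    """
    if reference_signal <= 0:
        raise ValueError("reference signal must be positive")
    return reference_copies * target_signal / reference_signal
```

Because the ratio is taken within each sample, doubling the input DNA doubles both signals and leaves the estimate unchanged.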
Data Analysis
[00113] In some embodiments, one or more pattern recognition methods can be used in automating analysis of genetic data and generating a predictive model. The predictive models and/or algorithms can be provided in a machine readable format and may be used to correlate genetic variants identified in a patient with a disease state, medically relevant trait, or a change in a clinical biomarker measurement. Generating the predictive model may comprise, for example, the use of an algorithm or classifier.
[00114] In some embodiments, a machine learning algorithm is used in generating the predictive model. The machine learning algorithm may comprise a supervised learning algorithm. Examples of supervised learning algorithms may include Average One-Dependence Estimators (AODE), Artificial neural network (e.g., Backpropagation), Bayesian statistics (e.g., Naive Bayes classifier, Bayesian network, Bayesian knowledge base), Case-based reasoning, Decision trees, Inductive logic programming, Gaussian process regression, Group method of data handling (GMDH), Learning Automata, Learning Vector Quantization, Minimum message length (decision trees, decision graphs, etc.), Lazy learning, Instance-based learning, Nearest Neighbor Algorithm, Analogical modeling, Probably approximately correct (PAC) learning, Ripple down rules, a knowledge acquisition methodology, Symbolic machine learning algorithms, Subsymbolic machine learning algorithms, Support vector machines, Random Forests, Ensembles of classifiers, Bootstrap aggregating (bagging), and Boosting. Supervised learning may comprise ordinal classification such as regression analysis and Information fuzzy networks (IFN). Alternatively, supervised learning methods may comprise statistical classification, such as AODE, Linear classifiers (e.g., Fisher's linear discriminant, Logistic regression, Naive Bayes classifier, Perceptron, and Support vector machine), quadratic classifiers, k-nearest neighbor, Boosting, Decision trees (e.g., C4.5, Random forests), Bayesian networks, and Hidden Markov models.
[00115] The machine learning algorithm may also comprise an unsupervised learning algorithm. Examples of unsupervised learning algorithms may include artificial neural network, Data clustering, Expectation-maximization algorithm, Self-organizing map, Radial basis function network, Vector Quantization, Generative topographic map, Information bottleneck method, and IBSEAD. Unsupervised learning may also comprise association rule learning algorithms such as Apriori algorithm, Eclat algorithm and FP-growth algorithm. Hierarchical clustering, such as Single-linkage clustering and Conceptual clustering, may also be used. Alternatively, unsupervised learning may comprise partitional clustering such as K-means algorithm and Fuzzy clustering.
[00116] In some instances, the machine learning algorithms comprise a reinforcement learning algorithm. Examples of reinforcement learning algorithms include, but are not limited to, temporal difference learning, Q-learning and Learning Automata. Alternatively, the machine learning algorithm may comprise Data Pre-processing.
[00117] In certain embodiments, the machine learning algorithms include, but are not limited to, Average One-Dependence Estimators (AODE), Fisher's linear discriminant, Logistic regression, Perceptron, Multilayer Perceptron, Artificial Neural Networks, Support vector machines, Quadratic classifiers, Boosting, Decision trees, C4.5, Bayesian networks, Hidden Markov models, High-Dimensional Discriminant Analysis, and Gaussian Mixture Models. The machine learning algorithm may comprise support vector machines, Naive Bayes classifier, k-nearest neighbor, high-dimensional discriminant analysis, or Gaussian mixture models. In some instances, the machine learning algorithm comprises Random Forests.
[00118] In some embodiments, the predictive model is based on at least one polygenic risk score for a genetic association with a size effect on a clinical biomarker measurement and at least one polygenic risk score for a genetic association with a disease or medically relevant trait, wherein a combined risk score is calculated (see Examples). Such combined polygenic risk scores generally better predict the risk of an individual developing the disease or the medically relevant trait than the separate polygenic risk scores.
System and Computer Implemented Methods for Predicting the Risk of an Individual Developing a Polygenic Disease or Medically Relevant Trait
[00119] In a further aspect, the invention includes a computer implemented method for predicting the risk of an individual developing a polygenic disease or medically relevant trait. The computer performs steps comprising a) receiving genome sequencing data for an individual; b) identifying variant alleles present in the genome of the individual from the genome sequencing data; c) calculating at least one polygenic risk score based on the variant alleles present in the individual using a database comprising correlation data for associations between genetic variants and diseases or medically relevant traits based on genome-wide testing of a population for genetic variants associated with the disease or the medically relevant trait, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait; and d) displaying information regarding the risk of the individual developing the disease or the medically relevant trait.
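By way of illustration only, step c) can be sketched with the common additive form of a polygenic risk score, in which each variant allele contributes its dosage (0, 1, or 2 copies) multiplied by an effect size taken from the association database. This is a minimal sketch under stated assumptions: the additive form, the function name, the variant identifiers, and the treatment of an ungenotyped variant as dosage 0 are choices made for the example and are not specified above.

```python
# Illustrative sketch: additive polygenic risk score.
# Variant identifiers and effect sizes are hypothetical placeholders.
from typing import Mapping


def polygenic_risk_score(dosages: Mapping[str, float],
                         effect_sizes: Mapping[str, float]) -> float:
    """Sum effect size x allele dosage over every variant in the database.

    dosages:      variant id -> number of copies of the effect allele (0-2)
    effect_sizes: variant id -> per-allele effect from the association data
    Variants absent from `dosages` are treated as dosage 0 (an assumption).
    """
    return sum(beta * dosages.get(variant, 0.0)
               for variant, beta in effect_sizes.items())


if __name__ == "__main__":
    weights = {"var_A": 0.5, "var_B": -0.25, "var_C": 1.0}
    individual = {"var_A": 2, "var_B": 1}
    print(polygenic_risk_score(individual, weights))  # 0.75
```

In practice the missing-genotype rule matters: imputing the population mean dosage instead of 0 is a frequent alternative and would change the score.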
[00120] In certain embodiments, the individual has a plurality of variant alleles selected from Tables 5-10 and 13. In certain embodiments, the database comprises correlation data between genetic variants and clinical biomarkers, diseases, and medically relevant traits, wherein the correlation data is selected from Tables 4-10 and 13.
[00121] In certain embodiments, the computer implemented method further comprises: a) generating a predictive model using one or more algorithms, wherein the predictive model is based on at least one PRS for a genetic association with a size effect on a clinical biomarker measurement and at least one PRS for a genetic association with a disease or a medically relevant trait; and b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS. In certain embodiments, one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm. For example, a machine learning algorithm may be used including without limitation a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
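By way of illustration only, the combination of a biomarker PRS and a disease PRS into a single risk score can be sketched with a logistic regression fitted by gradient descent. Logistic regression is used here only because it is compact; it stands in for the classification, regression, or machine learning algorithms named above and is not asserted to be the model actually used. The function names, learning rate, epoch count, and toy data are assumptions of the example.

```python
# Illustrative sketch: combine several PRS values into one risk score.
# Plain-Python logistic regression; all names and settings are hypothetical.
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_combined_model(features: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       lr: float = 0.1,
                       epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss.

    features: one row per individual, e.g. (biomarker PRS, disease PRS)
    labels:   1 if the individual has the disease or trait, else 0
    """
    n, k = len(features), len(features[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def combined_risk_score(x: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Return the modeled probability for one individual's PRS values."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
```

In the toy setting where neither score separates cases from controls on its own, the fitted combination can still rank them correctly, which is the sense in which a combined score may outperform each separate PRS.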
[00122] In certain embodiments, the computer implemented method further comprises storing the information regarding the risk of the individual developing the disease or the medically relevant phenotypic trait in a database.
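By way of illustration only, storing the calculated risk information in a database can be sketched with Python's built-in `sqlite3` module. The table name, column names, and one-row-per-individual-and-trait layout are assumptions of the example; any relational or non-relational store could serve the same purpose.

```python
# Illustrative sketch: persist risk scores in a relational table.
# Table and column names are hypothetical.
import sqlite3
from typing import Optional


def store_risk(conn: sqlite3.Connection, individual_id: str,
               trait: str, score: float) -> None:
    """Insert or update the risk score for one individual and trait."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS risk_scores ("
        " individual_id TEXT NOT NULL,"
        " trait TEXT NOT NULL,"
        " score REAL NOT NULL,"
        " PRIMARY KEY (individual_id, trait))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO risk_scores VALUES (?, ?, ?)",
        (individual_id, trait, score),
    )
    conn.commit()


def load_risk(conn: sqlite3.Connection, individual_id: str,
              trait: str) -> Optional[float]:
    """Return the stored score, or None if no record exists."""
    row = conn.execute(
        "SELECT score FROM risk_scores WHERE individual_id = ? AND trait = ?",
        (individual_id, trait),
    ).fetchone()
    return None if row is None else row[0]
```

Parameterized queries (the `?` placeholders) are used so that identifiers supplied at run time are never interpolated into the SQL text.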
[00123] The method can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware. The disclosed and other embodiments can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, a data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or any combination thereof.
[00124] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
[00125] In a further aspect, a system for performing the computer implemented method, as described, is provided. Such a system includes a computer containing a processor, a storage component (i.e., memory), a display component, and other components typically present in general purpose computers. The storage component stores information accessible by the processor, including instructions that may be executed by the processor and data that may be retrieved, manipulated or stored by the processor.
[00126] The storage component includes instructions. For example, the storage component includes instructions for predicting the risk of an individual developing a disease or medically relevant phenotypic trait based on analysis of genomic sequencing data stored therein. The computer processor is coupled to the storage component and configured to execute the instructions stored in the storage component in order to receive genome sequencing data and analyze the data according to one or more algorithms, as described herein. The display component displays information regarding the risk of the individual developing the disease or the medically relevant trait.
[00127] The storage component may be of any type capable of storing information accessible by the processor, such as a hard-drive, memory card, ROM, RAM, DVD, CD-ROM, USB Flash drive, write-capable, and read-only memories. The processor may be any well-known processor, such as processors from Intel Corporation. Alternatively, the processor may be a dedicated controller such as an ASIC.
[00128] The instructions may be any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor. In that regard, the terms "instructions," "steps" and "programs" may be used interchangeably herein. The instructions may be stored in object code form for direct processing by the processor, or in any other computer language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance.
[00129] Data may be retrieved, stored or modified by the processor in accordance with the instructions. For instance, although the system is not limited by any particular data structure, the data may be stored in computer registers, in a relational database as a table having a plurality of different fields and records, XML documents, or flat files. The data may also be formatted in any computer-readable format such as, but not limited to, binary values, ASCII or Unicode. Moreover, the data may comprise any information sufficient to identify the relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories (including other network locations) or information which is used by a function to calculate the relevant data.
[00130] In certain embodiments, the processor and storage component may comprise multiple processors and storage components that may or may not be stored within the same physical housing. For example, some of the instructions and data may be stored on removable CD-ROM and others within a read-only computer chip. Some or all of the instructions and data may be stored in a location physically remote from, yet still accessible by, the processor. Similarly, the processor may comprise a collection of processors which may or may not operate in parallel.
Kits
[00131] Kits are also provided for carrying out the methods described herein. In some embodiments, the kit comprises software for carrying out the computer implemented methods for predicting the risk of an individual developing a disease or medically relevant trait, as described herein. In some embodiments, the kit comprises a diagnostic system for predicting the risk of an individual developing a disease or medically relevant trait, as described herein. In some embodiments, the kit further comprises a container for collecting a DNA sample from an individual. The kit may also include reagents for purifying and/or sequencing a DNA sample.
[00132] In addition, the kits may further include (in certain embodiments) instructions for practicing the subject methods. These instructions may be present in the subject kits in a variety of forms, one or more of which may be present in the kit. For example, instructions may be present as printed information on a suitable medium or substrate, e.g., a piece or pieces of paper on which the information is printed, in the packaging of the kit, in a package insert, and the like. Another form of these instructions is a computer readable medium, e.g., diskette, compact disk (CD), flash drive, and the like, on which the information has been recorded. Yet another form of these instructions that may be present is a website address which may be used via the internet to access the information at a removed site.
EXPERIMENTAL
[00133] The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to make and use the present invention, and are not intended to limit the scope of what the inventors regard as their invention nor are they intended to represent that the experiments below are all or the only experiments performed. Efforts have been made to ensure accuracy with respect to numbers used (e.g. amounts, temperature, etc.) but some experimental errors and deviations should be accounted for. Unless indicated otherwise, parts are parts by weight, molecular weight is weight average molecular weight, temperature is in degrees Centigrade, and pressure is at or near atmospheric.
[00134] All publications and patent applications cited in this specification are herein incorporated by reference as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference.
[00135] The present invention has been described in terms of particular embodiments found or proposed by the present inventor to comprise preferred modes for the practice of the invention. It will be appreciated by those of skill in the art that, in light of the present disclosure, numerous modifications and changes can be made in the particular embodiments exemplified without departing from the intended scope of the invention. All such modifications are intended to be included within the scope of the appended claims.
Example 1
Quantifying the Genetic Basis of Serum and Urine Laboratory Phenotypes
Introduction
[00136] Laboratory phenotypes are a primary clinical means of diagnosing metabolic and cardiovascular traits and serve as an important monitor to their ongoing care. As such, understanding the possible confounders and predisposition to particular phenotype measurements has implications for all aspects of disease treatment. To this end, the UK Biobank has performed laboratory testing of more than 30 proteins, metabolites, and modifications in serum and urine on a cohort of more than 480,000 individuals. This cohort has extensive phenotype and genome-wide genotyping data available.
[00137] Here, we build on previous genetic efforts across the 38 lab phenotypes to: 1) perform a systematic characterization of genetic architecture in >300,000 individuals including protein-altering and protein-truncating, non-coding, HLA, and copy number variants; 2) build phenome-wide associations for specifically implicated genetic variants; 3) evaluate causal relationships with 56 diseases and 90 medically relevant phenotypes; and 4) build prediction models from genome data.
Results
Phenotype distributions
[00138] We aimed to examine the consistency of the measurements themselves. Despite extensive quality control and significant effort to avoid batch effects, even subtle deviations from expectation can be highly significant in biobank scale datasets 1. First, we assessed the impact of medication status for individuals that were on lipid-lowering medication at baseline (2006-2010) by estimating changes in biomarker measurements between visits (a subset of 20,000 individuals that returned for a repeated assessment, Methods). Overall, the estimated adjustments typically agreed with literature estimates 2 (Table 1). After adjusting for medication status, we fit a regression model for 127 covariates including age, sex, urine and serum processing metrics, fasting time, and estimated sample dilution factor, along with numerous relevant interactions (see Methods). These covariates explained a range of 1% (Rheumatoid factor) to 81% (Testosterone) of the phenotypic variance observed (FIGS. 6, 7A-7C). Because estradiol, microalbumin in urine, and rheumatoid factor had a high proportion of values below the lower reportable range (80%, 68%, and 91%, respectively), which is to be expected given the age range of the UK Biobank population, we considered these values as ‘naturally low’ rather than missing, and treated these phenotypes as binary if they were above certain levels (higher than 212 pmol/L for estradiol 3, higher than 40 mg/L for microalbumin in urine, and higher than 16 IU/mL for rheumatoid factor) 4.
[00139] Furthermore, we derived the phenotype urine albumin-to-creatinine ratio (UACR, higher than 30 mg/g) 5 indicative of chronic kidney disease. Taking all the 38 lab phenotypes together we recover phenotype correlations previously estimated (Table 2, FIG. 8) 6.
Comparison of self-reported, diagnosed, medication, and lab-derived disease status
[00140] When comparing between studies, disease status is often evaluated in disparate ways and it is the challenge of many studies to reconcile the differences between reporting methods. In addition to lab results, UK Biobank has extensive self-reported disease and diagnosis, nurse interviews with participants, and inpatient and soon-to-be-released primary care diagnosis and medication codes. This affords a unique opportunity to evaluate the overlap between such measures in the definition of complex traits.
[00141] As a proof of principle, we used type 2 diabetes as our clinical outcome. Type 2 diabetes is characterized by progressive loss of insulin sensitivity and is diagnosed through HbA1c, a modification to red blood cells induced by long term exposure to high serum glucose. We compared self-reported, nurse collected diagnosis, medication (sulfonylureas, metformin, and other oral antidiabetic drugs), and serum glucose and HbA1c as measures of diabetes. As expected, HbA1c levels, regardless of residualization or adjustment for statins, were well correlated, and thresholded HbA1c (>48 mmol/mol or 6.5%, the clinical threshold for type 2 diabetes) was also similar (Pearson r = 0.72 with residualized, statin adjusted HbA1c). Glucose levels were not as predictive of diabetes status as HbA1c, and drug status (using definitions from Eastwood et al., excluding insulin) was similar to diabetes itself 8 (Pearson r = 0.79 with Eastwood et al. defined diabetes). Diagnosed diabetes was not similar, and we recommend using Eastwood et al.’s definition in the future (maximum r = 0.48 with Eastwood et al. defined type 2 diabetes, FIG. 9, Table 3).
Genetics of laboratory phenotypes
[00142] We performed association analysis between autosomal genetic variants and 38 lab measures in 318,984 unrelated White-British individuals and stratified the associations into three bins: 1) protein-truncating (27,816), 2) protein-altering (87,407), and 3) non-coding (1000 Genomes Phase 3 variants in Haplotype Reference Consortium [HRC], 9,444,561 10) (FIG. 2A). Comparison of effect sizes estimated across 42 comparison studies with 25 of the lab phenotypes shows overall high agreement (correlation greater than 0.5 for 33 comparisons, FIG. 10, Table 4 for comparison).
[00143] We adjusted the nominal association p values separately for each annotation bin using the Bonferroni procedure to correct for multiple hypothesis testing and identified over 7,000 significant associations (Bonferroni p < 1e-7 for coding variants [including protein-truncating and protein-altering], p < 5e-8 for non-coding, FIGS. 11-13, Tables 5-7). Genomic control values for single variant association results were between 1.017 and 1.70 for all 38 phenotypes, suggesting that population structure in our analysis is well-controlled 11.
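As an illustrative sketch only, and not necessarily the pipeline used in this study, a genomic control value of the kind reported above can be computed from a list of association p values as the median observed 1-degree-of-freedom chi-square statistic divided by its expected median under the null (approximately 0.455). The function name and input format are hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(p_values):
    """Return lambda_GC for two-sided p values strictly between 0 and 1."""
    norm = NormalDist()
    # A two-sided p value p corresponds to |z| = Phi^-1(1 - p/2);
    # the matching 1-df chi-square statistic is z squared.
    chi2 = [norm.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Values near 1 indicate little inflation of the test statistics; values well above 1 can reflect either confounding or genuine polygenic signal, so the statistic is usually read alongside other diagnostics.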
Coding variants influencing lab phenotypes
[00144] We evaluated the relationship of predicted protein-truncating and protein-altering variants across the 38 lab measures. We find 123 (48 rare, minor allele frequency [MAF] < 0.01) predicted protein-truncating and 2737 (253 rare) protein-altering allele associations outside the MHC region (chr6:25477797-36448354, Bonferroni p < 1e-7).
[00145] Genetic variants predicted to shorten the coding sequence of genes, termed protein-truncating variants (PTVs), are typically expected to have large effects on gene function. These variants are enriched for disease-causing mutations 12, 13, but some may be protective against disease 14, 15. In this study, we find that thirty-three (30 rare; MAF < 0.01) PTVs outside the MHC region have large estimated lowering effects (>0.1 sd) across at least one of the biomarkers, including: three PTVs in APOB with a range of strong effects on LDL (1.9-3.4 sd), Apolipoprotein B (2.2-2.8 sd), and triglycerides (1.3 sd); two PTVs in GPT with strong effects on alanine aminotransferase (>1.35 sd); PTVs in IQGAP2 and ALB with strong effects on albumin (>0.27 sd); three PTVs in GPLD1 (>0.85 sd) and a PTV in ALPL with effects on alkaline phosphatase (2.35 sd); PTVs in APOA5 (0.40 and 0.56 sd), CHFT8 (0.41 and 0.40 sd), and LCAT (1.34 and 1.48 sd) with effects on Apolipoprotein A and HDL; PTVs in ZNF229 (0.27 sd) with effects on Apolipoprotein B and a PTV in PDE3B with effects on Apolipoprotein B and triglycerides (0.27 and 0.40 sd, respectively); a frameshift indel in CST3 with a strong effect on Cystatin C levels (3.3 sd); a PTV in SAG with effects on direct bilirubin (0.54 sd); PTVs in SLC22A2 (0.34 sd) and RNF186 (0.26 sd) with effects on eGFR; PTVs with strong effects in RHAG (0.79 sd) and G6PC2 on glucose and HbA1c (0.27 and 0.17, respectively); a PTV in MSR1 with effects on IGF1 (0.11 sd); two PTVs in LPA with effects on Lipoprotein A levels (0.36, 0.39 sd); a PTV in TNFRSF13B with effect on non-albumin protein (0.49 sd); PTVs in ANGPTL8 and LPL with effects on triglycerides (0.46 and 0.20 sd); PTVs in DRD5, PDZK1 and SLC22A12 with strong effects on urate levels (0.13-1.13 sd); and a PTV in INSC with effect on vitamin D levels (0.35 sd) (FIG. 2A, FIG. 11).
Twenty-seven (22 rare; MAF < 0.01) PTVs outside the MHC region have large estimated raising effects (>0.1 sd, Table 5) across at least one of the biomarkers, including: PTVs in LIPC, PDE3B, and LPL with effects on Apolipoprotein A and HDL (> 0.18 sd); PTVs in FUT2 and RAP1GAP with effects on alkaline phosphatase (0.12 sd); a PTV in ABCG8 with effect on cholesterol (0.21 sd); PTVs in RNF186 and SLC22A2 with effect on creatinine (0.35 and 0.50 sd, respectively); PTVs in SLCO1B1 and UGT1A10 with effects on direct and total Bilirubin (0.37, 0.34, 0.40 sd, respectively); PTVs in RORC, SIGLEC1, and UPB1 with effects on gamma glutamyltransferase (0.21, 0.11, and 0.32 sd, respectively); a PTV in ANGPTL8 with effect on HDL (0.42 sd); PTVs in SLC22A1 and SLC22A2 with effects on lipoprotein A levels (0.38 and 1.19 sd, respectively); PTVs in COL4A4 with effects on microalbumin in serum and urine (1.95 and 0.68 sd), and urine albumin to creatinine ratio (2.29 sd); a PTV in HSPA6 with effect on total protein and non-albumin protein (0.14 sd); PTVs in APOA5 with effect on triglycerides (0.79 and 0.96 sd); PTVs in PYGM and SLC22A11 with effect on urate (0.16 and 0.14 sd); and PTVs in APOB, DHCR7, FLG, and NPFFR2 with large effects on vitamin D levels (0.89 and 0.11-0.19 sd, respectively).
[00146] Similarly, there were 264 (154 rare; MAF < 0.01) and 202 (109 rare; MAF < 0.01) protein-altering variants outside the MHC region that have large estimated lowering and raising effects (>0.1 sd) across at least one of the biomarkers, respectively (FIGS. 2A, 12, Table 6). These include a 0.05% MAF in-frame deletion in GOT1 with a 2.4 standard deviation effect on aspartate aminotransferase; multiple rare protein-altering variants in ABCA1 associating with large effects on HDL; a 0.2% MAF missense allele impacting the enzyme Acetyl-CoA carboxylase 2 (ACACB) with LDL, triglyceride, ApoB, and alkaline phosphatase lowering effects; multiple independent coding variants in ALPL associated with large alkaline phosphatase lowering effects; a 0.1% rare missense allele in the enzyme carnitine palmitoyltransferase 1A with a strong triglyceride lowering effect; multiple coding alleles in GPT, the alanine aminotransferase 1 gene, with strong alanine aminotransferase effects; a 0.1% rare missense allele in HKDC1, the enzyme Hexokinase domain containing 1, with lowering effects on HbA1c; a 0.1% rare missense allele in SLC34A3 with strong lowering effect on eGFR and raising effect on creatinine and Cystatin C; a 0.06% missense allele in the somatostatin receptor gene SSTR5 with a strong raising effect on IGF1; a 0.25% rare missense allele in CUBN with over 1 sd raising effect on UACR; multiple rare (MAF < 0.01%) missense variants in SLC22A12, a urate transporter, with strong urate lowering effects; missense variants in PCSK6 with LDL lowering effects; and a 0.02% rare missense allele in HNF4A with strong lowering effects on SHBG, Testosterone, and apolipoprotein A, among other examples. Together, the results from PTVs and protein-altering variants studied suggest that coding alleles will improve our interpretation of genetic associations.
HLA alleles influencing lab phenotypes
[00147] The human leukocyte antigen (HLA) region of the genome is one of the most polymorphic and gene-dense regions of the human genome, with on the order of thousands of alleles for any given gene in the region 16, 17. Here, we tested 175 imputed HLA alleles with a frequency of at least 0.1% in the UK Biobank dataset for association with the 38 phenotypes studied. We find 626 distinct associations (among 98 unique alleles and 30 unique phenotypes, p < 5e-5), and after conditional analysis using a Bayesian Model Averaging (BMA) approach (Methods, FIG. 14, Table 8), we find 69 instances (among 37 HLA alleles and 29 phenotypes) that have a posterior probability greater than 0.8 for model selection, lending further credence to these associations (Table 8, FIG. 14). Of note, HLA-B*08:01 (OR = 1.35), HLA-DRB1*03:01 (commonly referred to as HLA-DR3, OR = 1.35), and HLA-DRB1*07:01 (OR = 0.796) were assigned posterior probabilities of 100%, 100%, and 81.4%, respectively, for the abnormal rheumatoid factor (above 16 U/mL) phenotype. HLA-DR3 haplotypes are clearly important in disease predisposition for lupus, multiple sclerosis, and type 1 diabetes, and the HLA-B*08:01 allele is in linkage disequilibrium (LD) with the HLA-DRB1 allele (r² = 0.52, FIG. 14) 18. The HLA-DRB1*07:01 allelotype has previously been implicated in predisposition to asparaginase allergies 19, and in the UK Biobank dataset, we find evidence of protection against rheumatoid arthritis (p = 1.38e-39, BMA posterior probability = 0.81).
CNVs influencing lab phenotypes
[00148] Copy number variations (CNVs) constitute a significant fraction of the genetic differences between individuals, as measured by affected base pairs. Here, we compute the associations between 8,274 CNVs at minor allele frequency greater than 0.005% and the 38 lab phenotypes 20. We find 17 associations from 13 contributing CNVs (Bonferroni p < 0.05/10,000, Table 9). Of note, we find a 612-kb deletion in 16p11.2 to be associated with an estimated effect greater than 0.82 s.d. on HbA1c, C-reactive protein, and Cystatin-C lab phenotypes (p < 4.9e-7, FIG. 15A). This deletion has been noted as a microdeletion syndrome characterized by highly penetrant early onset obesity and autism spectrum disorder; it has also been estimated to have large effects on BMI in UK Biobank 20, 21.
[00149] We perform aggregate rare-variant burden tests, pooled by gene. Statistical methods for this analysis are the same as for variant-level association; here, the genotype is an indicator variable for an individual having a rare CNV (AF < 0.1%) overlapping within 10kb of the gene region as defined by HGNC, for 23,598 genes. We find a total of 29 unique 300kb windows containing genetic associations (Bonferroni p < 0.01/25,000; FIG. 2B, Table 10). The burden of rare CNVs overlapping HNF1B associates with Urea, eGFR, Creatinine, and Cystatin-C (p < 8.73e-13) and is estimated to have large effects on these lab measures (Beta = 0.77, -0.90, 0.93, 0.98 s.d., respectively). HNF1B is a membrane bound transcription factor part of the family of hepatocyte nuclear factors, believed to play a role in nephronal (renal) and pancreatic development. Previous studies have associated mutations in HNF1B with maturity onset diabetes of the young (MODY) and altered kidney function 22. Consistent with its developmental role and clinical associations, the rare CNVs overlapping HNF1B associate with renal/kidney failure in UK Biobank (p = 1.01e-7; OR = 4.94, SE = 0.30; FIG. 15B) 23, 24. This mutational burden spans both deletions and whole-gene duplications, suggesting that perturbation in either direction from baseline expression may alter phenotype.
[00150] We find that rare large effect CNVs overlapping GGT5 and CST3 respectively associate with Gamma glutamyltransferase and Cystatin-C (p = 6.3e-11, 1.58e-25; Beta = 0.84, 2.18 s.d.). GGT5 is key to glutathione homeostasis because it provides substrates for glutathione synthesis 25. Meanwhile, CST3 encodes the Cystatin-C protein, which belongs to the type II cystatin gene family and is a potent inhibitor of lysosomal proteinases 26. In contrast to the burden of variation affecting HNF1B, associations for both GGT5 and CST3 are driven by whole-gene duplication events, in line with the positive direction of effect for these associations. These results highlight the value of high-resolution analysis of copy number variation with potentially large effects on lab measurements.
Global and local heritability of lab phenotypes
[00151] To characterize the heritability of the 38 lab phenotypes we first applied LD-score regression 27. We further applied the Heritability Estimator from Summary Statistics (HESS), an approach for estimating the phenotypic variance explained by all typed SNPs at a single locus in the genome while accounting for LD among the SNPs 28, 29. Both LD-score regression and HESS indicate that common SNPs explain a large fraction of the heritability (0.38% to 18.49% across the studied phenotypes, FIG. 2D). We compare the polygenicity of all 38 lab phenotypes by computing the fraction of total SNP heritability attributable to the top 1% of SNPs. We find that 10 phenotypes have more than 25% of the heritability explained by the top 1% (Lipoprotein A 44%, Total and Direct bilirubin 31% and 30%) and the remaining 28 phenotypes show patterns of high polygenicity (FIG. 2C, Table 11, FIG. 16).
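The polygenicity comparison can be illustrated with a minimal sketch, assuming that per-locus (or per-SNP) heritability estimates, for instance from a local heritability estimator, are already available as a list of non-negative numbers. The function name and input layout are hypothetical and are not taken from the study.

```python
def top_fraction_share(local_h2, top_fraction=0.01):
    """Share of total heritability carried by the top `top_fraction` of units.

    `local_h2` holds one heritability estimate per locus or SNP. Estimates are
    assumed to be non-negative here; negative local estimates would need to be
    handled (for example truncated at zero) before calling this function.
    """
    if not local_h2:
        raise ValueError("local_h2 must not be empty")
    ranked = sorted(local_h2, reverse=True)
    # Always keep at least one unit so small inputs still give an answer.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)
```

A share close to `top_fraction` itself would indicate a highly polygenic trait, while a share far above it (as reported for Lipoprotein A) points to a few loci of large effect.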
Cell type decomposition of genetic effects and GREAT enrichment
[00152] We used tissue and cell type expression data to assess whether SNPs within epigenetic annotations of a given tissue/cell type are enriched for heritability 9. Overall, we find that 20 of the 38 lab phenotypes are at least 20-fold enriched (p < 1e-5) in either kidney or liver, highlighting the primary role of these tissues in the phenotypes we studied (FIG. 2E). Similarly, we find that 4 (Glucose, HbA1c, Urate, and Apolipoprotein B) of the 38 lab phenotypes are at least 15-fold enriched (p < 1e-5) in pancreas (FIG. 17). We examined the individual annotations that comprised the pancreas, liver, and kidney ChIP-seq experiments in the cell type groups and observed broadly consistent enrichments (FIG. 18). Further integration with single cell data may help refine the enrichment across these bulk tissues 30.
[00153] To assess the biological relevance of genes in regions proximal to the associated variants in animal models we applied the genomic region enrichment analysis tool (GREAT) 31 to the mouse genome informatics (MGI) phenotype ontology. For each of the 38 lab phenotypes, we found an enrichment for the mouse ontology consistent with the phenotypic description of the biomarker (p < 1e-7, Table 12).
Targeted phenome-wide association study
[00154] We performed a phenome-wide association analysis (PheWAS) to detect whether the variants that have been implicated may impact other diseases or commonly measured phenotypes (Table 13, FIGS. 19 and 20). We find a total of 218 associations across 86 phenotypes for 25 protein-truncating and 80 LD-independent protein-altering variants that were also associated with increased risk for 68 disease outcomes, and lower risk of 39 disease outcomes (p < 1e-5). Overall, these results demonstrate that variants with effects on lab measures have pleiotropic effects across diverse phenotypes.
Correlation of genetic effects between biomarkers, diseases, and medically relevant phenotypes
[00155] Given the widespread polygenicity and pleiotropy observed in the GWAS and PheWAS analysis, we then estimated global genetic correlation patterns between the biomarkers, diseases, and medically relevant phenotypes.
[00156] First, we applied LD-score regression to estimate genetic correlation 32 between the 38 lab phenotypes. Across the 703 pairwise combinations we find that 203 have significant non-zero correlation of genetic effects (p < 1e-3, FIG. 21), and strong correlation between normalized phenotypes not adjusted and adjusted for lipid-lowering therapy (FIGS. 22, 23).
[00157] Second, we applied LD score regression to an additional 146 summary statistics including 56 diseases and 90 previously published medically relevant phenotypes (FIG. 3A, Table 14). Overall, we find that between the 38 lab phenotypes and the 146 other phenotypes, there exist 1127 significant non-zero correlations of genetic effects (p < 1e-3).
Causal inference
[00158] The patterns of significant correlation of genetic effects between the 38 lab phenotypes and 123 diseases and medically relevant phenotypes raised the possibility that some of these associations may be causally relevant.
[00159] First, to estimate causal effects we used MR-Egger to perform Mendelian Randomization, using the genome-wide significant variants for each trait as instrumental variables 33. Using MR-Egger we find 86 causal relationships at an FDR of 10%, many of which are causal relationships to disease outcomes (Table 15).
[00160] It has been noted that Mendelian Randomization is confounded by genetic correlations reflecting shared etiology, and instruments with pleiotropic effects may limit the evaluation of causal relationships. Hence, to distinguish genetic correlation from causation we used the O’Connor and Price Latent Causal Variable (LCV) Model 34. Overall, we find that 49 of the 86 causal relationships inferred by Mendelian Randomization are recovered by LCV (FIG. 3B). 344 additional causal relationships are unique to LCV, highlighting potentially novel causal associations (Table 15). Many of these are well described, such as that of LDL on coronary artery disease and angina, which we estimate at 0.34 log odds change per standard deviation (LOR/SD; LCV genetic causal percent [GCP] 0.75) and 0.313 LOR/SD (GCP 0.8), respectively. Our large sample size provides unique genetic evidence supporting the effect of calcium levels on kidney stones (0.625 LOR/SD and GCP 0.48), consistent with existing epidemiological reports 35. We also discovered novel associations, such as that of AST levels on hernia (-0.319 LOR/SD, GCP 0.14).
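For readers more used to odds ratios, the effect sizes quoted above in log odds per standard deviation can be converted by exponentiation. The helper below is a small illustrative sketch (the function name is hypothetical); under this conversion, 0.34 LOR/SD for LDL on coronary artery disease corresponds to an odds ratio of roughly 1.4 per standard deviation, and a negative value such as -0.319 LOR/SD corresponds to an odds ratio below 1.

```python
import math


def odds_ratio_per_sd(log_odds_ratio_per_sd, n_sd=1.0):
    """Convert a log odds ratio per standard deviation into an odds ratio.

    `n_sd` scales the exposure change, e.g. n_sd=2 gives the odds ratio
    for a two standard deviation increase in the biomarker.
    """
    return math.exp(log_odds_ratio_per_sd * n_sd)
```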
Polygenic prediction within and across populations
[00161] These large-scale population datasets offer a unique opportunity to build predictive models from genome data alone for the 38 lab phenotypes 36. In high-dimensional regression problems, we have a large number of predictors, and identifying the subset that has a relationship with the response is useful for prediction. Here, we used an iterative version of the Lasso 37, 38, a commonly used method for simultaneous estimation and variable selection (Methods). For all the 38 lab measurements, we split the White British cohort into 60% training, 20% validation, and 20% test sets, and evaluated the trained model with a common measure of goodness-of-fit: R² for the continuous phenotypes, or the area under the receiver operating characteristic curve (ROC-AUC) for the binary phenotypes (FIG. 4A, Table 16). The inclusion of the lasso coefficients for all these phenotypes is shown in the form of “lake” plots showing shrinkage and variable selection in the Supplement (FIG. 24). We observed concordance between the predicted and observed phenotypes (FIG. 4B-D). We found two clusters of individuals in the predicted Lipoprotein A level, which reflects the large contribution of the LPA gene (FIG. 24A). To assess whether the prediction performance could translate to other populations we compared the predicted genetic scores to the measured and derived values across the 38 lab phenotypes in 24,131 Non-British White, 6,951 South Asian, 1,950 East Asian, and 6,056 African individuals in the UK Biobank. Overall, we estimate median reductions in predictive performance of 6.4%, 30.0%, 43.2%, and 72.0% in these populations, respectively, possibly limiting the portability of the models to these populations (FIG. 25, Tables 16, 17) 39.
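The data split and the portability summary described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the random seed, the function names, and the definition of "reduction" as one minus the ratio of target-population to reference-population performance are all assumptions.

```python
import random
from statistics import median


def split_cohort(ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle individual IDs and split them into train/validation/test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    # Whatever remains after training and validation forms the test set.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def median_relative_reduction(reference_scores, target_scores):
    """Median over phenotypes of 1 - (target performance / reference performance)."""
    return median(1.0 - t / r for r, t in zip(reference_scores, target_scores))
```

Here `reference_scores` and `target_scores` would hold one performance value per phenotype (for example R² in the held-out White British test set and in another population, in the same phenotype order).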
[00162] Finally, we applied a PRS-PheWAS and asked whether any enrichment of disease prevalence in the UK Biobank is observed at the tails of the distribution of predicted values (FIG. 5A). We calculated the top and bottom 0.1% and 0.1-1% bins of White British individuals by polygenic score and compared them to the 40-60% center of the distribution with a Fisher exact test.
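The comparison of a tail bin against the 40-60% center bin can be illustrated with a standard two-sided Fisher exact test on a 2x2 table of cases and non-cases. The implementation below is a self-contained sketch using only the Python standard library; the table layout and function name are assumptions, and a statistical package would normally be used at biobank scale.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    For example: a, b = cases and non-cases in a tail bin of the polygenic
    score; c, d = cases and non-cases in the center bin.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        # with all table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one (small relative tolerance guards against float ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))
```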
[00163] Next, we took these same quantile bins, further divided, and averaged the prevalence of diseases within the White British cohort. We observed strong associations between polygenic score for the biomarkers and type 2 diabetes beyond T2D PRS alone (FIG. 5B). We found that the highest scores correspond to a nearly 80% type 2 diabetes prevalence (compared to ~22% with the T2D PRS alone) in a model derived completely from polygenic scores. The same was true for kidney failure, where combination of the trait PRS with those of our labs resulted in the highest risk individuals having a nearly 30% prevalence (compared to 5%, FIG. 5C).
Multiple regression with polygenic risk scores for laboratory tests improves prediction of traits and diseases
[00164] Finally, we hypothesized that, particularly for traits and diseases with small sample sizes, using polygenic risk scores from biomarkers in addition to the score for the trait itself might improve prediction. We began by testing liver fat percentage (LFP), a quantitative measure derived from costly MRI images of the liver. Liver fat is driven by a combination of alcohol use and metabolic disorder 40. Only 4,617 individuals thus far have quantified LFP in UK Biobank.
[00165] First, we ran ordinary least squares, predicting LFP from covariates (including alcohol and interactions; see Methods). Covariates were moderately effective at predicting liver fat percentage (adjusted r-squared 0.024, Table 18), and adding our SNPnet-derived polygenic risk score increased the predictive power substantially (FIG. 5D, Tables 18, 19, adjusted r-squared 0.050, F test p < 1e-10). Adding the 38 biomarker PRSs to the regression improved predictive capacity further (0.087, F test vs LFP PRS alone p < 1e-10, Tables 18, 19). Interestingly, the PRSs for Alanine aminotransferase, sodium in urine, urate, SHBG, and triglycerides all had significant coefficients (Table 20), supporting the previously described notion that complex interplay between organ systems might contribute to LFP 41.
[00166] We further examined T2D, myocardial infarction (MI), acute myocardial infarction (AMI), kidney and liver cancer, all of which had improved predictive power in the presence of the biomarker PRSs (FIG. 5F, Table 21), but which varied in total predictive power and in improvement from inclusion of the biomarkers. Surprisingly, even in the case of type 2 diabetes, where we used a PRS derived from almost 900,000 individuals, there were still substantial gains in performance through the use of biomarkers, which persisted after the removal of HbA1c and Glucose PRSs 42. This suggests that multiple regression of polygenic risk for laboratory tests might capture multiple underlying disease states, similar to what has recently been reported in PRS of disease 43.
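The two quantities used in the nested model comparison above, adjusted r-squared and the F statistic for adding predictors to a linear model, follow standard textbook formulas. The sketch below shows them with hypothetical function and argument names; it computes only the F statistic, since obtaining the p value requires an F distribution function that the Python standard library does not provide.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r-squared for a model with p predictors fit on n samples."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def nested_f_statistic(r2_reduced, r2_full, n, p_full, q):
    """F statistic for adding q predictors to reach a full model of p_full predictors.

    The statistic has q and (n - p_full - 1) degrees of freedom under the
    null hypothesis that the q added coefficients are all zero.
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1.0 - r2_full) / (n - p_full - 1)
    return numerator / denominator
```

In the setting described above, the reduced model would contain the covariates (and, for the second comparison, the LFP PRS), and `q` would be 1 when adding the LFP PRS or 38 when adding the biomarker PRSs.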
Discussion
[00167] By drawing on data from 38 directly measured or derived lab phenotypes in over 300,000 individuals in the UK Biobank study, we provide a systematic assessment of genetic variant associations, their disease relevance, and predictive performance. Furthermore, this study indicates that a subset of the 38 lab phenotypes play a causal role in disease predisposition.
[00168] We find strong enrichment of heritability across the genome to kidney and liver tissue and to mouse knockout ontology terms. Furthermore, we estimate global and local heritability, suggesting different patterns of genetic architecture across the 38 phenotypes.
[00169] Detailed analysis of HLA alleles, copy number, and protein-altering and protein-truncating variants highlights potential drug targets, including variants with moderate to strong effects (> 0.1 SD effect), which are quite rare in analyses of common non-coding variation.
[00170] Predictive models including polygenic risk scores for biomarkers in addition to trait PRS highlight the potential that exists in deriving joint predictive models based on training on multiple responses, which we anticipate will improve resolution in dissecting drivers of disease risk in an individual. Integration with independent population biobanks will help elucidate the extent to which these combined risk models can be transferred.
[00171] The genome-wide resource made available with this study provides a starting point for cataloging variants affecting the 38 lab phenotypes, and larger datasets across underrepresented populations may improve their clinical relevance. These results highlight the benefits of direct biomarker measurements for interpretation of genetic variation.
Methods
Genotype and phenotype data
[00172] We used genotype data from the UK Biobank dataset release version 2 and the hg19 human genome reference for all analyses in the study 44. To minimize the variability due to population structure in our dataset, we restricted our analyses to unrelated individuals based on the following four criteria reported by the UK Biobank in the file “ukb_sqc_v2.txt”:
1. used to compute principal components (“used_in_pca_calculation” column)
2. not marked as outliers for heterozygosity and missing rates (“het_missing_outliers” column)
3. do not show putative sex chromosome aneuploidy (“putative_sex_chromosome_aneuploidy” column)
4. have at most 10 putative third-degree relatives (“excess_relatives” column).
[00173] Using self-reported ancestry (UK Biobank field ID: 21000), we analyzed 5 subpopulations in the study: White British (n = 337,151 individuals), African (6,497), East Asian (2,061), South Asian (7,363), and Non-British White (26,471). We subsequently focused on a subset of individuals with non-missing values for covariates as described below.
[00174] We annotated variants using the VEP LOFTEE plugin (github.com/konradjk/loftee) and performed variant quality control by comparing allele frequencies in the UK Biobank and gnomAD (gnomad.exomes.r2.0.1.sites.vcf.gz) as previously described 15. We focused on variants outside of the major histocompatibility complex (MHC) region (chr6:25477797-36448354) and performed LD-pruning using PLINK with "--indep 50 5 2" as previously described 15, 45. The LD-pruned sets are used for the targeted PheWAS analysis described below.
[00175] We focused on 32 laboratory measurement phenotypes (UKBB field ID column in Table 2) and also defined two derived phenotypes, estimated glomerular filtration rate (eGFR) and non-albumin proteins. The eGFR measure is an indicator of renal function and is defined by the CKD-EPI equation 46. We defined non-albumin protein as the difference between the total protein and albumin. Given the dominance of signals below the detection limit for some laboratory measures, we additionally defined four binary phenotypes:
Urine albumin to creatinine ratio higher than 30 mg/g
Microalbumin higher than 40 mg/L
Rheumatoid factor higher than 16 IU/mL
Estradiol higher than 212 pmol/L
[00176] We treated individuals beyond the detection limit for those laboratory measurements as cases in those four binary phenotypes, and below the detection limit as controls, as reported by the corresponding reportability fields.
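The four binary phenotype definitions above can be expressed as a small lookup of thresholds. The sketch below is illustrative only: the dictionary keys, the function name, and the way a below-detection-limit flag is passed in are assumptions, not the UK Biobank field layout.

```python
# Thresholds taken from the four binary phenotype definitions above.
# The key names are hypothetical labels chosen for this example.
THRESHOLDS = {
    "urine_albumin_to_creatinine_ratio": 30.0,  # mg/g
    "microalbumin": 40.0,                       # mg/L
    "rheumatoid_factor": 16.0,                  # IU/mL
    "estradiol": 212.0,                         # pmol/L
}


def to_binary(phenotype, value, below_detection_limit=False):
    """Return 1 (case), 0 (control), or None (no usable measurement)."""
    if below_detection_limit:
        # Values flagged as below the reportable range count as controls.
        return 0
    if value is None:
        return None
    return int(value > THRESHOLDS[phenotype])
```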
Statin identification and LDL adjustment
[00177] We reviewed the medications taken by one or more participants in the UK Biobank and identified 13 medication codes corresponding to statins (1141146234, atorvastatin; 1141192414, crestor 10mg tablet; 1140910632, eptastatin; 1140888594, fluvastatin; 1140864592, lescol 20mg capsule; 1141146138, lipitor 10mg tablet; 1140861970, lipostat 10mg tablet; 1140888648, pravastatin; 1141192410, rosuvastatin; 1141188146, simvador 10mg tablet; 1140861958, simvastatin; 1140881748, zocor 10mg tablet; 1141200040, zocor heart-pro 10mg tablet). We then identified 1,427 participants with laboratory test data who were not taking a statin upon enrollment (years 2006-2010), but who were taking a statin at the time of the first repeat assessment visit (years 2012-2013). For each participant, we divided their on-statin laboratory test measurement by their pre-statin laboratory test measurement. The mean of this ratio was considered to be the statin correction factor within the UK Biobank. For all individuals who were taking statins upon enrollment, we divided their on-statin measurement by the correction factor to yield an adjusted laboratory test value. For all traits, we calculated a P value from a Wilcoxon signed rank test for paired samples comparing whether the pre- and on-statin values were significantly different, and only traits with a significant non-zero effect were adjusted for statins. The following list of statins was identified in the UK Biobank for the purposes of adjusting by the estimated factor: 1140861958, simvastatin; 1140888594, fluvastatin; 1140888648, pravastatin; 1141146234, atorvastatin; 1141192410, rosuvastatin; 1140861922, lipid lowering drug; 1141146138, lipitor 10mg tablet.
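The correction procedure described in this paragraph reduces to a mean of paired ratios followed by a division. The following sketch uses hypothetical function names and made-up example values; it omits the Wilcoxon signed rank test that decides whether a trait is adjusted at all.

```python
from statistics import mean


def statin_correction_factor(pre_statin, on_statin):
    """Mean of per-participant (on-statin / pre-statin) measurement ratios.

    The two lists must be paired: position i in each list refers to the
    same participant before and after starting a statin.
    """
    ratios = [on / pre for pre, on in zip(pre_statin, on_statin)]
    return mean(ratios)


def adjust_for_statin(on_statin_value, factor):
    """Estimate the off-statin value for someone on a statin at enrollment."""
    return on_statin_value / factor
```

For example, if measurements fall to 70% of their pre-statin level on average, the factor is 0.7, and an on-statin value of 2.8 is adjusted to approximately 4.0.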
Covariates correction
[00178] Raw UK Biobank phenotypes for all reported individuals (excluding out of range and QC failed measurements) were fit with linear regression against the 127 covariates. These included demographics (age, sex, age*sex, age^2), population structure (the top 40 principal components and indicators for each of the assessment centers in the UK Biobank), temporal variation (indicators for each month of participation, with the exception that all of 2006 and August through October of 2010 were considered single months), socioeconomic status indicators (Townsend deprivation indices and interactions with age and sex), the genotyping array used, and technical confounders (blood draw time and its square and interactions with age and sex; urine sample time and its square and interactions with age and sex; sample dilution factor; fasting time, its square, and interactions with age and sex; and interactions of blood draw time and urine sample time with dilution factor). The residual from this regression was inverse normal transformed using the Blom transform and then used as the tested outcome. For sensitivity analysis, we also applied covariate transformation with just the White British individuals and obtained similar estimates.
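The Blom transform mentioned above maps ranks to normal quantiles using the offset (r - 3/8) / (n + 1/4), where r is the rank and n is the number of observations. The sketch below applies it to a list of regression residuals; the function name is hypothetical, and the use of average ranks for tied values is an assumption, since the text does not state how ties were handled.

```python
from statistics import NormalDist


def blom_inverse_normal(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform with Blom's offset (c = 3/8)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank for the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    norm = NormalDist()
    # With c = 3/8 the denominator n - 2c + 1 equals n + 1/4.
    return [norm.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]
```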
Genome-wide association analysis
[00179] We performed association analyses using imputed 1000 Genomes Phase I variants (for non-coding variants), directly genotyped variants on array (for protein-truncating and protein-altering variants), HLA allelotypes, and copy number variations (CNVs).
GWAS of imputed 1000 Genomes Phase I variants
[00180] We ran a GWAS without covariates on the residuals computed above, using PLINK v2.00aLM with the following parameters:
--glm cols=chrom,pos,ref,alt,altfreq,firth,test,nobs,orbeta,se,ci,t,p hide-covar --pgen <imputed PGEN> --remove <non-White British individuals> --keep <all individuals, males, or females> --geno 0.1 --hwe 1e-50 midp
[00181] For binary traits, a logistic GLM was fit directly with age, sex, genotyping platform, 10 PCs, age^2, and fasting time as covariates, using the --vif 999 parameter to avoid collinearity failures from including both age and age^2. Sex was excluded for sex-specific GWAS.
[00182] We further ran a subset of continuous traits on array variants directly in plink2-20190402 with the 127 covariates (and --vif 9999 to avoid collinearity concerns); both approaches resulted in heritability estimates equivalent to the globally covariate-corrected results.
Derivation of independent loci
[00183] Once we ran the GWAS, full summary statistics were clumped to r^2 > 0.1 using the following clump command:
plink1.9 --bfile <1000G Phase 3 European plink file> --clump <summary statistics> --clump-p1 1e-6 --clump-p2 1e-4 --clump-r2 0.1 --clump-kb 10000 --clump-field P --clump-snp-field ID
[00184] These were then further filtered such that any SNPs within 0.1 cM of each other were considered part of the same association signal, with the cM annotation derived from the 1000G Phase 3 European samples (n = 489) 27; among variants within 0.1 cM of each other, only the variant with the minimum p-value was retained. For the final results, all lead variants with p < 5e-8 were kept for the Mendelian randomization analyses.
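The centimorgan-based filtering step can be sketched in Python as below. This is an illustrative reading of the procedure and not the code used in the study: the function name and tuple layout are hypothetical, and the sketch assumes that signals are formed by chaining neighbouring SNPs on the same chromosome whose genetic positions differ by at most 0.1 cM.

```python
def lead_variants_by_cm(hits, window_cm=0.1, p_threshold=5e-8):
    """Group clumped hits into signals by genetic distance; keep the lowest-p lead.

    hits: iterable of (snp_id, chrom, cm_position, p_value) tuples.
    """
    leads = []
    current = []
    for hit in sorted(hits, key=lambda h: (h[1], h[2])):
        # Start a new signal on a chromosome change or a gap larger than the window.
        if current and (hit[1] != current[-1][1] or hit[2] - current[-1][2] > window_cm):
            leads.append(min(current, key=lambda h: h[3]))
            current = []
        current.append(hit)
    if current:
        leads.append(min(current, key=lambda h: h[3]))
    # Keep only genome-wide significant lead variants.
    return [lead for lead in leads if lead[3] < p_threshold]


# Toy clumped hits (hypothetical identifiers and positions).
hits = [
    ("rs1", "1", 10.00, 1e-9),
    ("rs2", "1", 10.05, 1e-12),
    ("rs3", "1", 10.12, 1e-8),
    ("rs4", "1", 11.00, 1e-10),
    ("rs5", "2", 10.05, 1e-20),
]
leads = lead_variants_by_cm(hits)
lead_ids = [lead[0] for lead in leads]
```

Here rs1, rs2, and rs3 chain into one signal led by rs2, while rs4 and rs5 each form their own signal.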
[00185] In order to report independent signals, we ran the following plink command:
plink --bfile <1000G Phase 3 European plink file> --extract <all unique hit SNPs, n = 6269> --indep 50 5 2
and counted the number of independent SNPs it reported.
GWAS on coding variants on genotyping array
[00186] Univariate association analyses for single variants were applied to the 38 phenotypes independently using PLINK v2.00aLM (2 April 2019). For binary phenotypes, we performed Firth-fallback logistic regression as previously described 15. For the residuals of the quantitative phenotypes after adjusting for the 127 covariates, we applied generalized linear model association analysis. Using a p-value threshold of 1e-7, we identified significant associations across 123 protein-truncating variants and 2,736 protein-altering variants.
Cascade plot visualization of coding and non-coding variants
[00187] We visualized the minor allele frequency and effect size estimates in a series of cascade plots. For protein-truncating and protein-altering variants, we focused on genome-wide significant associations with p < 1e-7 and annotated the corresponding gene symbols for variants with absolute value of beta greater than 0.1 (outliers). For non-coding variant associations characterized on the imputed 1000 Genomes Phase I variants, we focused on the clumped set of associations (described above in the “Derivation of independent loci” section) and applied the following procedure to determine and highlight the outliers:
Fit a univariate linear regression model with the absolute value of the effect size estimate (BETA or log odds ratio) as the response and the log of minor allele frequency as the predictor.
Find the residuals from the regression model and compute the mean and standard deviation of the residuals.
[00188] We defined an association as an outlier on the cascade plot if and only if its residual from the regression model above falls outside the range of the mean plus or minus 1 SD.
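The outlier rule above can be sketched in Python as follows. The function name and toy data are hypothetical; the sketch assumes ordinary least squares for the univariate fit and the sample standard deviation for the residuals, neither of which is specified in the text.

```python
import math
from statistics import mean, stdev


def cascade_outliers(betas, mafs):
    """Flag associations whose |beta| ~ log(MAF) residual lies outside mean +/- 1 SD."""
    x = [math.log(m) for m in mafs]
    y = [abs(b) for b in betas]
    x_bar, y_bar = mean(x), mean(y)
    # Ordinary least squares slope and intercept for a single predictor.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mu, sd = mean(residuals), stdev(residuals)  # sample SD is an assumption
    return [abs(r - mu) > sd for r in residuals]


# Toy associations (hypothetical values); the third has a much larger effect.
mafs = [math.exp(-k) for k in range(1, 6)]
betas = [0.02, -0.02, 0.30, 0.02, -0.02]
flags = cascade_outliers(betas, mafs)
```

Only the third association is flagged in this toy example, because its residual is the only one more than one standard deviation from the mean residual.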
Association and Bayesian model averaging analyses for HLA allelotypes
[00189] The HLA data from the UK Biobank contains all HLA loci (one line per person) in a specific order (A, B, C, DRB5, DRB4, DRB3, DRB1, DQB1, DQA1, DPB1, DPA1). We downloaded these values, which were imputed via the HLA:IMP*2 program (Resource 182 - CITE); the Biobank reports one value per imputed allele, and only the best-guess alleles are reported. We filtered all 362 alleles from the Biobank based on whether or not alleles are present in more than 0.1% of the population surveyed; 175 of the 362 alleles passed this filter.
[00190] We performed association analysis for our 38 phenotypes and the 175 HLA alleles using PLINK v2.00aLM (2 April 2019). We subsetted our data to include only White British individuals (n = 337,151). We used age, sex, and the first four genotype principal components as covariates, running generalized linear models for quantitative (inverse-normalization transformed) traits and generalized linear models with a Firth-fallback method for binary traits.
[00191] To identify the HLA alleles that were not associated with a particular phenotype simply due to LD, we used the Bayesian Model Averaging (BMA) technique, implemented in the ‘BMA’ R package [cran.r-project.org/web/packages/BMA/BMA.pdf]. Bayesian Model Averaging is a model selection method that trains a variety of models, one on each possible subset of alleles. The posterior probability of each model being the correct one given the data is determined, and subsequently, a BIC per model is calculated. The degree to which an allele is included across models (posterior probability) is then deemed a measure of confidence in the association between allele and phenotype.
[00192] We first filtered the allele dosage file to those columns that were not sparse, making sure that each allele in the analysis had more than 5 entries. We then identified all of the allele-phenotype pairs that had BY-adjusted p-values less than 0.05 from the PLINK analysis. If there were more than 10 alleles below this threshold for a given phenotype, we used the 10 alleles with the lowest adjusted p-values in order to maintain computational tractability. If there were fewer than two such alleles for a given phenotype, we did not run BMA for that phenotype. These requirements filtered our testing base down to 33 phenotypes, with 56 alleles included in at least one analysis.
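As an illustration of the allele pre-selection described above, the following Python sketch computes Benjamini-Yekutieli (BY) adjusted p-values and then applies the 0.05 threshold, the cap of 10 alleles, and the minimum of two alleles. The function names and example values are hypothetical, and the sketch assumes the adjustment is applied to whatever set of p-values the caller supplies; the text does not state the exact set of tests over which the adjustment was made.

```python
def benjamini_yekutieli(p_values):
    """Return BY-adjusted p-values in the original order."""
    m = len(p_values)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step down from the largest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


def select_alleles_for_bma(adjusted_p_by_allele, alpha=0.05, max_alleles=10):
    """Keep up to max_alleles significant alleles; return [] if fewer than two."""
    significant = sorted((p, a) for a, p in adjusted_p_by_allele.items() if p < alpha)
    if len(significant) < 2:
        return []
    return [allele for _, allele in significant[:max_alleles]]


# Toy p-values (hypothetical).
adj = benjamini_yekutieli([0.001, 0.02, 0.5])
chosen = select_alleles_for_bma({"A*01:01": adj[0], "B*08:01": adj[1], "C*07:01": adj[2]})
```

With these toy values only one allele remains below 0.05 after adjustment, so no BMA run would be set up for the phenotype.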
[00193] In order to maintain computational tractability, only models whose posterior model probability was within a factor of 1/5 of that of the best model were kept for the final analysis. We focused on alleles with posterior probabilities > 0.8 based on our BMA analysis. We ran BMA with a binomial error distribution and link function for binary traits and Gaussian ones for quantitative traits.
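The model-pruning and posterior inclusion steps can be sketched in Python as below. This is not the BMA R package's implementation. It is a simplified illustration under several stated assumptions: posterior model probabilities are approximated from BIC as proportional to exp(-BIC/2) with equal prior weight on every model, "within a factor of 1/5" is read as keeping models whose posterior probability is at least one fifth of the best model's, and probabilities are renormalized over the retained models. All names and values are hypothetical.

```python
import math


def posterior_inclusion(models, window=1 / 5):
    """models: list of (set_of_alleles, bic). Returns allele -> inclusion probability."""
    best_bic = min(bic for _, bic in models)
    # BIC approximation to relative posterior model probability (uniform model prior).
    weights = [(alleles, math.exp(-0.5 * (bic - best_bic))) for alleles, bic in models]
    # Occam's window: drop models far less probable than the best one.
    kept = [(alleles, w) for alleles, w in weights if w >= window]
    total = sum(w for _, w in kept)
    inclusion = {}
    for alleles, w in kept:
        for allele in alleles:
            inclusion[allele] = inclusion.get(allele, 0.0) + w / total
    return inclusion


# Toy model space over two alleles (hypothetical BIC values).
models = [({"A"}, 100.0), ({"A", "B"}, 101.0), ({"B"}, 110.0)]
pip = posterior_inclusion(models)
high_confidence = sorted(a for a, p in pip.items() if p > 0.8)
```

In this toy example the third model falls outside the window, allele A appears in every retained model, and allele B has an inclusion probability of about 0.38, so only A exceeds the 0.8 threshold.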
[00194] We used the in-built 'imageplot.bma' function to produce the plots of the model architectures, and report allele, phenotype, posterior mean effect size, standard deviation of said effect size, and the posterior probability that the effect is not equal to 0.
Copy number variations
[00195] CNVs were called by applying PennCNV v1.0.4 to raw signal intensity data from each array within each genotyping batch as previously described 20, with the notable difference that here all analyses were conducted within the White British unrelated cohort described above. Data for phenome-wide associations were derived from UK Biobank data fields corresponding to body measurements, biomarkers, disease diagnoses, and medical procedures from medical records, as well as a questionnaire about lifestyle and medical history. Methods for CNV GWAS and burden testing are as previously described. We computed generalized linear models using the PLINK v2.00a (31 Mar 2018) --glm option. Quantitative traits were rank-normalized prior to analysis, using the --pheno-quantile-normalize flag. No covariates were specified for analysis of quantitative traits, as only the covariate-adjusted phenotypes were used for this analysis. For binary traits, we used the Firth-fallback modifier with the following covariates: age, sex, and four marker-based population genetic principal components from UK Biobank’s PCA calculation. For burden tests, we added the number and total length of CNVs as covariates for both binary and quantitative traits. See the “GWAS on coding variants on genotyping array” section for further description of PLINK’s implementation of these model specifications.
Heritability estimates
LD score regression
[00196] We used the default LD scores from the 489 unrelated European individuals in 1000 Genomes as our reference. We converted our summary statistics to LDSC format using munge_sumstats, munging against the set of 1000 Genomes Phase I variants with calls of an ancestral allele in 1000 Genomes Phase III. We ran ldsc.py with the following parameters:
ldsc.py --h2 <trait summary statistics> --ref-ld-chr <ldsc/1000G.EUR.QC/> --w-ld-chr <ldsc/weights_hm3_no_hla/weights.>
HESS
[00197] We performed standard stage 1 fitting 28, then removed all regions that contained no SNPs with MAF > 5% (5/~1700 bins genome-wide) and generated stage 2 estimates from the resulting matrices. We used the same munged sumstats described above. We confirmed heritability estimates of select associations using GCTA-GREML and genotyped array variants on a subset of individuals (data not shown) to ensure estimates were comparable to this model.
Enrichment analyses of association signals
Cell-type enrichment analysis
[00198] We ran partitioned LD score regression with the 53 baseline annotations and included all 10 cell type annotations and the Roadmap control regions 9. The exact command was:
ldsc.py --h2 <trait sumstats> --ref-ld-chr ldsc/1000G_EUR_Phase3_baseline/baseline.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.1.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.2.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.3.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.4.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.5.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.6.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.7.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.8.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.9.,ldsc/1000G_Phase3_cell_type_groups/cell_type_group.10.,ldsc/ldscores/Roadmap/Roadmap.control --w-ld-chr ldsc/weights_hm3_no_hla/weights. --overlap-annot --frqfile-chr ldsc/1000G_frq/1000G.mac5eur.
[00199] We also ran each of the 394 annotations independently, including just the Roadmap control and baseline annotations as covariates 47.
GREAT enrichment analysis
[00200] We applied the genomic region enrichment analysis tool (GREAT version 4.0.3) to LD-clumped summary statistics of each trait. We used the mouse genome informatics (MGI) phenotype single knock-out ontology, which contains manually curated knowledge about the hierarchical structure of phenotypes and genotype-phenotype mapping in mouse 48. We downloaded their ontologies on 2017-09-28 and mapped MGI gene identifiers to Ensembl human gene IDs through unambiguous one-to-one homology mapping between human and mouse Ensembl IDs. We removed ontology terms that were labelled as “obsolete”, “bad”, or “unknown” from our analysis. As a result, we obtained 554,948 mapping annotations spanning 9,408 human genes and 9,149 mouse phenotypes. For the summary statistics of each GWAS analysis, we applied LD clumping with --clump-p1 1e-3 and other default parameters in PLINK v1.90b6.7 (2 Dec 2018) 49. We selected the top 5,000 independent significant variants and their ties and performed GREAT enrichment analysis with minimum and maximum annotation counts of 30 and 500, a minimum binomial fold enrichment of 2, and the default parameters as described elsewhere 31. Since we included the non-coding variants in the analysis, we focused on the GREAT binomial genomic region enrichment analysis, quantified the significance of enrichment in terms of binomial fold enrichment and binomial p-value, and visualized the top 30 enriched terms.
Targeted Phenome-wide association analysis
[00201] We prioritized the following sets of variants for targeted phenome-wide association analysis.
Protein-truncating variants with at least one significant association (P < 1e-7) with the 38 biomarker phenotypes
Protein-altering variants with at least one significant association (P < 1e-7) with the 38 biomarker phenotypes
[00202] For each set of variants, we used Global Biobank Engine (GBE) to query significant associations (P < 1e-5) across previously reported binary phenotypes 15,45. We visualized association results for both the 38 laboratory measurement traits and the GBE traits as heatmaps and sorted variants and phenotypes with hierarchical clustering.
Correlation of genetic effects across relevant phenotypes
[00203] We used LD score regression in genetic correlation mode to estimate genetic correlation effects between biomarkers and other traits. The exact arguments were:
ldsc.py --rg <traits> --ref-ld-chr ldsc/1000G.EUR.QC/ --w-ld-chr ldsc/weights_hm3_no_hla/weights.
Causal inference
[00204] We ran LCV with a number of LDSC-formatted summary statistics files. For MR, we used TwoSampleMR to calculate MR Egger regressions and perform trait munging 50. To scale betas from the Neale lab binary trait outcomes, we considered the prevalence of the trait and the raw Mendelian randomization beta (with units of change of outcome per standard deviation change in the laboratory test), and calculated the log odds ratio as log((prev + beta) / prev). In our simulations (data not shown), this is approximately equivalent to logistic regression across a range of prevalences and betas. All MR and LCV results were jointly adjusted for multiple comparisons using the standard BH FDR algorithm (at 10%). Network visualization of the results was done using Cytoscape 51,52.
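The beta-scaling formula above, log((prev + beta) / prev), can be written as a short Python helper. The function name, the input checks, and the example values are hypothetical additions for illustration only.

```python
import math


def scaled_log_odds_ratio(prevalence, beta):
    """Scale a linear-model beta for a binary outcome as log((prev + beta) / prev)."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    if prevalence + beta <= 0.0:
        raise ValueError("prevalence + beta must be positive")
    return math.log((prevalence + beta) / prevalence)


# Toy example (hypothetical): 5% prevalence, beta of +0.01 per SD of the lab test.
log_or = scaled_log_odds_ratio(0.05, 0.01)
```

For this toy input the scaled value is log(1.2), or about 0.18.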
Mendelian randomization
[00205] Mendelian randomization methods enable estimation of causal effects between an exposure X and an outcome Y. Given a set of genetic instruments of X (i.e., direct causes of X that are not affected by confounders), the causal effect of X on Y can be extracted by analyzing their associations with both X and Y. Most methods are based on linear models and start with a 2D plot of the association summary statistics. A meta-analysis is then used to estimate whether there is a significant correlation between the effects, which then translates into a line whose slope reveals the causal effect. MR-Egger is a powerful method that uses Egger regression for the meta-analysis 5. Egger regression was developed originally for correcting publication bias in meta-analyses, but the problem is analogous to adjusting for bias from pleiotropy in the MR setting. Thus, Egger regression provides a way to both estimate and adjust for biases in the 2D plot that originate from pleiotropic effects (under the assumption that the association of each genetic instrument with the exposure is independent of the pleiotropic effect of the variant).
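A minimal Python sketch of the Egger regression idea is given below. It is not the TwoSampleMR implementation; it only shows the point estimates from a weighted linear regression of outcome effects on exposure effects with an intercept, after orienting each instrument so that its exposure effect is positive and weighting by the inverse variance of the outcome effect. The function name and the toy summary statistics are hypothetical, and standard errors and tests are omitted.

```python
def mr_egger(beta_exposure, beta_outcome, se_outcome):
    """Return (intercept, slope) of an inverse-variance weighted Egger regression."""
    # Orient every instrument so that its exposure effect is positive.
    bx = [abs(x) for x in beta_exposure]
    by = [y if x >= 0 else -y for x, y in zip(beta_exposure, beta_outcome)]
    w = [1.0 / s ** 2 for s in se_outcome]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, bx)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, by)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, bx))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, bx, by))
    slope = sxy / sxx                   # causal effect estimate
    intercept = y_bar - slope * x_bar   # average directional pleiotropy
    return intercept, slope


# Toy instruments (hypothetical): outcome effect = 0.02 + 0.5 * exposure effect.
beta_x = [0.1, -0.2, 0.3, 0.4]
beta_y = [0.07, -0.12, 0.17, 0.22]
se_y = [0.01, 0.02, 0.01, 0.02]
egger_intercept, egger_slope = mr_egger(beta_x, beta_y, se_y)
```

The toy data lie exactly on a line, so the sketch recovers an intercept of 0.02 (the pleiotropy term) and a slope of 0.5 (the causal effect).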
Latent causal variables
[00206] LCV is a recent method that makes use of the MR graphical model to evaluate whether an observed genetic correlation can be attributed to a causal relationship 11. LCV is based on a 2D analysis of summary statistics as in MR methods, with two notable differences. First, it uses a latent variable to model the mediation of genetic correlation between two traits. This allows for the estimation of the full or partial proportion of genetic causal relation between two traits. Second, it takes as input all summary statistics and does not require a set of independent instruments. On the other hand, unlike MR methods, LCV does not address reverse causality, and it does not estimate causal effect sizes.
[00207] Scripts are provided on our github repository
github.com/rivas-lab/public-resources/tree/master/uk_biobank/laboratory-tests.
Polygenic prediction within and across populations
[00208] We applied batch screening iterative lasso (BASIL) and fit a multivariate Lasso regression model 38. For each trait, we randomly split White British individuals with non-missing values into 60% training, 20% validation, and 20% test sets. With the 127 covariates described above, we used the training and validation sets to fit the model using the R snpnet package (github.com/junyangq/snpnet/) with the default parameters for phenotypes after statin adjustments. For microalbumin in urine, we used a simpler model with a limited set of covariates (age, sex, and the first 10 PCs), given the smaller number of individuals. For two traits, Lipoprotein A and total bilirubin, we noticed that the snpnet package did not find the optimal lambda within the default maximum number of iterations (100). We took the model from the 100th iteration as the best model among those tested during the training phase.
[00209] Using the beta values for array-genotyped SNPs and covariates from the multivariate Lasso regression, we computed a polygenic risk score for each individual with the PLINK2 --score subcommand and evaluated the goodness of fit using the test set. Specifically, we computed the correlation coefficient R for continuous traits or the ROC-AUC metric for binary traits, both for risk scores computed from genotypes and covariates and for risk scores computed from covariates only, and took the difference between the two as the increment of predictive performance.
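The increment of predictive performance can be sketched in Python as follows. The function names and toy scores are hypothetical; the sketch assumes Pearson correlation for R and a simple pairwise (rank-based) ROC-AUC, and its quadratic-time AUC is intended only for small illustrative inputs.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally sized sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def roc_auc(scores, labels):
    """Probability that a random case scores above a random control (ties count 1/2)."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def performance_increment(full_score, covariate_score, outcome, binary):
    """Metric of the genotype + covariate score minus that of the covariate-only score."""
    metric = roc_auc if binary else pearson_r
    return metric(full_score, outcome) - metric(covariate_score, outcome)


# Toy test-set scores (hypothetical values).
delta_auc = performance_increment([0.1, 0.3, 0.4, 0.8], [0.1, 0.6, 0.4, 0.8], [0, 0, 1, 1], binary=True)
delta_r = performance_increment([1, 2, 3, 4], [1, 3, 2, 4], [1, 2, 3, 4], binary=False)
```

In the toy binary example the AUC rises from 0.75 to 1.0, and in the toy quantitative example R rises from 0.8 to 1.0.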
[00210] We applied the same evaluation procedure for the four non-White British populations in the UK Biobank: Non-British White, East Asian, South Asian, and African. We evaluated the transferability of our polygenic risk score within and across ethnic cohorts by comparing the increments of predictive performance between White British and the other four populations.
PRS-PheWAS
[00211] We used R’s fisher.test implementation of Fisher’s exact test to compare individuals in the 40th-60th percentiles of the PRS with those in the top and bottom 0.1% and 0.1-1%.
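For readers without R at hand, a two-sided Fisher's exact test on a 2x2 table can be sketched in Python as below. This is an illustrative reimplementation and not R's fisher.test itself; it sums the hypergeometric probabilities of all tables that are no more likely than the observed one. The layout of the table (PRS tail versus middle bin in rows, cases versus controls in columns) and the toy counts are assumptions for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum all tables at most as likely as the observed one (small relative tolerance).
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, total)


# Toy counts (hypothetical): rows = PRS tail vs. 40th-60th percentile bin,
# columns = cases vs. controls.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

For this toy table the p-value is 34/70, or about 0.486.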
Prediction comparison of phenotype and biomarker PRSs
[00212] We used R’s glm implementation and McFadden’s adjusted pseudo-R^2 from DescTools (binary outcomes) or R’s lm implementation and reported adjusted R^2 (continuous outcomes), along with relevant F tests with the anova command, to evaluate prediction.
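McFadden's adjusted pseudo-R^2 follows the standard formula 1 - (lnL_model - K) / lnL_null, which the Python helper below evaluates from log-likelihoods that are assumed to be available from a fitted logistic model and its intercept-only counterpart. The function name and example values are hypothetical, and how K is counted (for example, whether the intercept is included) is left to the caller and may differ from the DescTools convention.

```python
def mcfadden_adjusted_r2(loglik_model, loglik_null, n_params):
    """McFadden's adjusted pseudo-R^2: 1 - (lnL_model - K) / lnL_null."""
    if loglik_null >= 0:
        raise ValueError("the null log-likelihood is expected to be negative")
    return 1.0 - (loglik_model - n_params) / loglik_null


# Toy log-likelihoods (hypothetical): K = 3 estimated parameters.
pseudo_r2 = mcfadden_adjusted_r2(-80.0, -100.0, 3)
```

With these toy values the adjusted pseudo-R^2 is 0.17, compared with an unadjusted value of 0.20.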
Table 1. Estimated adjustment based on statin usage.
For serum lab phenotypes we estimated the additional constant effect (“Additive”) that statins seem to have on the trait once they are started; "Multiplicative" alternatively means the multiplier effect of statins; and the P-value is the Wilcoxon signed rank test for paired samples comparing whether the pre- and on-statin values seem to differ meaningfully.
Figure imgf000055_0001
Table 2. Description of 38 measured and derived lab phenotypes.
Lab phenotype name, abbreviation, units of measurement, the UK Biobank field ID, Global Biobank Engine phenotype ID, whether the phenotype is defined as binary (B) or quantitative (Q), whether the phenotype is adjusted for statin (Y) or not (N), whether the phenotype is adjusted for covariates (Y) or not (N), the total number of unrelated individuals across the White British, Non-British White, African, East Asian, and South Asian population subsets in UK Biobank, the number of loci identified from GWAS (the number of independent loci, the number of imputed 1000 Genomes Phase 3 variants with MAF > 1%, the number of protein-altering variants, the number of PTVs, the number of HLA alleles with posterior probability >= 0.8, the number of single CNVs, and the number of rare aggregate CNVs), and the GBE URL.
Figure imgf000056_0001
Figure imgf000057_0001
Figure imgf000058_0001
Figure imgf000059_0001
Table 3. Diabetes correlation estimates.
For each trait pair, the Spearman correlation of the traits across all unrelated White British individuals for which the traits were defined is presented. AnyAntidiabetic is defined as any non-insulin drug from the oral antidiabetics and metformin codes presented in Eastwood et al.; T2D is the definition of type 2 diabetes presented in Eastwood et al.; fasting glucose is the glucose measurement for the individuals with a self-reported fasting time between 8 and 24 hours; HighConfDiabetes is a combination of self-report and ICD codes presented in (DeBoever et al. 2018); GenericMetformin uses only Metformin and its generic forms; FamilyHistoryDiabetes is defined as 0 or 1 depending on whether the individual has self-reported a father, mother, or sibling with diabetes; and HbA1c.diabetic is defined as a binary indicator of the individual having a measured HbA1c greater than 48.
Figure imgf000060_0001
Figure imgf000061_0001
Figure imgf000062_0001
Figure imgf000063_0001
Figure imgf000064_0001
Figure imgf000065_0001
Figure imgf000066_0001
Figure imgf000067_0001
Figure imgf000068_0001
Figure imgf000069_0001
Figure imgf000070_0001
Figure imgf000071_0001
Figure imgf000072_0001
Figure imgf000073_0001
Figure imgf000074_0001
Figure imgf000075_0001
Figure imgf000076_0001
Figure imgf000077_0001
Figure imgf000078_0001
Figure imgf000079_0001
Figure imgf000080_0001
Figure imgf000081_0001
Figure imgf000082_0001
Figure imgf000083_0001
Figure imgf000084_0001
Figure imgf000085_0001
Figure imgf000086_0001
Figure imgf000087_0001
Figure imgf000088_0001
Figure imgf000089_0001
Figure imgf000090_0001
Figure imgf000091_0001
Figure imgf000092_0001
Figure imgf000093_0001
Figure imgf000094_0001
Figure imgf000095_0001
Figure imgf000096_0001
Figure imgf000097_0001
Figure imgf000098_0001
Figure imgf000099_0001
Figure imgf000100_0001
Figure imgf000101_0001
Figure imgf000102_0001
Figure imgf000103_0001
Figure imgf000104_0001
Figure imgf000105_0001
Figure imgf000106_0001
Figure imgf000107_0001
Figure imgf000108_0001
Figure imgf000109_0001
Figure imgf000110_0001
Figure imgf000111_0001
Figure imgf000112_0001
Figure imgf000113_0001
Figure imgf000114_0001
Figure imgf000115_0001
Figure imgf000116_0001
Table 7. Association results for non-coding variants across the 38 lab phenotypes (p < 5e-8).
The non-coding variants characterized on the imputed 1000 Genomes Phase I variants (ID, variant), their positions in centimorgans (CM), and their association with the lab phenotype (trait). Effect size allele (A1), estimated effect size (BETA), standard error (SE), p-value of association (P), minor allele frequency (MAF), whether the variant is outside of the MHC region (is_outside_of_MHC), gene symbol (Gene Symbol), and whether the absolute value of the estimated effect size deviates from the standard deviation range estimated from the linear fit between log minor allele frequency and absolute value of estimated effect size (outlier, see methods for more details).
Figure imgf000117_0001
Figure imgf000118_0001
Figure imgf000119_0001
Figure imgf000120_0001
Figure imgf000121_0001
Figure imgf000122_0001
Figure imgf000123_0001
Figure imgf000124_0001
Figure imgf000125_0001
Figure imgf000126_0001
Figure imgf000127_0001
Figure imgf000128_0001
Figure imgf000129_0001
Figure imgf000130_0001
Figure imgf000131_0001
Figure imgf000132_0001
Figure imgf000133_0001
Figure imgf000134_0001
Figure imgf000135_0001
Figure imgf000136_0001
Figure imgf000137_0001
Table 8. (a) HLA alleles found to be associated to the 38 lab phenotypes via both PLINK association tests and Bayesian Model Averaging (BMA). (b) Other, non-lab phenotypes significantly associated (via PLINK and BMA) to the 37 alleles that had significant results in (a).
Tables enumerate associations’ BETA, SE, T/Z STAT values (depending on the type of test), P values from PLINK, and the same P values that have been Benjamini-Yekutieli adjusted (BY_ADJ_P). The tables also contain probabilities of model inclusion from BMA. The tables only enumerate those associations that were found to have both a PLINK association p-value <= 0.05/10000 and a BMA posterior probability >= 0.8.
Table 8(a)
Figure imgf000138_0001
Figure imgf000139_0001
Figure imgf000140_0001
Figure imgf000141_0001
Figure imgf000142_0001
Table 9. Copy number variation associated to the 38 lab phenotypes. Bonferroni p < 0.05/10000. Columns in the provided data file correspond to the phenotype, chromosome and centroid position of each CNV tested, CNV ID (formatted as chrom:bp1-bp2_del/dup (del denoted by - and dup by +)), reference copy number (always N), alternate CNV (always denoted by +), tested “allele” (usually +), genotype model (ADD is additive), N, estimated beta/log odds ratio, standard error of estimate, t/z-statistic, and p-value.
Figure imgf000143_0001
Table 10. Rare variant CNV test.
Bonferroni p < 0.01/25000. Columns in the provided data file correspond to the phenotype, chromosome and centroid position of each gene tested, gene name, reference copy number (always N), burden of CNV (always denoted by +), tested “allele” (usually +), genotype model (ADD is additive), N, estimated beta/log odds ratio, standard error of estimate, t/z-statistic, and p-value.
Figure imgf000144_0001
Figure imgf000145_0001
Figure imgf000146_0001
Figure imgf000147_0001
Figure imgf000148_0001
Figure imgf000149_0001
Figure imgf000150_0001
Figure imgf000151_0001
Figure imgf000152_0001
Figure imgf000153_0001
Figure imgf000154_0001
Figure imgf000155_0001
Figure imgf000156_0001
Figure imgf000157_0001
Table 12. Genomic region enrichment analysis tool (GREAT) applied to summary statistic data from 38 lab phenotypes and the mouse genome informatics (MGI) phenotype ontology.
The lab test phenotype (Trait), the enriched mouse phenotype ontology term (Ontoloty_term_ID, Ontoloty_term), its rank (Rank), -log10(GREAT binomial test P-value) (log10BPval), the fold change in the GREAT binomial test (BFold), and the link to the Mouse Genome Informatics website for the enriched ontology term (MGI_URL).
Figure imgf000158_0001
Figure imgf000159_0001
Figure imgf000160_0001
Figure imgf000161_0001
Figure imgf000162_0001
Figure imgf000163_0001
Figure imgf000164_0001
Figure imgf000165_0001
Figure imgf000166_0001
Figure imgf000167_0001
Figure imgf000168_0001
Figure imgf000169_0001
Table 13. Association results for the targeted phenome-wide association study (p < 1e-5).
The variant and its ID (Variant, Variant_ID) and its association to disease outcomes (Phenotype) with the corresponding Global Biobank Engine phenotype ID (GBE_ID). The -log10 p-value of association (log10P), estimated effect size (log odds ratio, LOR), standard error of effect size estimate (SE), Gene Symbol (Gene_symbol), predicted protein-truncating or protein-altering variant (Csq), predicted major consequence (Consequence), whether the variant is outside of the MHC region (is_outside_of_MHC), whether the variant is LD independent based on LD pruning (ld_indep), and the URLs for the corresponding pages on Global Biobank Engine (GBE_variant_page and GBE_phenotype_page).
Figure imgf000170_0001
Figure imgf000171_0001
Figure imgf000172_0001
Figure imgf000173_0001
Figure imgf000174_0001
Figure imgf000175_0001
Figure imgf000176_0001
Figure imgf000177_0001
Figure imgf000178_0001
Figure imgf000179_0001
Figure imgf000180_0001
Figure imgf000181_0001
Table 15. Causal inference results using MR-Egger and LCV.
Each row represents a significant exposure-outcome pair by either MR-Egger or LCV (FDR 10%). The edge type marks if the causal link was found by MR-Egger only, LCV only, or both. Estimated causal effects are presented for all pairs.
Figure imgf000182_0001
Figure imgf000183_0001
Figure imgf000184_0001
Figure imgf000185_0001
Figure imgf000186_0001
Figure imgf000187_0001
Figure imgf000188_0001
Figure imgf000189_0001
Figure imgf000190_0001
Figure imgf000191_0001
Figure imgf000192_0001
Figure imgf000193_0001
Figure imgf000194_0001
Figure imgf000195_0001
Figure imgf000196_0001
Table 16. Predictive performance of lab phenotypes from genetic data within and across populations.
The laboratory phenotype (Phenotype), whether the phenotype is binary (bin) or quantitative (qt), evaluated population (population), the increments of predictive performance (AUC for binary traits and R for quantitative traits) from covariate-only model to the model with both covariates and genotypes (delta_R_or_AUC), predictive performance measures of the model with genotype and covariates (Genotype_and_covariates), the model with covariates (Covariates_only), and the model with genotypes (Genotype_only), and their trans-populational comparison with respect to the White British population shown in percent (Relative_to_WB_delta_R_or_AUC, Relative_to_WB_Genotype_and_covariates, Relative_to_WB_Covariates_only, and Relative_to_WB_Genotype_only).
Figure imgf000197_0001
Figure imgf000198_0001
Figure imgf000199_0001
Figure imgf000200_0001
Figure imgf000201_0001
Figure imgf000202_0001
Figure imgf000203_0001
Figure imgf000204_0001
Table 17. Population-specific bias in polygenic prediction of the 38 lab phenotypes.
The rank of the increments in predictive performance comparing the PRS model with both genotype and covariates and covariate alone across 5 population groups are summarized. The sum across population for a given rank varies due to the ties in the ranks.
Figure imgf000205_0001
Table 18. Predictive power of multiple regression of laboratory tests. Each trait is treated independently and a regression model (linear or logistic, determined by outcome) is used. McFadden’s adjusted R^2 (for binary outcomes) and adjusted R^2 (for continuous outcomes) are presented for models which contain just covariates or covariates with the traits of interest. All regressions were run with age, sex, genotyping array, 40 principal components of the genotyping matrix, age squared, Townsend deprivation index, and age-sex interaction. Type 2 diabetes additionally had covariates of BMI and waist-to-hip ratio and interactions of each with age and sex, and liver fat percentage had covariates of alcohol and interactions with age and sex.
Figure imgf000205_0002
Figure imgf000206_0001
Table 19. F test for improved predictive performance of liver fat percentage.
F statistics for comparison of the explained variance under the covariate only model versus the trait PRS and combination of all biomarker PRSs, as well as comparisons of each of these with the combined model with all PRSs. We observed a consistent and significant improvement across all model comparisons.
Figure imgf000206_0002
Table 20. Regression coefficients for prediction of liver fat percentage. Regression coefficient terms and their standard errors estimated from individual liver fat percentage. All terms included in the full regression model are present in the table.
Figure imgf000206_0003
Figure imgf000207_0001
Figure imgf000208_0001
Table 21. F test for improved predictive performance of other traits.
F statistics for comparison of the explained variance under the covariate only model versus the trait PRS and combination of all biomarker PRSs, as well as comparisons of each of these with the combined model with all PRSs, for each of kidney and liver cancer, acute and all myocardial infarction, and type 2 diabetes. All results suggest significant improvement of biomarkers versus trait PRSs alone.
Figure imgf000209_0001
References
[00213] 1. Daniel Fry, Rachael Almond, Stewart Moffat, Mark Gordon & Parmesher Singh. UK Biobank Biomarker Project: Companion Document to Accompany Serum Biomarker Data. UK Biobank Document Showcase (2019). Available at: biobank.ctsu.ox.ac.uk/showcase/docs/serum_biochemistry.pdf. (Accessed: 4th October 2019)
[00214] 2. Liu, D. J. et al. Exome-wide association study of plasma lipids in >300,000 individuals. Nat. Genet. 49, 1758-1766 (2017).
[00215] 3. Ratcliffe, W. A. et al. Oestradiol assays: applications and guidelines for the provision of a clinical biochemistry service. Ann. Clin. Biochem. 25 (Pt 5), 466-483 (1988).
[00216] 4. Arnold, M. Biomarker assay quality procedures: approaches used to minimise systematic and random errors (and the wider epidemiological implications). UK Biobank biomarker document showcase (2019). Available at: biobank.ctsu.ox.ac.uk/crystal/docs/biomarker_issues.pdf. (Accessed: 4th October 2019)
[00217] 5. NIDDK. Quick Reference on UACR & GFR In Evaluating Patients with Diabetes for Kidney Disease. NIDDK (03/2012). Available at: niddk.nih.gov/health-information/professionals/clinical-tools-patient-education-outreach/quick-reference-uacr-gfr. (Accessed: 19th April 2019)
[00218] 6. Kathiresan, S. et al. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Medical Genetics 8, S17 (2007).
[00219] 7. Snell-Bergeon, J. K. et al. Evaluation of urinary biomarkers for coronary artery disease, diabetes, and diabetic kidney disease. Diabetes Technol. Ther. 11, 1-9 (2009).
[00220] 8. Eastwood, S. V. et al. Algorithms for the Capture and Adjudication of Prevalent and Incident Diabetes in UK Biobank. PLoS One 11, e0162388 (2016).
[00221] 9. Finucane, H. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. doi: 10.1101/103069
[00222] 10. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279-1283 (2016).
[00223] 11. Devlin, B. & Roeder, K. Genomic Control for Association Studies. Biometrics 55, 997-1004 (1999).
[00224] 12. Holbrook, J. A., Neu-Yilik, G., Hentze, M. W. & Kulozik, A. E. Nonsense-mediated decay approaches the clinic. Nat. Genet. 36, 801-808 (2004).
[00225] 13. Stenson, P. D. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1-9 (2014).
[00226] 14. Cohen, J. C., Boerwinkle, E., Mosley, T. H., Jr & Hobbs, H. H. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N. Engl. J. Med. 354, 1264-1272 (2006).
[00227] 15. DeBoever, C. et al. Medical relevance of protein-truncating variants across 337,205 individuals in the UK Biobank study. Nat. Commun. 9, 1612 (2018).
[00228] 16. Dilthey, A. T. et al. HLA*LA - HLA typing from linearly projected graph alignments. Bioinformatics (2019). doi: 10.1093/bioinformatics/btz235
[00229] 17. Dilthey, A. T. et al. High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs. PLoS Comput. Biol. 12, e1005151 (2016).
[00230] 18. Fernando, M. M. A. et a!. Defining the role of the MHC in autoimmunity: a review and pooled analysis. PLoS Genet. 4, e1000024 (2008).
[00231] 19. Fernandez, C. A. et al. HLA-DRB1*07:01 is associated with a higher risk of asparaginase allergies. Blood 124, 1266-1276 (2014).
[00232] 20. Aguirre, M., Rivas, M. & Priest, J. Phenome-wide burden of copy number variation in UK Biobank. doi: 10.1101/545996
[00233] 21. Walters, R. G. et al. A new highly penetrant form of obesity due to deletions on chromosome 16p11.2. Nature 463, 671-675 (2010).
[00234] 22. Horikawa, Y. et al. Mutation in hepatocyte nuclear factor-1 beta gene (TCF2) associated with MODY. Nat. Genet. 17, 384-385 (1997).
[00235] 23. Iwasaki, N. et al. Liver and kidney function in Japanese patients with maturity-onset diabetes of the young. Diabetes Care 21, 2144-2148 (1998).
[00236] 24. Nishigori, H. et al. Frameshift mutation, A263fsinsGG, in the hepatocyte nuclear factor-1 beta gene associated with diabetes and renal dysfunction. Diabetes 47, 1354-1355 (1998).
[00237] 25. Heisterkamp, N., Groffen, J., Warburton, D. & Sneddon, T. P. The human gamma-glutamyltransferase gene family. Hum. Genet. 123, 321-332 (2008).
[00238] 26. Pirttila, T. J. et al. Cystatin C expression is associated with granule cell dispersion in epilepsy. Ann. Neurol. 58, 211-223 (2005).
[00239] 27. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291-295 (2015).
[00240] 28. Shi, H., Mancuso, N., Spendlove, S. & Pasaniuc, B. Local Genetic Correlation Gives Insights into the Shared Genetic Architecture of Complex Traits. Am. J. Hum. Genet. 101, 737-751 (2017).
[00241] 29. Shi, H., Kichaev, G. & Pasaniuc, B. Contrasting the genetic architecture of 30 complex traits from summary association data. doi: 10.1101/035907
[00242] 30. Calderon, D. et al. Inferring Relevant Cell Types for Complex Traits by Using Single-Cell Gene Expression. Am. J. Hum. Genet. 101, 686-699 (2017).
[00243] 31. McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495-501 (2010).
[00244] 32. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nature Genetics 47, 1236-1241 (2015).
[00245] 33. Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 50, 693-698 (2018).
[00246] 34. O’Connor, L. J. & Price, A. L. Distinguishing genetic correlation from causation across 52 diseases and complex traits. Nat. Genet. 50, 1728-1734 (2018).
[00247] 35. Curhan, G. C., Willett, W. C., Rimm, E. B. & Stampfer, M. J. A prospective study of dietary calcium and other nutrients and the risk of symptomatic kidney stones. N. Engl. J. Med. 328, 833-838 (1993).
[00248] 36. Wray, N. R., Kemper, K. E., Hayes, B. J., Goddard, M. E. & Visscher, P. M. Complex Trait Prediction from Genome Data: Contrasting EBV in Livestock to PRS in Humans. Genetics 211, 1131-1141 (2019).
[00249] 37. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267-288 (1996).
[00250] 38. Qian, J. et al. A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems. doi: 10.1101/630079
[00251] 39. Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584-591 (2019).
[00252] 40. Yki-Jarvinen, H. Fat in the liver and insulin resistance. Annals of Medicine 37, 347-356 (2005).
[00253] 41. Kotronen, A. et al. Prediction of non-alcoholic fatty liver disease and liver fat using metabolic and genetic factors. Gastroenterology 137, 865-872 (2009).
[00254] 42. Mahajan, A. et al. Refining the accuracy of validated target identification through coding variant fine-mapping in type 2 diabetes. Nat. Genet. 50, 559-571 (2018).
[00255] 43. Krapohl, E. et al. Multi-polygenic score approach to trait prediction. Mol. Psychiatry 23, 1368-1374 (2018).
[00256] 44. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203-209 (2018).
[00257] 45. Tanigawa, Y. et al. Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight novel adipocyte biology. doi: 10.1101/442715
[00258] 46. Levey, A. S. et al. A new equation to estimate glomerular filtration rate. PubMed - NCBI. Available at: ncbi.nlm.nih.gov/pubmed/19414839. (Accessed: 6th May 2019)
[00259] 47. Gusev, A. et al. Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nat. Genet. 50, 538-548 (2018).
[00260] 48. Smith, C. L. & Eppig, J. T. Expanding the mammalian phenotype ontology to support automated exchange of high throughput mouse phenotyping data generated by large-scale mouse knockout screens. J. Biomed. Semantics 6, 11 (2015).
[00261] 49. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
[00262] 50. Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. Elife 1, (2018).
[00263] 51. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289-300 (1995).
[00264] 52. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. PubMed - NCBI. Available at: ncbi.nlm.nih.gov/pubmed/14597658. (Accessed: 30th April 2019)
Example 2
Aggregation of Multiple Signals of Polygenic Risk to Improve Genetically Informed Prediction of Complex Traits
[00265] Here we describe a novel method for improving the predictive ability of disease models by leveraging association signals of numerous underlying traits, which could form the basis of a medical diagnostic tool. Using data from a large cohort study, the UK Biobank, with ~500,000 genotyped and phenotyped individuals, we developed precision polygenic risk scores (PRSs) for a number of clinical laboratory tests of serum and urine, including lipids, hormones, and measures of kidney function. We combined the biomarker PRSs with the trait PRS into a single combined score, which captures substantially more variance of the trait than the trait PRS alone and improves predictive performance, particularly in the tails of the distribution.
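Performance in the tails of a score distribution can be summarized as the disease prevalence among the highest-scoring held-out individuals. The following is a minimal sketch of that calculation; the function name, the tie handling, and the toy data are assumptions made for illustration and are not taken from the study.

```python
def top_bin_prevalence(scores, has_disease, top_fraction):
    """Fraction of cases among the individuals with the highest scores.

    scores       : one risk score per held-out individual
    has_disease  : 0/1 (or False/True) disease status, same order as scores
    top_fraction : size of the top bin, e.g. 0.005 for the top 0.5%
    """
    if len(scores) != len(has_disease) or not scores:
        raise ValueError("scores and has_disease must be non-empty and equal in length")
    n = len(scores)
    k = max(1, int(n * top_fraction))  # always keep at least one individual
    # Rank individuals from highest to lowest score and keep the top k
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:k] if has_disease[i]) / k


# Toy data for ten hypothetical individuals
toy_scores = [0.1, 0.9, 0.5, 0.8, 0.2, 0.7, 0.3, 0.4, 0.6, 0.0]
toy_status = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0]
prevalence_top_half = top_bin_prevalence(toy_scores, toy_status, 0.5)
```

Running the same function on the combined score and on the trait PRS alone, with the same held-out individuals and the same bin size, gives the kind of side-by-side comparison described in the text.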
[00266] For instance, when evaluating the prevalence of type 2 diabetes in the highest 0.5% bin of the held-out test individuals, we observe a prevalence of 67% with our combined score compared to 24% with the type 2 diabetes PRS alone. This improvement is also present at less extreme windows of risk (see FIG. 26).
[00267] Our system consists of three main parts:
1) a database of polygenic risk scores (PRSs) for traits underlying common diseases, including but not limited to those below, in addition to those of the diseases themselves on the whole population cohort(s) and ascertained subsets thereof;
2) a predictive model, generated in the same or different individuals, to aggregate these polygenic scores (and possibly other covariates) into a single combined measure of risk for the target trait or disease; and
3) a diagnostic tool which assesses individual risk, using the predictive model above to assign individuals to particular levels of risk.
[00268] Our results strongly suggest that the currently available PRSs for most diseases are lacking in their potential to predict traits. This can be quantified as the ratio of the predictive ability of the PRS versus that of the full set of SNPs in the individual, termed the SNP-level heritability. For most complex traits, the SNP-level heritability is substantially higher, supporting the notion that PRSs can be significantly improved.
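The ratio described in this paragraph can be written compactly. The symbols below are introduced here for illustration only and do not appear in the original text: R²_PRS denotes the trait variance explained by the PRS in held-out individuals, and h²_SNP denotes the SNP-level heritability.

```latex
% Fraction of the SNP-level heritability that a PRS currently captures
\[
  \rho \;=\; \frac{R^{2}_{\mathrm{PRS}}}{h^{2}_{\mathrm{SNP}}},
  \qquad 0 \le \rho \le 1 \ \text{(in expectation)} .
\]
% A value of \rho well below 1 indicates room for the PRS to improve.
```

As a purely hypothetical illustration, a PRS with R²_PRS = 0.05 for a trait with h²_SNP = 0.25 would give ρ = 0.2, meaning that four fifths of the SNP-level signal is not yet captured by the score.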
Polygenic risk score database
[00269] Our polygenic risk score database includes both publicly available, published results of genetic association studies, as well as novel datasets generated specifically for this purpose. This includes quantitative measures of health, including lipid measurements, glucose and HbA1c measurements, creatinine in serum and urine, cystatin C, potassium, and other proteins, metabolites, and elements; physiological measures such as pulse rate, blood pressure, EKG test results, blood oxygen, and other quantitative and quantized measurements of overall body state; anthropometrics such as height, weight, BMI, fat mass, waist circumference, lung capacity, grip strength, gait, and related indicators of physical state and ability; direct tests of derived cell lines, extracted samples, or other biological materials for proliferation, quantity, gene expression, protein expression, telomere length, methylation state, mitochondrial DNA content, organelle morphology, cellular morphology, chromatin structure, chromatin state, response to perturbation, response to stimulation, or any other specific assay of interest in the given sample or for the given disease or any of its comorbidities, risk factors or associated outcomes; medical events, such as drugs taken, surgeries performed, or diagnoses given; life events such as experiences in childhood, age of menarche, age of first facial hair, number of children, age of menopause, or any other associated or disease-relevant change over the life course; consumption patterns, such as water intake, food intake, and dietary preferences; mental and social measurements such as education level, smoking status, and depression, either for an individual or the postal code, county, state, or country in which they live; and proxies thereof, either through interviewing of relatives, extraction of death, birth, medical, infectious disease, cancer, and other records, or through asking individual participants about the health, social, physical, physiological and/or behavioral status of their relatives or acquaintances. This database might be generated from full cohorts or subsets of individuals with any set of these or other available phenotypic, technical, or other covariates to increase the signal of each PRS.
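To make the role of the database concrete, the sketch below stores per-variant weights for several traits and scores one genotyped individual against all of them. It assumes the conventional additive form of a PRS (a weighted sum of allele dosages), which this section does not spell out; the trait names, variant identifiers, and weights are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Additive PRS: sum over variants of (effect weight x allele dosage).

    dosages : dict mapping variant id -> allele count (0, 1, or 2)
    weights : dict mapping variant id -> per-allele effect weight
    Variants missing from the genotype are treated as dosage 0 (an assumption).
    """
    return sum(w * dosages.get(variant, 0.0) for variant, w in weights.items())


def score_all_traits(dosages, prs_database):
    """Compute one PRS per trait stored in the database."""
    return {trait: polygenic_score(dosages, weights)
            for trait, weights in prs_database.items()}


# Hypothetical database: trait -> {variant id -> weight}
example_database = {
    "ldl_cholesterol": {"variant_1": 0.5, "variant_2": -0.2},
    "hba1c": {"variant_2": 0.1, "variant_3": 0.3},
}
example_genotype = {"variant_1": 2, "variant_2": 1, "variant_3": 0}
example_scores = score_all_traits(example_genotype, example_database)
```

The resulting dictionary of per-trait scores is the input that a downstream predictive model would aggregate into a single risk estimate.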
Predictive model
[00270] The predictive model provides a way of aggregating information from multiple PRSs to maximize performance on a single target trait. The simplest version of this consists of predicting individual phenotypes using a regression model to weight information from the different polygenic scores. This can also be done using more advanced machine learning methods to aggregate information, including through random forest or deep neural network approaches.
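The simplest version described above, a regression that weights the different polygenic scores, can be sketched with ordinary least squares. This is an illustrative sketch only: it uses plain least squares via the normal equations, omits the adjusting covariates, and the function names and toy data are assumptions rather than the fitted model from the study.

```python
def fit_linear_weights(prs_rows, phenotypes):
    """Ordinary least squares with an intercept.

    prs_rows   : list of per-individual PRS vectors (one value per PRS)
    phenotypes : list of observed trait values, same order as prs_rows
    Returns [intercept, weight_1, ..., weight_k].
    """
    X = [[1.0] + [float(v) for v in row] for row in prs_rows]
    p = len(X[0])
    # Augmented normal equations: (X^T X | X^T y)
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * y for x, y in zip(X, phenotypes))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("PRS columns are collinear; weights are not identifiable")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def combined_score(weights, prs_row):
    """Apply fitted weights to one individual's PRS vector."""
    return weights[0] + sum(w * s for w, s in zip(weights[1:], prs_row))


# Toy data: two PRSs per individual; trait built as 1 + 2*prs_a - 3*prs_b
toy_prs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
toy_trait = [1, 3, -2, 0, 2]
toy_weights = fit_linear_weights(toy_prs, toy_trait)
```

For a binary disease outcome a logistic model would be the more usual choice, and the random forest or neural network approaches mentioned above would replace the fitting step while keeping the same inputs and output.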
[00271] Model fitting can consist of multiple stages, in which model selection is done in the presence of adjusting covariates, or is meta-analyzed across multiple studies or across different temporal collections of the same study; within an individual, in which multiple measurements are used along with information about the individual’s state (e.g., drugs taken, major surgeries undergone, etc.); or in proxies of individuals, in which data from their relatives or from geographically, socially, economically, and/or behaviorally similar individuals are aggregated to provide an estimate of the effects of each polygenic score on the outcomes of a given person.
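One common way to meta-analyze an effect estimate across studies is fixed-effect inverse-variance weighting. The text does not say which meta-analysis method is intended, so the sketch below is an assumed stand-in, with hypothetical per-study estimates of the weight of a single polygenic score.

```python
def inverse_variance_meta(estimates, standard_errors):
    """Fixed-effect inverse-variance meta-analysis of one effect estimate.

    estimates       : per-study effect estimates for the same polygenic score
    standard_errors : per-study standard errors, same order as estimates
    Returns (pooled_estimate, pooled_standard_error).
    """
    if len(estimates) != len(standard_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    # Each study is weighted by the inverse of its sampling variance
    weights = [1.0 / (se * se) for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = (1.0 / total) ** 0.5
    return pooled, pooled_se


# Hypothetical estimates of one PRS weight from two studies
pooled_effect, pooled_error = inverse_variance_meta([0.2, 0.4], [0.1, 0.1])
```

Studies with smaller standard errors contribute more to the pooled estimate, and the pooled standard error shrinks as studies are added.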
Diagnostic tool
[00272] Finally, we apply the results of the diagnostic tool by genotyping the individual to be tested, collecting some or all of the associated covariates (as part of the predictive model and/or the polygenic score database detailed above), and generating an estimate of the trait value, disease likelihood, age of onset, severity, number of complications, etc. that are included therein.
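Assigning an individual to a risk level can be sketched as locating that individual's combined score within a reference distribution and mapping the percentile to a label. The percentile cutoffs and labels below are hypothetical placeholders chosen for illustration; the document does not specify them.

```python
from bisect import bisect_right


def score_percentile(score, sorted_reference_scores):
    """Percentage of reference individuals with a score at or below `score`."""
    if not sorted_reference_scores:
        raise ValueError("reference distribution is empty")
    rank = bisect_right(sorted_reference_scores, score)
    return 100.0 * rank / len(sorted_reference_scores)


def risk_level(percentile,
               cutoffs=((99.5, "very high"), (90.0, "high"), (10.0, "typical"))):
    """Map a percentile to a label; cutoffs are checked from highest to lowest."""
    for threshold, label in cutoffs:
        if percentile >= threshold:
            return label
    return "low"


# Hypothetical reference cohort with combined scores 0..999 (already sorted)
reference = list(range(1000))
individual_percentile = score_percentile(995, reference)
individual_level = risk_level(individual_percentile)
```

In practice the reference distribution would come from the cohort used to build the predictive model, and the cutoffs would be set from observed prevalence in each score window.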
[00273] We envision this diagnostic would improve portability to non-European populations; increase the predictive power and clinical utility in general; and, in combination with a direct genotyping service, provide a fast, easy, and effective strategy for disease and trait forecasting at any stage of life.

Claims

WHAT IS CLAIMED IS:
1. A method of predicting the risk of an individual developing a polygenic disease or medically relevant trait, the method comprising:
a) providing a database comprising correlation data for associations between genetic variants and the disease or medically relevant trait based on genome-wide testing of a population for genetic variants associated with the disease or the medically relevant trait;
b) genotyping the individual to determine if the individual has one or more of the genetic variants associated with the disease or the medically relevant phenotypic trait;
c) calculating at least one polygenic risk score based on the genetic variants detected in the individual by genotyping, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait.
2. The method of claim 1, wherein the genetic variants are selected from the group consisting of protein-truncating variants (PTVs), protein-altering variants, non-coding variants, human leukocyte antigen (HLA) allelotypes, and copy number variations (CNVs).
3. The method of claim 2, wherein the individual has at least one protein truncating variant (PTV), copy number variation (CNV), or human leukocyte antigen (HLA) allele that correlates with a size-effect change in a measurement of at least one clinical biomarker in the individual compared to that of the clinical biomarker in a control subject who has a wild-type allele.
4. The method of any one of claims 1 to 3, wherein the individual has a plurality of variant alleles selected from Tables 5-10 and 13.
5. The method of claim 4, wherein the individual has at least one HLA allele selected from Tables 8a and 8b.
6. The method of claim 4, wherein the individual has at least one CNV selected from Tables 9 and 10.
7. The method of claim 4, wherein the individual has at least one PTV selected from the group consisting of:
a) a PTV in APOB that correlates with decreased levels of LDL, apolipoprotein B or triglycerides;
b) a PTV in GPT that correlates with decreased levels of alanine aminotransferase;
c) a PTV in IQGAP2 and ALB that correlates with decreased levels of albumin;
d) a PTV in GPLD1 and ALPL that correlates with decreased levels of alkaline phosphatase;
e) a PTV in APOA5, CHFT8, and LCAT that correlates with decreased levels of apolipoprotein A and HDL;
f) a PTV in ZNF229 that correlates with decreased levels of apolipoprotein B;
g) a PTV in PDE3B that correlates with decreased levels of apolipoprotein B or triglycerides;
h) a PTV in CST3 that correlates with decreased levels of cystatin C or triglycerides;
a PTV in SAG that correlates with decreased levels of bilirubin;
i) a PTV in SLC22A2 and RNF186 that correlates with decreased levels of estimated glomerular filtration rate (eGFR);
j) a PTV in RHAG and G6PC2 that correlates with decreased levels of glucose or HbA1c;
k) a PTV in MSR1 that correlates with decreased levels of IGF1;
l) a PTV in LPA that correlates with decreased levels of lipoprotein A;
m) a PTV in TNFRSF13B that correlates with decreased levels of non-albumin protein;
n) a PTV in ANGPTL8 and LPL that correlates with decreased levels of triglycerides;
o) a PTV in DRD5, PDZK1, or SLC22A12 that correlates with decreased levels of urate;
p) a PTV in INSC that correlates with decreased levels of vitamin D;
q) a PTV in LIPC, PDE3B, and LPL that correlates with increased levels of apolipoprotein A or HDL;
r) a PTV in FUT2 or RAP1GAP that correlates with increased levels of alkaline phosphatase;
s) a PTV in ABCG8 that correlates with increased levels of cholesterol;
t) a PTV in RNF186 or SLC22A2 that correlates with increased levels of creatinine;
u) a PTV in SLCO1B1 or UGT1A10 that correlates with increased levels of bilirubin;
v) a PTV in RORC, SIGLEC1, or UPB1 that correlates with increased levels of gamma glutamyltransferase;
w) a PTV in ANGPTL8 that correlates with increased levels of HDL;
x) a PTV in SLC22A1 and SLC22A2 that correlates with increased levels of lipoprotein A;
y) a PTV in COL4A4 that correlates with increased levels of microalbumin in serum and urine and an increased urine albumin to creatinine ratio;
z) a PTV in HSPA6 that correlates with increased levels of total protein and non-albumin protein;
aa) a PTV in APOA5 that correlates with increased levels of triglycerides;
bb) a PTV in PYGM and SLC22A11 that correlates with increased levels of urate; and
cc) a PTV in APOB, DHCR7, FLG, and NPFFR2 that correlates with increased levels of vitamin D.
8. The method of any one of claims 3 to 7, wherein the individual has at least one HLA allele selected from the group consisting of:
a) HLA-B*08:01, HLA-DRB1*03:01, or HLA-DRB1*07:01 (OR = 0.796) that correlates with abnormal rheumatoid factor levels above 16 U/mL;
b) A HLA-DR3 haplotype that correlates with a predisposition for developing lupus, multiple sclerosis, or type 1 diabetes; and
c) HLA-DRB1*07:01 allelotype that correlates with a predisposition for developing an asparaginase allergy.
9. The method of any one of claims 1 to 8, wherein at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement.
10. The method of claim 9, wherein the clinical biomarker is a serum or urine biomarker.
11. The method of claim 9 or 10, wherein the clinical biomarker is selected from the group consisting of alanine aminotransferase, albumin, alkaline phosphatase, apolipoprotein A, apolipoprotein B, aspartate aminotransferase, calcium, cholesterol, C-reactive protein, creatinine, cystatin-C, direct bilirubin, gamma glutamyltransferase, glucose, glycated hemoglobin (HbA1c), HDL cholesterol, insulin-like growth factor 1 (IGF-1), low-density lipoprotein (LDL) direct, lipoprotein-A, phosphate, sex hormone binding globulin (SHBG), testosterone, total bilirubin, total protein, triglycerides, urate, urea, vitamin D, creatinine in urine, estimated glomerular filtration rate (eGFR), microalbumin in urine, potassium in urine, sodium in urine, non-albumin protein, urine albumin to creatinine ratio higher than 30 mg/g, microalbumin higher than 40 mg/L, rheumatoid factor higher than 16 IU/ml, and estradiol higher than 212 pmol/L.
12. The method of any one of claims 9 to 11, further comprising measuring the clinical biomarker in the individual.
13. The method of any one of claims 1 to 12, wherein at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and the disease or the medically relevant trait.
14. The method of claim 13, wherein the disease or medically relevant trait is selected from the group consisting of type 2 diabetes, primary biliary cirrhosis, rheumatoid arthritis, schizophrenia, lupus, ulcerative colitis, sunburn, Crohn’s disease, allergy/eczema, hypothyroidism, age of menarche, age of menopause, systolic blood pressure, basophil percentage, eosinophil percentage, hematocrit, hemoglobin concentration, reticulocyte count, reticulocyte percentage, immature reticulocyte fraction, lymphocyte count, lymphocyte percentage, mean corpuscular hemoglobin (MCH), MCH concentration, mean corpuscular volume (MCV), mean platelet thrombocyte volume (MPV), mean reticulocyte volume, mean sphered cell volume, monocyte count, monocyte percentage, neutrophil count, neutrophil percentage, platelet count, platelet crit, platelet distribution width (PDW), red blood cell erythrocyte (RBC) count, Red blood cell erythrocyte distribution width (RDW), reticulocyte count, reticulocyte percentage, white blood cell leukocyte count (WBC) count, respiratory disease, amyotrophic lateral sclerosis (ALS), Alzheimer’s disease, age related macular degeneration (AMD), any stroke, any ischemic stroke, large artery stroke, cardioembolic stroke, small vessel stroke, age of menarche, prostate cancer, number of cancers, number of operations, average weekly beer/cider intake, average weekly spirits intake, body size at age 10, height size at age 10, fathers age at death, mothers age at death, deep venous thrombosis (DVT), gastric reflux, gall stones, kidney stone, hyperthyroidism, osteoporosis, uterine fibroids, hay fever/allergic rhinitis, enlarged prostate, gout, hiatus hernia, sitting height, birth weight, mother’s Alzheimer’s disease, neuroticism, best measure of forced expiratory volume in 1 second FEV1 (FEV 1 s best), best measure of forced vital capacity (FVC best), predicted percentage of forced expiratory volume in 1 second (FEV 1s predicted), nerves anxiety tension depression, body mass index (BMI), pulse wave arterial stiffness, whole body fat mass, whole body fat free mass, whole body water mass, age of first facial hair, hair balding pattern 2, hair balding pattern 3, hair balding pattern 4, diabetes, cancer, fracture bones, oral contraceptives, hormone replacement therapy, bilateral oophorectomy, forced vital capacity (FVC), forced expiratory volume in 1 second (FEV 1 s), peak expiratory flow (PEF), hysterectomy, pregnancy terminations, age of primiparous, heel bone mineral density (BMD) left, heel BMD right, pulse rate, pulse wave peak to peak, hand grip strength left, leg pain on walking, hand grip strength right, tinnitus, waist circumference, hip circumference, standing height, maximum workload during fitness test, maximum heart rate during fitness test, qualifications a/as levels, diabetes related eye disease, cataract, painful gums, bleeding gums, dentures, vascular problems, angina, high blood pressure, fractured bones, blood clot in the leg, emphysema/bronchitis, asthma, hay fever allergic rhinitis eczema, cholesterol lowering medication, hormone replacement therapy, estrogen receptor (ER) negative breast cancer, ER positive breast cancer, all breast cancer, coronary artery disease, asthma, insomnia, sleep hours, anorexia, autism, celiac disease, EGFR decline, microalbuminuria, kidney disease.
15. The method of any one of claims 1 to 14, wherein a Spearman correlation is used to generate the correlation data.
16. The method of any one of claims 1 to 14, wherein the correlation data is selected from Tables 4-10 and 13.
17. The method of any one of claims 1 to 16, wherein at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement, and at least one PRS is calculated for a genetic association between the genetic variants detected in the individual by genotyping and the disease or the medically relevant trait.
18. The method of claim 17, further comprising:
a) generating a predictive model using one or more algorithms, wherein said predictive model is based on said at least one PRS for the genetic association with a size effect on a clinical biomarker measurement and said at least one PRS for the genetic association with the disease or the medically relevant trait; and
b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS.
19. The method of claim 18, wherein said one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm.
20. The method of claim 19, wherein the machine learning algorithm is a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
21. The method of any one of claims 1 to 20, further comprising treating the individual for the disease if the polygenic risk score indicates that the individual has the disease.
22. The method of any one of claims 1 to 21, wherein said genotyping comprises sequencing at least part of a genome of one or more cells from the individual.
23. The method of claim 22, wherein said genotyping comprises sequencing the whole genome of the individual.
24. The method of any one of claims 1 to 23, further comprising adjusting at least one PRS for one or more covariates.
25. The method of claim 24, wherein the covariates are selected from the group consisting of age, sex, socioeconomic status, ethnicity, and anthropometric measurements.
26. The method of any one of claims 1 to 25, wherein the disease is myocardial infarction.
27. The method of claim 26, wherein the method comprises calculating at least one polygenic risk score for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement selected from tyrosine, glycoprotein acetyls, CH2 in fatty acids, arachidonic acid, pulse, sleep, vitamin D, urate, triglycerides, total protein, sodium in urine, phosphate, lipoprotein A, high density lipoprotein cholesterol, low density lipoprotein cholesterol, total cholesterol, ApoA, ApoB, Albumin, HbA1c, hemoglobin, diastolic blood pressure, CysC, proinsulin, glycoprotein, omega 6 fatty acid, macrophage colony stimulating factor, cutaneous T-cell-attracting chemokine, waist to hip ratio, fat mass, total protein, sleep hours, urate, sodium in urine, gamma glutamyltransferase, lymphocyte count, hand grip strength, forced vital capacity, fasting insulin (sex specific); and the disease, diabetes.
28. The method of claim 26 or 27, further comprising identifying whether additional risk factors are present selected from age, tobacco, high blood pressure, metabolic syndrome, obesity, overweight, pre-eclampsia, family history, physical activity, stress, drug use, sleep, autoimmune disease, and menopause, wherein the presence of one or more of said additional risk factors indicates that the individual is at increased risk of developing myocardial infarction.
29. The method of any one of claims 1 to 25, wherein the disease is diabetes.
30. The method of claim 29, wherein the method comprises calculating at least one polygenic risk score for a genetic association between the genetic variants detected in the individual by genotyping and a size effect on a clinical biomarker measurement selected from waist to hip ratio, fat mass, waist circumference, pulse, sex hormone binding globulin, IGF1, high density lipoprotein cholesterol, lipoprotein A, ApoA, alanine aminotransferase, hip circumference, HbA1c, glucose, diastolic blood pressure, BMI, platelet derived growth factor, vascular endothelial growth factor, total 20:0 long chain fatty acids, albumin, water intake, vitamin D, total bilirubin, testosterone, direct bilirubin, lymphocyte count, C-reactive protein, left hand grip strength, forced vital capacity, forced expiratory volume in 1 second, and total body fat, and various diabetes polygenic scores with and without adjustment for BMI.
31. The method of claim 30, further comprising adjusting at least one PRS for the Townsend deprivation index or socioeconomic status.
32. The method of any one of claims 29 to 31, further comprising identifying whether additional risk factors are present selected from smoking, alcohol consumption, gestational diabetes, family history of diabetes, obesity, overweight, age, heart disease, stroke, depression, PCOS, physical activity, menopause, and acanthosis nigricans, wherein the presence of one or more of said additional risk factors indicates that the individual is at increased risk of developing diabetes.
33. A database comprising correlation data between genetic variants and clinical biomarkers, diseases, and medically relevant traits, wherein the correlation data is selected from Tables 4-10 and 13.
34. A computer implemented method for predicting the risk of an individual developing a disease or medically relevant phenotypic trait, the computer performing steps comprising:
a) receiving genome sequencing data for an individual;
b) identifying variant alleles present in the individual from the genome sequencing data, wherein the individual has a plurality of variant alleles selected from Tables 5-10 and 13;
c) calculating at least one polygenic risk score using the database of claim 33 based on the variant alleles present in the individual, wherein the polygenic risk score (PRS) indicates the risk of the individual developing the disease or the medically relevant trait; and
d) displaying information regarding the risk of the individual developing the disease or the medically relevant trait.
35. The computer implemented method of claim 34, further comprising:
a) generating a predictive model using one or more algorithms, wherein the predictive model is based on at least one PRS for a genetic association with a size effect on a clinical biomarker measurement and at least one PRS for a genetic association with the disease or the medically relevant trait; and
b) calculating a combined risk score from the predictive model, wherein the combined risk score better predicts the risk of the individual developing the disease or the medically relevant trait than each separate PRS.
36. The computer implemented method of claim 35, wherein said one or more algorithms are selected from the group consisting of a classification algorithm, a regression algorithm, and a machine learning algorithm.
37. The computer implemented method of claim 36, wherein the machine learning algorithm is a random forest algorithm, a deep neural network algorithm, or a Bayesian model averaging algorithm.
38. The computer implemented method of any one of claims 34 to 37, further comprising storing the information regarding the risk of the individual developing the disease or the medically relevant phenotypic trait in a database.
39. A system for predicting the risk of an individual developing a disease or medically relevant trait using the computer implemented method of any one of claims 34 to 38, the system comprising:
a) a storage component for storing data, wherein the storage component has instructions for predicting the risk of an individual developing a disease or medically relevant trait based on analysis of the genome sequencing data stored therein;
b) a computer processor for processing the genome sequencing data using one or more algorithms, wherein the computer processor is coupled to the storage component and configured to execute the instructions stored in the storage component in order to receive the inputted genome sequencing data and analyze the data according to the computer implemented method of any one of claims 34 to 38; and
c) a display component for displaying the information regarding the risk of the individual developing the disease or the medically relevant trait.
40. A non-transitory computer-readable medium comprising program instructions that, when executed by a processor in a computer, cause the processor to perform the method of any one of claims 34 to 38.
41. A kit comprising the non-transitory computer-readable medium of claim 40 and instructions for predicting the risk of an individual developing a disease or medically relevant trait.
PCT/US2020/034303 2019-05-24 2020-05-22 Methods for diagnosis of polygenic diseases and phenotypes from genetic variation WO2020242976A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962852738P 2019-05-24 2019-05-24
US62/852,738 2019-05-24

Publications (1)

Publication Number Publication Date
WO2020242976A1 (en) 2020-12-03

Family

ID=73553051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/034303 WO2020242976A1 (en) 2019-05-24 2020-05-22 Methods for diagnosis of polygenic diseases and phenotypes from genetic variation

Country Status (1)

Country Link
WO (1) WO2020242976A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735594A (en) * 2020-12-29 2021-04-30 北京优迅医疗器械有限公司 Method for screening disease phenotype related mutation sites and application thereof
CN112852949A (en) * 2021-02-23 2021-05-28 石河子大学 Molecular marker of Kazakh EH, primer pair and application thereof
CN113096816A (en) * 2021-03-18 2021-07-09 西安交通大学 Method, system, equipment and storage medium for establishing brain disease morbidity risk prediction model
WO2021178952A1 (en) * 2020-03-06 2021-09-10 The Research Institute At Nationwide Children's Hospital Genome dashboard
US11136386B2 (en) 2019-05-14 2021-10-05 Prometheus Biosciences, Inc. Methods of treating Crohn's disease or ulcerative colitis by administering inhibitors of tumor necrosis factor-like cytokine 1A (TL1A)
WO2022087478A1 (en) * 2020-10-23 2022-04-28 23Andme, Inc. Machine learning platform for generating risk models
CN115841872A (en) * 2023-02-22 2023-03-24 中国疾病预防控制中心环境与健康相关产品安全所 Method and device for predicting life of old people and computer readable storage medium
WO2023129621A1 (en) * 2021-12-29 2023-07-06 Illumina, Inc. Rare variant polygenic risk scores
CN116825208A (en) * 2023-06-06 2023-09-29 吉林大学 Multi-factor large-scale data integration analysis method based on Mendelian randomization
CN117334325A (en) * 2023-09-26 2024-01-02 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) Application of LCAT in diagnosis, treatment and recurrence prediction of hepatocellular carcinoma
CN117558039A (en) * 2024-01-09 2024-02-13 南京氧富智能医疗科技有限公司 Automatic artery naming model construction and naming method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160046996A1 (en) * 2010-09-13 2016-02-18 The Children's Hospital Of Philadelphia Common and Rare Genetic Variations Associated with Common Variable Immunodeficiency (CVID) and Methods of Use Thereof for the Treatment and Diagnosis of the Same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MOSTAFAVI ET AL.: "Variable prediction accuracy of polygenic scores within an ancestry group", ELIFE, vol. 9, no. Article e48376, 7 May 2019 (2019-05-07), pages 1 - 33, XP055763375 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11136386B2 (en) 2019-05-14 2021-10-05 Prometheus Biosciences, Inc. Methods of treating Crohn's disease or ulcerative colitis by administering inhibitors of tumor necrosis factor-like cytokine 1A (TL1A)
WO2021178952A1 (en) * 2020-03-06 2021-09-10 The Research Institute At Nationwide Children's Hospital Genome dashboard
WO2022087478A1 (en) * 2020-10-23 2022-04-28 23Andme, Inc. Machine learning platform for generating risk models
CN112735594B (en) * 2020-12-29 2024-04-16 北京优迅医疗器械有限公司 Method for screening mutation sites related to disease phenotype and application thereof
CN112735594A (en) * 2020-12-29 2021-04-30 北京优迅医疗器械有限公司 Method for screening disease phenotype related mutation sites and application thereof
CN112852949A (en) * 2021-02-23 2021-05-28 石河子大学 Molecular marker of Kazakh EH, primer pair and application thereof
CN113096816A (en) * 2021-03-18 2021-07-09 西安交通大学 Method, system, equipment and storage medium for establishing brain disease morbidity risk prediction model
CN113096816B (en) * 2021-03-18 2023-06-13 西安交通大学 Brain disease onset risk prediction model establishment method, system, equipment and storage medium
WO2023129621A1 (en) * 2021-12-29 2023-07-06 Illumina, Inc. Rare variant polygenic risk scores
CN115841872A (en) * 2023-02-22 2023-03-24 中国疾病预防控制中心环境与健康相关产品安全所 Method and device for predicting life of old people and computer readable storage medium
CN116825208A (en) * 2023-06-06 2023-09-29 吉林大学 Multi-factor large-scale data integration analysis method based on Mendelian randomization
CN117334325A (en) * 2023-09-26 2024-01-02 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) Application of LCAT in diagnosis, treatment and recurrence prediction of hepatocellular carcinoma
CN117334325B (en) * 2023-09-26 2024-04-16 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) Application of LCAT in diagnosis, treatment and recurrence prediction of hepatocellular carcinoma
CN117558039A (en) * 2024-01-09 2024-02-13 南京氧富智能医疗科技有限公司 Automatic artery naming model construction and naming method and device
CN117558039B (en) * 2024-01-09 2024-03-15 南京氧富智能医疗科技有限公司 Automatic artery naming model construction and naming method and device

Similar Documents

Publication Publication Date Title
WO2020242976A1 (en) Methods for diagnosis of polygenic diseases and phenotypes from genetic variation
Pei et al. The genetic architecture of appendicular lean mass characterized by association analysis in the UK Biobank study
Gudbjartsson et al. Large-scale whole-genome sequencing of the Icelandic population
Holliday et al. Common variants at 6p21.1 are associated with large artery atherosclerotic stroke
Mata et al. SNCA variant associated with Parkinson disease and plasma α-synuclein level
Gerring et al. Genome-wide DNA methylation profiling in whole blood reveals epigenetic signatures associated with migraine
WO2021022225A1 (en) Methods and systems for detecting microsatellite instability of a cancer in a liquid biopsy assay
Koopmann et al. Genome-wide identification of expression quantitative trait loci (eQTLs) in human heart
US20150356243A1 (en) Systems and methods for identifying polymorphisms
EP2772553B1 (en) Methods for genetic analysis
EP2102651A2 (en) Genetic analysis systems and methods
CN116904572A (en) Compositions and methods for detecting susceptibility to cardiovascular disease
Hobbs et al. Conotruncal heart defects and common variants in maternal and fetal genes in folate, homocysteine, and transsulfuration pathways
US20170137886A1 (en) Physiogenomic method for predicting drug metabolism reserve for antidepressants and stimulants
Zhang et al. Genetic associations between sleep traits and cognitive ageing outcomes in the Hispanic Community Health Study/Study of Latinos
US20140087960A1 (en) Markers Related to Age-Related Macular Degeneration and Uses Therefor
Jordan et al. The landscape of pervasive horizontal pleiotropy in human genetic variation is driven by extreme polygenicity of human traits and diseases
Li et al. Ultra-low-coverage genome-wide association study—insights into gestational age using 17,844 embryo samples with preimplantation genetic testing
Pouget et al. Preliminary insights into the genetic architecture of postpartum depressive symptom severity using polygenic risk scores
Fan et al. Genotype data and derived genetic instruments of Adolescent Brain Cognitive Development Study® for better understanding of human brain development
Williams et al. Genome-wide association study of thyroid-stimulating hormone highlights new genes, pathways and associations with thyroid disease
Forrest et al. Ancestrally and temporally diverse analysis of penetrance of clinical variants in 72,434 individuals
Mortlock et al. An extremes of phenotype approach confirms significant genetic heterogeneity in patients with ulcerative colitis
Warmerdam et al. Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores
Niu et al. Plasma proteome variation and its genetic determinants in children and adolescents

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20812647; Country of ref document: EP; Kind code of ref document: A1)