WO2015193427A1 - Determination and analysis of biomarkers in clinical samples - Google Patents

Info

Publication number
WO2015193427A1
Authority
WO
WIPO (PCT)
Prior art keywords
biomarker
phenotypic
genetic
abundance
disease
Prior art date
Application number
PCT/EP2015/063698
Other languages
French (fr)
Inventor
Ulf Gyllensten
Stefan ENROTH
Original Assignee
Olink Ab
Priority date
Filing date
Publication date
Priority claimed from GBGB1410956.5A (GB201410956D0)
Priority claimed from GB201414913A (GB201414913D0)
Application filed by Olink Ab
Publication of WO2015193427A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention relates to methods for determining the level of a biomarker in a subject. More specifically, the present invention relates to a method for deriving
  • a method of identifying a biomarker e.g. for use in the diagnosis or monitoring of a disease or its treatment in a subject, is also provided.
  • Biomarkers, typically protein biomarkers, are used for diagnosis and management of cancers and other diseases. Examples include prostate-specific antigen (PSA) used to screen for prostate cancer, the ovarian cancer-related tumour marker CA125 and IL-6, which is a drug target in rheumatoid arthritis. Many other biomarkers are in clinical use or have been proposed or are being investigated, either as markers for use in disease detection or diagnosis, to predict responsiveness to a medicament or other therapy, and/or to monitor the progress of a disease and/or its treatment. As well as protein biomarkers, genetic markers have also been identified and investigated. The discovery of putative biomarkers for the early identification and management of cancer and other diseases has been greatly facilitated in recent years by high throughput, genome-wide assays. Gene expression analyses have discovered numerous genes that are differentially expressed between cancerous or diseased tissue and healthy tissue, but few have proven suitable for use as biomarkers, mainly because mRNA levels do not correlate well with protein abundance.
  • Biomarkers used for disease diagnosis or monitoring should ideally be uniquely present or overexpressed in the diseased tissue or blood and not influenced by confounding factors; that is, they should display deviating levels in affected individuals only and be robust to factors unrelated to disease.
  • most current biomarkers have a function in a normal cell, taking part in e.g. signalling pathways, controlling growth, apoptosis and/or inflammation. They are not uniquely expressed in cancerous or diseased tissue. Additionally, the level of these biomarkers may be affected by a number of factors, such as an individual's genetic and physical constitution, lifestyle and medication.
  • biomarkers may in certain conditions be affected by various factors such as medications taken, smoking or age, and that others may be affected by genetic variations present in an individual subject
  • there has not so far been a detailed systematic study of biomarker variation in normal, non-diseased subjects, or of the effects that different non-disease related factors, such as lifestyle, environmental, anthropomorphic and clinical factors, may have on biomarker abundance levels.
  • the present inventors have undertaken such a study to examine the causes of variation in the abundance levels, in a clinical sample, of a set of diverse established or putative biomarkers for different diseases, including cancer, autoimmune diseases and inflammatory conditions.
  • Further, a genome-wide analysis was performed to study the possible effects of genetic variations in the population on the levels of biomarkers in the samples. This study is the first to measure biomarker abundance on a large scale in a general population, using the same technology for all the biomarkers and for all the subjects in the population, to assess contributing factors for normal variation.
  • the present invention thus aims to understand the factors that influence normal variation in levels of a biomarker in a clinical sample, with the goal of determining an individualised normal level of a biomarker for a test subject, to establish a personalised clinical cut-off value that would increase the sensitivity of using biomarkers in clinical practice.
  • the present invention provides a method of determining the effect that any of a number of lifestyle, anthropomorphic, clinical and genetic factors have on the level of a biomarker within a subject. Based on this information an individualised normal level of a biomarker may be derived for an individual subject given that individual's lifestyle, anthropomorphic, clinical and genetic factors, and used to determine an individualised clinical cut-off value for a biomarker in that subject, thereby to enable a more efficient use of a biomarker in personalised disease management.
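  • By way of illustration only, the derivation of an individualised normal level and cut-off from known covariate effects might be sketched as follows. The additive linear form, the `BiomarkerModel` class, the covariate names, the numeric values and the rule "expected level plus z residual standard deviations" are all assumptions made for this sketch; the text above does not specify them.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BiomarkerModel:
    """Hypothetical fitted model: intercept plus additive covariate effects."""
    intercept: float
    effects: Dict[str, float]   # covariate name -> effect per unit
    residual_sd: float          # spread of control levels after adjustment

    def normal_level(self, subject: Dict[str, float]) -> float:
        """Expected (individualised normal) level for this subject."""
        return self.intercept + sum(
            beta * subject[name] for name, beta in self.effects.items()
        )

    def cutoff(self, subject: Dict[str, float], z: float = 1.96) -> float:
        """Upper cut-off: expected level plus z residual standard deviations."""
        return self.normal_level(subject) + z * self.residual_sd


# Illustrative numbers only; they are not taken from the study.
model = BiomarkerModel(
    intercept=2.0,
    effects={"age_years": 0.02, "smoker": 0.5, "snp_minor_alleles": -0.3},
    residual_sd=0.4,
)
subject = {"age_years": 50, "smoker": 1, "snp_minor_alleles": 2}
expected = model.normal_level(subject)
threshold = model.cutoff(subject)
print(f"individualised normal level: {expected:.2f}, cut-off: {threshold:.2f}")
```

  • In this sketch two subjects with different phenotypes or genotypes receive different cut-offs for the same biomarker, which is the sense in which the cut-off is personalised.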
  • the present invention provides a method for determining an individualised normal level of a biomarker for a test subject for use in analysis of said biomarker in the diagnosis or monitoring of a disease or its treatment in said subject, said method comprising:
  • step (b) analysing the control biomarker abundance levels of step (a) with respect to one or more non-disease related phenotypic factors to determine which phenotypic factors have a statistically significant effect on the biomarker abundance levels in said control population thereby to identify phenotypic covariates for said biomarker, and performing a statistical analysis to determine the effect of any such phenotypic covariate(s) identified on the variance of the control abundance levels;
  • step (c) optionally transforming the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, to obtain a normal distribution;
  • step (d) using the normalised residual control abundance level values from step (c), or if normalisation step (c) is not performed the residual control abundance levels from step (b) (which have been adjusted for the effect of the phenotypic covariates), in a step of statistical analysis of genetic data comprising genetic variants identified in said control population to determine whether any one or more non-disease related genetic covariate(s) have an effect on the abundance levels of said biomarker in said control population;
  • step (g) using the model of step (e) to determine a value for a normal level for the abundance of the biomarker in a said sample from the said test subject having said individual phenotype and genotype as determined in step (f), thereby to determine an individualised normal level of the biomarker for said test subject.
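  • As a hedged illustration of the phenotypic adjustment of step (b), the sketch below fits an ordinary least-squares model of control abundance levels on two phenotypic covariates and returns the residuals that would be carried forward. Ordinary least squares, the covariates (age and smoking status) and all data values are assumptions of this sketch; the significance testing used to select covariates is omitted.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients for y ~ 1 + covariates (normal equations)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, row):
    """Expected level for one subject given fitted coefficients."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


# Hypothetical control population: (age in years, smoker flag) per subject.
rows = [(30, 0), (40, 1), (50, 0), (60, 1), (35, 1), (55, 0)]
noise = [0.1, -0.1, 0.05, -0.05, 0.0, 0.0]
levels = [1.0 + 0.02 * a + 0.5 * s + e for (a, s), e in zip(rows, noise)]

beta = fit_ols(rows, levels)
# Residuals: control levels adjusted for the phenotypic covariates.
res = [y - predict(beta, r) for r, y in zip(rows, levels)]
# Expected level for a hypothetical 45-year-old smoker.
individual_normal = predict(beta, (45, 1))
print([round(b, 3) for b in beta], round(individual_normal, 3))
```

  • The residuals `res` are the quantity that a subsequent genetic analysis would use; a full implementation would also need to handle missing data and categorical covariates.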
  • the method of the invention relies on identifying non-disease related phenotypic and genetic factors which have a statistically significant effect on the abundance level of a biomarker.
  • the method provides a model which is capable of adjusting an abundance level of a biomarker in a test subject, thereby to determine an individualised normal level of a biomarker for a test subject once the identified phenotypic and genetic factors identified in steps (b) and (d) have been assessed for the test subject.
  • the model generated by the method of the invention integrates information on the relevant non-disease related phenotypic and/or genetic factors for a biomarker and uses this to determine, or to calculate, an individualised normal level.
  • the method may allow the use of biomarkers for use in the diagnosis or monitoring of a disease or its treatment that were previously not suitable for these purposes, thereby increasing the number of candidate biomarkers which could be used in a clinical setting.
  • a further aspect of the invention is directed to the generation of the model.
  • the present invention provides a method of generating a model which is capable of adjusting an abundance level of a biomarker in a sample of a body tissue or fluid for the effect of phenotypic and/or genetic covariates which affect the level of said biomarker in said sample, said method comprising:
  • control population free from said disease to obtain a set of control abundance levels for said biomarker in a said sample
  • step (b) analysing the control biomarker abundance levels of step (a) with respect to one or more non-disease related phenotypic factors to determine which phenotypic factors have a statistically significant effect on the biomarker abundance levels in said control population thereby to identify phenotypic covariates for said biomarker, and performing a statistical analysis step to determine the effect of any such phenotypic covariate(s) identified on the variance of the control abundance levels;
  • step (c) optionally transforming the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, to obtain a normal distribution;
  • step (d) using the normalised residual control abundance level values from step (c), or if normalisation step (c) is not performed the residual control abundance levels from step (b) (which have been adjusted for the effect of the phenotypic covariates), in a step of statistical analysis of genetic data comprising genetic variants identified in said control population to determine whether any one or more non-disease related genetic covariate(s) have an effect on the abundance levels of said biomarker in said control population; and (e) generating a model which is capable of adjusting an abundance level of a said biomarker in a said sample for the effect of the phenotypic and/or genetic covariates identified in steps (b) and (d).
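  • Steps (c) and (d) might, for example, be sketched as below. The rank-based inverse normal transformation and the single-variant linear regression on allele dosage (0, 1 or 2 copies) are assumptions chosen for this illustration, since the text above requires only a transformation to a normal distribution and a statistical analysis of genetic variants. The residuals and genotypes are invented, and no p-value is computed because the standard library has no t-distribution.

```python
from math import sqrt
from statistics import NormalDist, mean


def inverse_normal_transform(values):
    """Map values to normal quantiles by rank (ties are not averaged)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    std_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        out[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return out


def dosage_regression(dosages, phenotype):
    """Slope and t-statistic of phenotype regressed on allele dosage."""
    n = len(dosages)
    mx, my = mean(dosages), mean(phenotype)
    sxx = sum((x - mx) ** 2 for x in dosages)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, phenotype))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2
              for x, y in zip(dosages, phenotype))
    std_err = sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_err


# Hypothetical residuals from the phenotypic adjustment of step (b).
residual_levels = [0.3, -0.2, 1.5, -0.7, 0.1, 0.9, -1.1, 0.4]
# Hypothetical minor-allele counts for one variant in the same subjects.
dosage = [1, 0, 2, 0, 1, 2, 0, 1]

z = inverse_normal_transform(residual_levels)
slope, t_stat = dosage_regression(dosage, z)
print(f"effect per allele: {slope:.3f}, t = {t_stat:.2f}")
```

  • In a genome-wide setting this regression would be repeated for every variant, with a multiple-testing threshold applied before a variant is treated as a genetic covariate.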
  • the present invention also provides a method for determining an individualised normal level of a biomarker for a test subject for use in analysis of said biomarker in the diagnosis or monitoring of a disease or its treatment in said subject, said method comprising:
  • step (ii) using a model obtained according to the model generation method above to determine a value for a normal level for the abundance of the biomarker in a said sample from the said test subject having said individual phenotype and genotype as determined in step (f), thereby to determine an individualised normal level of the biomarker for said test subject.
  • the step of generating a model may be repeated in the context of an individual test subject, or group of test subjects, for example a group of subjects with a disease.
  • the same panel of covariates may be used, or a smaller subset of the covariates or indeed a selected subset or panel of covariates may be used, for example based on covariates common to the control population and the disease group.
  • the present invention provides a method of detecting a biomarker in a test subject, said method comprising:
  • a yet further aspect of the invention provides a method of diagnosing or monitoring a disease, or the treatment thereof, in a subject, said method comprising detecting the presence of a biomarker in said subject using the hereinbefore defined detection method.
  • model provided by the invention may be used to identify new biomarkers (e.g. to determine whether a new or known biomarker (e.g. a new or known protein) is useful as a biomarker for a particular disease) and/or confirm or establish the utility of putative or candidate biomarkers for a particular disease.
  • a candidate or putative biomarker would be useful as a biomarker for a particular disease by comparing the adjusted value(s) of the abundance level of the candidate or putative biomarker derived from a subject (or population of subjects) with a disease to the adjusted value(s) of the abundance level of the candidate or putative biomarker derived from a subject (or population of subjects) free from the disease, i.e. a control subject or population.
  • a difference between the adjusted levels i.e. an increase or decrease
  • a statistically significant difference as defined below may be indicative that the biomarker would find utility in the diagnosis or monitoring of the disease or its treatment in a subject.
  • the analysis of a control subject or population need be performed only once, to identify the phenotypic and/or genetic covariates, and to analyse their effect, i.e. to determine the adjusted value for the abundance level of the biomarker in a control subject or population.
  • the step of adjusting the value(s) for the abundance level of a biomarker in a control subject or population may be repeated, e.g. in the context of an individual disease or candidate biomarker.
  • a further aspect of the invention provides a method of identifying a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject, said method comprising:
  • the presence of a difference between said adjusted values identifies the candidate biomarker as a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject.
  • the invention may be seen to provide a method of identifying a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject, said method comprising:
  • step (d') using the model obtained according to the invention as hereinbefore defined to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and genotypic covariates identified in step (b');
  • step (e') comparing said adjusted value of step (d') to the set of adjusted values of step (b') for the biomarker in a sample from a subject free from said disease; and (f) determining whether there is a difference between said adjusted value from the subject with said disease, and the adjusted values from the control population; wherein the presence of a difference between said adjusted values identifies the candidate biomarker as a biomarker for use in the diagnosis or monitoring of said disease or its treatment in a subject.
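  • The comparison of adjusted values between a disease group and the control population could, under stated assumptions, be sketched as follows. A two-sided permutation test on the difference in means is used here purely as an example of a significance test; the function name, the adjusted values and the number of permutations are hypothetical and are not taken from the text above.

```python
import random


def permutation_p_value(cases, controls, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    k = len(cases)

    def abs_mean_diff(values):
        first, second = values[:k], values[k:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(cases) + list(controls)
    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance so rounding does not hide an equal difference.
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical covariate-adjusted abundance values.
disease_adjusted = [1.2, 1.5, 0.9, 1.8, 1.1, 1.4]
control_adjusted = [0.1, -0.3, 0.2, 0.0, -0.1, 0.3, -0.2, 0.1]

p = permutation_p_value(disease_adjusted, control_adjusted)
print(f"permutation p-value: {p:.4f}")
```

  • A small p-value in such a test would correspond to the "difference between said adjusted values" that identifies the candidate as a biomarker; the significance threshold itself is left open here.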
  • the use of the biomarker in the diagnosis or monitoring of a disease or its treatment in a subject preferably does not form part of (i.e. does not form a step in) the method of identifying a biomarker described above. It will be appreciated that for any given biomarker different phenotypic and genetic covariates may be identified and will be used. Although we have found that in many cases both phenotypic and genetic covariates are identified and therefore both types of covariate are used in the methods of the invention, in some cases only phenotypic or only genetic covariate(s) will be identified for any given biomarker. Thus, in such a case only one or more phenotypic or only one or more genetic covariates are used in the model generation and individual test subject assessment steps.
  • the methods of the invention are not limited to analysing single biomarkers, or one biomarker at a time, and one or more biomarkers may be analysed or assessed or identified.
  • the methods may be performed using a combination of two or more biomarkers. It is known in this regard that in some cases combinations of markers may be used together, and that such combinations may improve biomarker-based predictions.
  • a model may be generated for each biomarker in such a combination to correct for the effects of the covariates identified for that biomarker, and individualised levels determined for each biomarker and used in combination.
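  • As a non-limiting sketch of using individualised levels in combination, each biomarker in a panel can be expressed as a deviation from its own individualised normal level and the deviations considered together. The marker names, the values, the z-score form and the 1.96 threshold are assumptions for illustration; the text above does not prescribe a combining rule.

```python
def individual_z(measured, expected, residual_sd):
    """Deviation of a measured level from the individualised normal level."""
    return (measured - expected) / residual_sd


# Hypothetical panel: one model output per biomarker for a single subject.
panel = {
    "marker_A": {"measured": 3.9, "expected": 2.9, "residual_sd": 0.5},
    "marker_B": {"measured": 1.0, "expected": 1.2, "residual_sd": 0.4},
    "marker_C": {"measured": 7.5, "expected": 6.0, "residual_sd": 0.5},
}

z_scores = {name: individual_z(**values) for name, values in panel.items()}
# Markers whose deviation exceeds the (assumed) two-sided threshold.
flagged = sorted(name for name, z in z_scores.items() if abs(z) > 1.96)
# One simple combined summary: the mean deviation across the panel.
combined = sum(z_scores.values()) / len(z_scores)
print(flagged, round(combined, 2))
```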
  • a combination may comprise 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 40, or 50 or more biomarkers, or alternatively up to any one of the aforementioned integers.
  • biomarkers of the present invention may include any type of molecule that can be detected in any clinical sample and used as a biomarker.
  • the biomarker can be any molecule that occurs in the body. It may for example be a protein or peptide or any molecule comprising a protein or peptide (hereinafter termed a proteinaceous molecule), a lipid or lipid-containing molecule, e.g. a fatty acid, steroid, lipoprotein, nucleic acid, carbohydrate, e.g. glycan or sugar.
  • the biomarker is a proteinaceous molecule, for instance any protein complex, soluble or insoluble protein, polypeptide or peptide; the terms “protein” and “proteinaceous” are used broadly herein to include proteins, polypeptides and peptides i.e. any molecule comprising amino acids linked by amide bonds, regardless of size.
  • the protein biomarker can play any functional or structural role in the body. Thus it may include a signalling peptide, pro-peptide, proteolysis product or hormone, blood protein, cytokine, antibody, lectin, selectin, connective tissue protein or indeed any structural protein, cell receptor, membrane protein, enzyme (e.g. kinase, phosphatase or protease), prion protein, apoptosis factor, or a protein involved in DNA replication or repair or regulation of gene expression (e.g. a transcription factor).
  • blood proteins include albumins, globulins, fibrinogen, regulatory proteins and clotting factors.
  • Globulins may include Alpha 1 globulins, Alpha 2 globulins, Beta globulins (such as beta-2-microglobulin, plasminogen, angiostatins, properdin, sex hormone binding globulin and transferrin) and Gamma globulins, which may include immunoglobulins, which may be IgA, IgD, IgE, IgG and IgM antibodies, or immunoglobulin heavy chain, immunoglobulin light chain, portions or fragments thereof or immunoglobulin domains.
  • Antibodies directed to specific antigens may be used a
  • examples of cytokines include chemokines, Tumour Necrosis Factors (TNFs) and interleukins.
  • Classes of chemokines include CCL proteins (for example CCL1, CCL2/MCP-1, CCL3/MIP-1α, CCL4/MIP-1β, CCL5/RANTES, CCL6, CCL7, CCL8, CCL9, CCL11, CCL12, CCL13, CCL14, CCL15, CCL16, CCL17, CCL18/PARC/DC-CK1/AMAC-1/MIP-4, CCL19, CCL20, CCL21, CCL22, CCL23, CCL24, CCL25, CCL26, CCL27, CCL28), CXCL proteins (for example CXCL1/KC, CXCL2, CXCL3, CXCL4, CXCL5, CXCL6, CXCL7, CXCL8/IL8, CXCL9, CXCL10, CXCL11,
  • TNF proteins may include TNF (formerly TNF-α), Lymphokines (TNFB/LTA or TNFC/LTB), TNFSF4, TNFSF5/CD40LG, TNFSF6, TNFSF7, TNFSF8 (also known as CD30-L), TNFSF9,
  • Interleukins may include both type I and type II interleukins.
  • Type I interleukins may include ST2, IL2, IL15, IL4, IL13, IL7, IL9, IL21, IL3, IL5, GM-CSF, IL6, IL11, IL27, IL30, IL31, IL12, IL-12B, IL23.
  • Type II interleukins may include IL-10 family interleukins (IL-10, IL-19, IL-20, IL-22, IL-24 and IL-26) and interferons, including IFNA1, IFNA2, IFNA4, IFNA5, IFNA6, IFNA7, IFNA8, IFNA10, IFNA13, IFNA14, IFNA16, IFNA17, IFNA21, IFNB1, IFNK, IFNW1 and IFN-γ.
  • Cytokines may also include Macrophage colony stimulating factor 1 (CSF-1), IL-1ra, TNFSF14, Kit ligand (SCF), Fms-related tyrosine kinase 3 ligand (FLT3LG) and TNF-related apoptosis-inducing ligand (TRAIL).
  • examples of enzymes include carbonic anhydrase 9 (CAIX), Thiopurine
  • UDP-glucuronosyltransferase 1-1
  • SIRT2 NAD-dependent deacetylase sirtuin-2
  • ECP Eosinophil Cationic Protein
  • Enzymes may also include proteases, such as stromelysin-1 (MMP-3), Matrix metalloproteinase-1 (MMP-1), Matrix metalloprotease-7 (MMP-7), Matrix metalloproteinase-10 (MMP-10), Matrix metalloproteinase-12 (MMP-12), caspase-3 (CASP-3), caspase-8 (CASP-8), Kallikrein-6 (KLK6), Kallikrein-11 (hK11), Cathepsin-D (CTSD), Cathepsin L1, prostasin (PRSS8), Renin, Tissue plasminogen activator (tPA or PLAT), Pappalysin-1 (PAPPA), prostate-specific antigen (PSA), Membrane-bound aminopeptidase P and tartrate-resistant acid phosphatase type 5 (TR-AP).
  • Protease inhibitors such as WAP four-disulphide core domain protein 2 (WFDC2), metallopeptidase inhibitor 1 (TIMP1), and Cystatin-B (CPI-B) may also be detected in the method of the present invention.
  • Enzymes may also include kinases, for example B-Raf, mitogen-activated protein kinases and FIP1L1-PDGFR alpha kinase.
  • examples of cell surface proteins include CD40, CD40-L (also known as CD154),
  • Tumor necrosis factor ligand superfamily member 6 FasL
  • FLT-3 Fms-related tyrosine kinase 3
  • TF Tissue Factor
  • Cell surface proteins may also include receptors, such as Estrogen Receptor (ER), progesterone receptor (PR), HER2, Angiopoietin receptors TIE1 and TIE2, Basigin, Receptor for Advanced Glycation End products (RAGE), Proto-oncogene tyrosine-protein kinase Src, LOX-1, Protease activated receptors (PAR-1, PAR-2 and PAR-3), Hepatocyte Growth Factor Receptor (HGF-R), TNF-R1, TNF-R2, Interleukin-6 receptor subunit alpha (IL-6RA), MHC class I polypeptide related sequence A (MIC-A), Interleukin-17 receptor B (IL-17RB), Interleukin-2 receptor subunit A (IL-2RA), Interleukin-6 receptor subunit A (IL-6RA), Epidermal growth factor receptor (EGF-R), Receptor tyrosine-protein kinase erbB-2 (ErbB2), Receptor tyrosine-protein kinase erbB-3 (ErbB3), Receptor tyrosine-protein kinase erbB-4 (ErbB4), Plate
  • a cell surface protein may be a cell adhesion molecule, for instance carcinoembryonic antigen-related cell adhesion molecule 5 (CEA).
  • E-selectin also known as CD62E, ELAM-1 or LECAM2
  • PSGL-1 Selectin P ligand
  • PECAM-1 Platelet endothelial cell adhesion molecule
  • Ep-CAM Epithelial cell adhesion molecule
  • Certain cell surface proteins are also known to be antigens that can be detected as markers for cancer, such as CA242, CD30 and mucin-16 (MUC-16/CA125). Cell surface proteins can also be cleaved from the cell membrane, and thus be detected as soluble proteins in a blood or tissue sample. Other soluble proteins that can act as markers for cancer include Human epididymis protein 4 (HE4),
  • the biomarker may also be a lectin. These may include Regenerating islet-derived protein 4 (REG-4), CD69 and galectin-3 (Gal-3).
  • connective tissue proteins include collagens (including collagen I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI, XXII, XXIII, XXIV, XXV, XXVI, XXVII and XXVIII), elastin, fibrillins (including fibrillin 1, 2, 3 and 4), fibulins (including fibulin 1, 2, 3, 4, 5, and 7, and HMCN1), latent transforming growth factor binding proteins (LTBPs) (including LTBP 1, 2, 3 and 4), perlecans, elaunin and oxytalan.
  • Hormones that may be biomarkers include Adrenomedullin (ADM), Agouti-related peptide (AgRP), Erythropoietin (EPO), follistatin (FS), prolactin (PRL), Amylin, Anti-Müllerian hormone, adiponectin, adrenocorticotropic hormone, angiotensin, antidiuretic hormone (ADH), atrial-natriuretic peptide, brain natriuretic peptide, calcitonin, cholecystokinin, corticotropin-releasing hormone, enkephalin, endothelin, Follicle-stimulating hormone (FSH), galanin, gastrin, ghrelin, gonadotropin-releasing hormone, growth-hormone releasing hormone, hepcidin, human chorionic gonadotropin, human placental lactogen, inhibin, insulin-like factor, insulin, leptin, lipotropin
  • Growth factors may also be biomarkers in the methods of the present invention.
  • growth factors include epiregulin (EPR), betacellulin (BTC), Vascular endothelial growth factor A (VEGF-A), Vascular endothelial growth factor D (VEGF-D),
  • EGF Epidermal growth factor
  • MIA Melanoma inhibitory activity protein
  • AR amphiregulin
  • GDF-15 growth differentiation factor 15
  • GH Growth hormone
  • EGF-like growth factor HGF
  • HGF Hepatocyte growth factor
  • NGF Nerve growth factor
  • Beta-nerve growth factor Beta-NGF
  • Midkine connective tissue growth factor
  • CGF connective tissue growth factor
  • PDGF subunit B Platelet-derived growth factor subunit B
  • PIGF Placenta growth factor
  • TGF-β1 Transforming growth factor beta-1
  • TGF-α Protransforming growth factor alpha
  • FGF23 Fibroblast growth factor 23
  • Various other proteins may also be biomarkers in the methods of the present invention.
  • intracellular proteins involved in cell signalling pathways such as Myeloid differentiation primary response protein MyD88 (MYD88) and Fatty Acid binding protein (adipocyte) (FABP4) may be tested.
  • Heat shock proteins such as HSP-27
  • DKK1 Dickkopf-related protein 1
  • LAP latency-associated peptide
  • ESM-1 Endothelial cell-specific molecule 1
  • myoglobin haemoglobin
  • UGT1A1 KRAS
  • p53
  • BRCA1
  • BRCA1 p16
  • CDKN2B p14ARF
  • MYOD1
  • CDH1 CDH13
  • S100 proteins such as Protein S100-A12 (EN-RAGE)
  • TM Thrombomodulin
  • PTX3 Pentraxin-related protein PTX3
  • cytochrome c nucleosomes
  • F-spondin also known as SPON-1
  • NF-kappa-B essential modulator NEMO
  • the method of the present invention may also be used in the detection of plaque proteins, including amyloid protein and tau protein. It may also be desirable to determine the abundance levels of isoforms of apolipoprotein in a test subject according to the present invention. Peptides such as galanin may also be detected.
  • the biomarker may be selected from the list of biomarkers investigated in Example 1 below (see Table 4).
  • the biomarker may be selected from the list of biomarkers investigated in Example 2 (see Table 6).
  • the biomarker may be selected from the list of biomarkers investigated in Example 4 (see Table 9).
  • a subject or test subject and hence a control subject (that is a subject in a control population), may be any human or non-human animal subject, but particularly will be any mammalian organism.
  • the subject will be a human, but other subjects or test subjects may be domestic or livestock animals, zoo animals, horses etc.
  • the subjects in the control population will be the same as the subject or test subject (in the sense of same species etc.).
  • a sample obtained from any bodily fluid or tissue may be used in the methods of the present invention.
  • the sample may thus be any clinical sample. It may thus be any sample of body tissue, cells or fluid, e.g. a biopsy sample, or any sample derived from the body, e.g. a swab, washing, aspirate or rinsate etc.
  • Suitable clinical samples include, but are not limited to, blood, serum, plasma, blood fractions, joint fluid, urine, semen, saliva, faeces,
  • the clinical sample is blood or a blood-derived sample, e.g. serum or plasma or a blood fraction.
  • the present invention may be used to determine an individualised normal level of a biomarker for a test subject, or identify a biomarker, that is associated with the diagnosis or monitoring of any known disease or its treatment.
  • the disease may include any known clinical condition, syndrome or disorder, including clinical conditions or states which precede or presage overt or symptomatic disease, including notably for example cancer, autoimmune disease, neurological disorders e.g. neurodegenerative diseases, infectious disease, inflammation or any inflammatory disease or condition, connective tissue diseases, cardiovascular diseases or conditions or endocrine disorders.
  • the disease is a non-communicable disease.
  • Representative cancers include Acute Lymphoblastic Leukaemia (ALL), Acute Myeloid Leukaemia (AML), Adrenocortical Carcinoma, AIDS-Related Cancer (e.g. Kaposi Sarcoma and Lymphoma), Anal Cancer, Appendix Cancer, Astrocytomas, Atypical
  • Neuroendocrine Tumours Kaposi Sarcoma, Kidney Cancer (including Renal Cell and Wilms Tumour), Langerhans Cell Histiocytosis, Laryngeal Cancer, Leukaemia (including Acute Lymphoblastic (ALL), Acute Myeloid (AML), Chronic Lymphocytic (CLL), Chronic ALL
  • LCIS Carcinoma In Situ
  • Lymphoma Macroglobulinemia
  • Waldenstrom Melanoma
  • Merkel Cell Carcinoma Mesothelioma
  • Metastatic Squamous Neck Cancer with Occult Primary Midline Tract Carcinoma Involving NUT Gene
  • Mouth Cancer Multiple Endocrine Neoplasia Syndromes, Childhood, Multiple Myeloma/Plasma Cell Neoplasm, Mycosis Fungoides, Myelodysplastic Syndromes, Myelodysplastic/Myeloproliferative
  • Neoplasms, Multiple Myeloma, Myeloproliferative Disorders, Nasal Cavity and Paranasal Sinus Cancer, Nasopharyngeal Cancer, Neuroblastoma, Non-Hodgkin Lymphoma, Non-Small Cell Lung Cancer, Oral Cancer, Oral Cavity Cancer, Oropharyngeal Cancer,
  • Osteosarcoma, Ovarian Cancer, Pancreatic Cancer, Pancreatic Neuroendocrine Tumours (Islet Cell Tumors), Papillomatosis, Paraganglioma, Paranasal Sinus and Nasal Cavity Cancer, Parathyroid Cancer, Penile Cancer, Pharyngeal Cancer, Pheochromocytoma, Pituitary Tumor, Plasma Cell Neoplasm/Multiple Myeloma, Pleuropulmonary Blastoma, Pregnancy and Breast Cancer, Primary Central Nervous System (CNS) Lymphoma,
  • Prostate Cancer, Rectal Cancer, Renal Cell (Kidney) Cancer, Renal Pelvis and Ureter, Transitional Cell Cancer, Retinoblastoma, Rhabdomyosarcoma, Salivary Gland Cancer, Sarcoma, Sezary Syndrome, Skin Cancer, Small Cell Lung Cancer, Small Intestine Cancer, Soft Tissue Sarcoma, Squamous Cell Carcinoma, Squamous Neck Cancer with Occult Primary, Metastatic, Stomach (Gastric) Cancer, T-Cell Lymphoma, Testicular Cancer, Throat Cancer, Thymoma and Thymic Carcinoma, Thyroid Cancer, Transitional Cell Cancer of the Renal Pelvis and Ureter, Urethral Cancer, Uterine Cancer, Endometrial, Uterine Sarcoma, Vaginal Cancer, Vulvar Cancer, Waldenstrom Macroglobulinemia, and Wilms Tumour.
  • autoimmune disease examples include rheumatoid arthritis, Graves' disease, Crohn's disease, autoimmune asthma, Addison's disease, motor neurone disease, multiple sclerosis, diabetes mellitus type 1, lupus, eczema, rheumatic fever, thrombocytopenia, urticarial vasculitis and vasculitis.
  • neurological disorders include dementia, e.g. Alzheimer's disease, Parkinson's disease, Creutzfeldt-Jakob disease (CJD), cerebral palsy, motor neurone disease, aneurysm, stroke (e.g. ischemic stroke), Ataxia telangiectasia (A-T), leukodystrophy, Huntington's disease, Pick's disease, Dawson disease, Guillain-Barré syndrome (GBS) and Wilson's disease.
  • infectious disease examples include any kind of infection in any tissue or region of the body, but notably sepsis, pneumonia, meningitis, typhus, tuberculosis, gastroenteritis, cellulitis and urinary tract infections.
  • Diseases can also include viral infections, for example infections with HIV, HBV, measles, influenza, and viral meningitis.
  • connective tissue disease examples include Marfan syndrome, osteogenesis imperfecta, osteoarthritis, osteoporosis, rickets and scurvy.
  • cardiovascular disease examples include cardiac conditions e.g. angina, myocardial infarction, heart failure, cardiomyopathy, atherosclerosis, coronary heart disease, hypertension, cardiac dysrhythmias, endocarditis, myocarditis and rheumatic heart disease.
  • Examples of endocrine disorders include Addison's disease, Adrenocortical carcinoma, Type 1 and Type 2 diabetes, gestational diabetes, hyperthyroidism, hypothyroidism, thyroidosis, diabetes insipidus, hypopituitarism and hypogonadism.
  • Diagnosis may be viewed as the process of identifying a subject's medical condition that allows decisions or choices to be made about the subject's treatment.
  • diagnosis may include identifying the disease in a subject.
  • a biomarker may also be used in prognosis, for example to predict the progress or development of a disease, or to predict its response to treatment e.g. to a given therapy, for example to determine which therapeutic intervention (e.g. of a number of possible options) may be effective, or may work best (or be expected to work best) or be most appropriate. Responsiveness to treatment may also be monitored.
  • the aim of the invention is to determine an individualised normal level of a biomarker for a test subject.
  • a "normal" level is a level that may be expected, or may occur, or be determined in a test or control subject in the absence of the disease in question. "Normal" thus indicates absence of disease (specifically the disease under investigation).
  • the normal level e.g. from a control subject(s)
  • the adjusted or corrected normal level may thus be viewed as a clinical cut-off value which may be used to distinguish between the presence or absence of disease, for example to distinguish a test subject having the disease from a control subject not having the disease.
  • An individualised normal level is adjusted or corrected for the phenotypic and/or genetic factors particular to that test subject, and thus represents a personalised clinical cut-off value.
  • a level for the biomarker that is then determined for a test subject may thus be adjusted, or corrected, for the phenotypic and/or genetic covariates and compared to the individualised normal value for that test subject (i.e.
  • biomarker level as determined for the test subject
  • that biomarker level is deviant from the individualised normal level, and hence indicative of disease, disease status or prognosis etc. (e.g. treatment responsiveness or non-responsiveness, or disease progression etc.).
  • a biomarker may be detected in more than one type of clinical sample from a subject.
  • the biomarker levels may vary depending on the sample.
  • the clinical sample from the control subjects should be the same sample (i.e. same type of sample) as for the test subject, e.g. the subject suspected of having the disease or a subject with the disease.
  • the step of determining the abundance level of a biomarker in a sample may be performed by any means in the art.
  • the abundance level is simply a measure or indication of the level or amount of a biomarker in a given clinical sample, and may be assessed or determined in different ways, for example as an absolute or total amount of biomarker present, or as a concentration or ratio etc., or any other indication of level or amount.
  • the same measure or indicator or assessment of level or amount as is determined in step (a) of the method for determining the individualised level, that is, the step of determining the biomarker level in samples from the control population
  • the same means of determining abundance level be used for the control population and for the test subject, e.g. the subject suspected of having the disease or a subject with the disease.
  • a variety of different quantitative or semi-quantitative detection assays are known in the art for detecting different molecules, including any biomolecules that may be biomarkers. Any number of separation techniques may be used to determine the abundance level of a biomarker in a sample, for instance high performance liquid chromatography (HPLC), liquid chromatography, gel electrophoresis, or blotting techniques. Mass spectroscopy techniques, such as MALDI-TOF, ESI-MS or Tandem-MS may also be used, and may be combined with chromatographic techniques for sample analysis.
  • the abundance level of a biomarker may also be determined directly using an enzyme-based assay.
  • For example, spectrophotometric, fluorometric, calorimetric, chemiluminescent or radiometric assays may be used in conjunction with suitable cofactors or substrates known in the art.
  • Similar functional tests may be used to determine the amount or level of any biomarker having a functional effect (e.g. a biological effect) that can be determined in a functional assay.
  • biomarker level may be determined by using an affinity binding partner to bind to the biomarker (i.e. an affinity reagent for the biomarker).
  • This may allow a biomarker to be separated from a sample, although such a separation step is not always necessary, and the biomarker may be detected and quantified (i.e. measured or the level determined) by determining the amount of affinity reagent bound.
  • antibody is used broadly herein to include any type of antibody, antibody fragments and antibody derivatives, including synthetic antibodies such as single chain antibodies, CDR-grafted antibodies, chimeric antibodies etc.
  • an immunoassay using an antibody is a preferred means of performing the biomarker level determination steps.
  • affinity reagents are known and used, including notably other proteinaceous affinity reagents such as lectins, receptors, including immunological molecules such as T-cell receptors or antigen-binding molecules derived therefrom, and synthetic molecules such as affibodies, and proteins or peptides which may be identified by screening methods known in the art, e.g. by phage or other peptide display techniques.
  • affinity binding molecules include nucleic acids, e.g. aptamers, or other oligonucleotides capable of binding to a target molecule. Procedures for obtaining and identifying such nucleic acid based affinity reagents are known, e.g. Selex procedures etc.
  • Such immunoassay or analogous affinity reagent-based assays may be performed in various formats according to procedures and principles well known in the art, e.g. in solution (homogenous formats) or in solid phase-based formats, e.g. sandwich assays, competitive assays etc.
  • ELISA, immunoPCR and immunoRCA assays.
  • the antibody or other affinity reagent may be detected in various ways, typically by labelling the reagent, directly or indirectly, with a signal-giving label (which may be detected directly or indirectly) or reporter molecule.
  • a wide variety of such labels and reporters and assays based on them are known in the art and any of these could be used.
  • labels may include simple colorimetrically or otherwise spectrophotometrically detectable labels (e.g. fluorescent labels), or any label which can be detected by any other means, e.g. detectable isotopes (e.g. radiolabels), colloidal materials, particles, quantum dots etc.
  • the reagents may be enzymically labelled or with enzyme substrates and the products of enzymic reactions may be detected.
  • the label may comprise a nucleic acid molecule which may be detected, most typically amplified and detected e.g. as in
  • a proximity assay may be used in the biomarker detection step(s).
  • a proximity assay relies on the principle of "proximity probing", wherein an analyte is detected by the binding of multiple (i.e. two or more, generally two or three) probes, which when brought into proximity by binding to the analyte allow a signal to be generated.
  • At least one of the proximity probes comprises a nucleic acid domain (or moiety) linked to the analyte-binding domain (or moiety) of the probe, and generation of the signal involves an interaction between the nucleic acid moieties and/or a further functional moiety which is carried by the other probe(s).
  • the signal generation is dependent on an interaction between the probes (more particularly between the nucleic acid or other functional moieties/domains carried by them) and hence only occurs when the necessary two (or more) probes have bound to the analyte, thereby lending improved specificity to the detection system.
  • Proximity probes of the art are generally used in pairs, and individually consist of an analyte-binding domain with specificity to the target analyte, and a functional domain, e.g. a nucleic acid domain coupled thereto.
  • the analyte-binding domain can be for example a nucleic acid "aptamer" (Fredriksson et al (2002) Nat Biotech 20:473-477) or can be proteinaceous, such as an antibody (Gullberg et al (2004) Proc Natl Acad Sci USA
  • the respective analyte-binding domains of each proximity probe pair may have specificity for different binding sites on the analyte, which analyte may consist of a single molecule or a complex of interacting molecules, or may have identical specificities, for example in the event that the target analyte exists as a multimer.
  • proximity probing has been developed in recent years and many assays based on this principle are now well known in the art, and many variations of proximity probe based assays exist, any of which can be used in the method of the present invention to determine the abundance level of a biomarker in a sample.
  • PLA proximity ligation assays
  • RCA rolling circle amplification
  • PEA proximity extension assays
  • the ligation or extension products generated in such assays may be detected, e.g.
  • a PEA is used, especially preferably a PEA wherein the extension product is determined by a quantifiable PCR method, thereby to quantify the biomarker in the sample.
  • proximity assays are described in WO 01/61037, US 6,511,809 and WO 2006/137932 and both heterogeneous (e.g. where the analyte is first immobilised to a solid substrate by means of a specific analyte-binding reagent) and homogeneous (i.e. in solution) formats for proximity probe based assays have been disclosed, e.g. WO 01/61037, WO 03/044231, WO 2005/123963, Fredriksson et al (2002) Nat Biotech 20:473-477 and Gullberg et al (2004) Proc Natl Acad Sci USA 101:8420-8424.
  • Although pairs of proximity probes are generally used, modifications of the proximity-probe detection assay have been described, e.g. in WO 01/61037, WO 2005/123963 and WO 2007/107743, where three proximity probes are used to detect a single analyte molecule.
  • PEA assay formats which may be used in the present invention are described in WO 2012/104261.
  • the abundance level of a biomarker may conveniently be determined by any of the above detection assays in a multiplexed format.
  • the multiplexed detection assay is a solid phase detection assay, and in a most preferred embodiment of the present invention is a multiplexed proximity extension assay. For example multiplexed sample detection may take place on an array.
  • microarray detection of a biomarker in a sample may be utilised, for example using the Olink Proseek Multiplex Oncology I 96x96 kit or the Olink Proseek Multiplex CVD I 96x96 kit.
  • In step (a) of the method of the present invention, the abundance level of a biomarker is determined in a control population in order to obtain a set of control abundance levels for said biomarker in a given or selected clinical sample in said control population.
  • the control population is typically a population of healthy subjects, more particularly subjects from the same species as the test subject.
  • the control population is free from the disease for which the test subject is being diagnosed, monitored or treated (or for which the candidate or putative biomarker is being tested or identified), in order for non-disease-related phenotypic and genetic covariates to be identified.
  • the size of the control population may vary depending on the biomarker, disease and available samples etc.
  • a minimum population size is usually in the order of 10 control subjects, and may in some cases be less, e.g. at least 5, 6, 7, 8, or 9, but typically will be at least 20, 30, 40, 50, 60, 70, 80, 90 or 100 control subjects, more typically at least 200, 300, 400 or 500 control subjects.
  • a minimum population size may also be at least 600, 700, 800, 900 or 1,000 control subjects, or at least 2,000, 5,000, 10,000, 20,000, 50,000 or 100,000 control subjects.
  • a control population may also be more than 100,000 control subjects.
  • the abundance level of a candidate biomarker is determined in a population of subjects with the disease in order to determine the adjusted value of the level of the candidate biomarker in subjects with the disease.
  • the size of the population may be defined as described above.
  • In step (b), the control abundance levels are analysed to determine which phenotypic factors, if any, have an effect on the normal biomarker level, that is the level of biomarker in a normal (in the sense of non-diseased) subject, i.e. which factors, if any, contribute to any variance observed, and the extent of such a contribution.
  • This analysis is performed using standard statistical analysis techniques, as described in more detail below.
  • One or more non-disease related phenotypic factors may be assessed with respect to biomarker abundance levels in the method of the present invention.
  • a phenotypic factor is any possible variable which may affect a subject, whether related to the individual subject(s) themselves, or to the study.
  • phenotypic factors may include anthropomorphic characteristics, clinical parameters including medication, lifestyle factors and sample round.
  • Anthropomorphic characteristics that may be assessed in the method of the present invention include age, gender, and size-related characteristics, including height, weight, hip size and waist size. Combinations of anthropomorphic characteristics or ratios thereof, such as hip-to-waist ratio or body-mass index (BMI) may be assessed in order to generate further phenotypic factors for use in the statistical analysis steps of the present invention.
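The derived phenotypic factors mentioned above (body-mass index and hip-to-waist ratio) can be sketched as follows. This is a minimal illustration only: the function names, units (kilograms, metres, centimetres) and the example subject are assumptions and are not taken from the source.

```python
def body_mass_index(weight_kg, height_m):
    """BMI: weight in kilograms divided by the square of height in metres."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def hip_to_waist_ratio(hip_cm, waist_cm):
    """Ratio of hip size to waist size (both in the same unit)."""
    if waist_cm <= 0:
        raise ValueError("waist size must be positive")
    return hip_cm / waist_cm


# Hypothetical control subject, used only to show how derived factors
# could be appended to the measured ones before statistical analysis.
subject = {"weight_kg": 70.0, "height_m": 1.75, "hip_cm": 100.0, "waist_cm": 80.0}
derived = {
    "bmi": body_mass_index(subject["weight_kg"], subject["height_m"]),
    "hip_to_waist": hip_to_waist_ratio(subject["hip_cm"], subject["waist_cm"]),
}
print(derived)
```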
  • Clinical parameters that may be assessed in the method of the present invention include any parameter of clinical status or health of a subject, for example blood pressure (including systolic and/or diastolic blood pressure), blood group, one or more organ function tests, e.g. lung function, liver function, kidney function, heart function, neurological function, bone density, levels of test analytes, e.g. blood lipids (including cholesterol (i.e. VLDL, IDL, LDL and/or HDL levels) and blood fatty acids), metabolite levels e.g. in various samples, or allergens, or any combination thereof.
  • Clinical parameters may also include any medications that an individual is receiving, for example statins, insulin, chemotherapeutic agents, antibiotics, antivirals, antibody therapy, asthma medication, immunosuppressants (including steroids), blood pressure medication or medications for the treatment of any disease and painkillers. It may also be assessed whether or not a subject is pregnant.
  • it may be desirable to assign test subjects to a particular blood group, according to the ABO system.
  • the A/B/O status of a test subject may be determined. It may also be desirable to subtype a test subject of the A group into A1 and A2 and the O group into O01 and O02.
  • the blood group of a test subject may be determined by conventional (i.e. serological) testing, or by genetic testing, as many different alleles associated with A/B/O status are known in the art.
  • a genetic test for identifying the blood group of a test subject may identify SNPs, insertions or deletions within the ABO gene. SNPs known in the art include rs505922, rs8176746, rs8176704 and rs574347.
  • Lifestyle factors that may be assessed in the method of the present invention include smoking, use of recreational drugs, alcohol consumption, diet and exercise, presence of household pets (e.g. applicable to an allergy etc.) and occupation.
  • the methods of the invention thus may include a step of determining or assessing one or more phenotypic factors for the subjects of a control population (for use in step (b)) and also for a test subject in step (f).
  • This may include steps of performing assays or measurements to determine or assess the control or test subjects for the presence, absence, or amount or level of a phenotypic factor.
  • one or more blood tests or analyses of other clinical samples from the subject may be performed, organ function tests or other clinical assessments may be performed, anthropomorphic measurements may be taken, clinical parameters e.g. blood group or blood pressure may be assessed, as well as gathering or assembling information on lifestyle factors.
  • genetic factors are also assessed, to identify any genetic covariates for a given biomarker and/or the extent of their contribution to the variance (step (d) together with optional step (c) of the method).
  • This analysis step is also performed using standard statistical methods, as described in more detail below. Such standard methods lead to the generation of a model which can be used to adjust or correct a normal (e.g.
  • Genetic data identifying genetic variants in a control population may be available, e.g. may have been pre-determined, depending on the control population used, for example a control population or other panel of subjects from a previous study. Such available data may be used directly in the analysis of step (c).
  • the method may include a step of performing one or more genetic tests on the control population in order to identify any genetic variants present in the control population. Genetic variants thereby identified are then analysed in step (d) to determine whether they have an effect on the level of a biomarker in a given clinical sample i.e. whether they contribute to, or explain any variance observed. Data obtained in such a genetic testing step may be used in combination with prior-determined genetic data.
  • Genetic testing methods used to detect genetic variants are well known in the art. Examples of genetic testing methods that may be used to detect genetic variants in the method of the present invention include whole-genome SNP analysis, whole exome sequencing, and whole genome sequencing. A wide range of sequencing technologies and platforms are now available, as are various techniques for detecting a particular genetic variant e.g. for detecting predetermined or known variants and any of these may be used. In particular genetic testing may be performed on a microarray, including the Illumina Infinium HapMap300v2 BeadChip, Illumina Human Omni Express BeadChip, and Illumina Human Exome BeadChip. Exome sequencing may be performed by using Agilent's SureSelect system for exome capture and the SOLiD 5500xl instrumentation for sequencing.
  • step (f) may comprise a step of performing a genetic test on the test subject to determine the presence or absence of any one or more genetic variants identified as genetic covariates.
  • tests for detecting pre-determined genetic variants are well known, e.g. using probes or primers designed to identify a particular genetic sequence or variant. Thus such genotype assessment tests may include the use of specific PCR primers or other variant-specific amplification technologies e.g. LCR, NASBA etc., or hybridisation probes, e.g. padlock probes, molecular inversion probes, molecular beacons etc.
  • a step of sequencing a genomic sequence from the test subject may be performed.
  • Genetic variants may include single nucleotide polymorphisms (SNPs), deletions or insertions, copy number variations (CNVs), and structural variations (e.g. recombinations etc).
  • combinations of genetic variants may be identified, and thus genetic variants may also comprise haplotypes, that is to say more than one genetic variant may be present within a particular chromosome, portion of a chromosome or locus that are found to affect the level of a biomarker in a test subject.
  • Genetic variants may be found in genic or intergenic regions of DNA.
  • Variants found in genic regions may be found in promoter or terminator sequences, or in exonic or intronic DNA. Variants may also be found in regions of non-coding DNA transcribed into non-translated RNA molecules, for instance rRNA, tRNA, miRNA or piRNA.
  • the present invention encompasses a method for detecting a biomarker in a sample of body fluid or tissue from a subject, which method comprises a step of determining the presence or absence of a genetic variant selected from one or more of the genetic variants listed in Table 5 and Table 7.
  • the biomarker may be any one or more of the biomarkers listed in Table 5 or 7.
  • the present invention may provide a method of testing a subject for the presence or absence of any one or more genetic variants of Table 5 or Table 7. Such testing may be carried out in the context of determining the level of a biomarker (particularly a biomarker of Table 5 or 7) in a sample of body tissue or fluid from said subject. More particularly the method may include a step of assessing the effect of the genetic variant on the level of the biomarker in the sample and/or adjusting or correcting the biomarker level for the effect of the genetic variant.
  • Statistical analyses are performed in order to determine the effect of any phenotypic and/or genotypic factors on the abundance level of a biomarker.
  • the analyses may identify the contribution made by the covariates to the observed variance, or in other words identify the variables (covariates) that explain a proportion of the variance.
  • Statistical analysis may be performed using any of the known commercially or publicly available software packages, including R, SAS (Statistical analysis software) or Statistica. Software suites associated with the R-package include GenABEL and ProABEL.
  • the methods include a step of identification of phenotypic factors which have an effect, particularly a significant effect, on the abundance level of a biomarker in a sample from a control subject.
  • Phenotypic factors may be identified by detecting statistical correlations between a given phenotypic factor and the abundance level of a biomarker in samples from one or more subjects e.g. control subjects.
  • a multiple linear regression analysis may be performed in order to determine the correlation between a phenotypic covariate and the abundance level of a biomarker. Any statistical test may be performed which can identify the covariates that explain a significant proportion of the variance seen in the measured biomarker level. For instance, the significance of each phenotypic covariate's contribution to the total variance can be estimated using an ANOVA-approach as
  • Any significance value may be used to judge whether a specific covariate has a significant effect for a specific biomarker or whether the difference (i.e. increase or decrease) between the adjusted values in the method of identifying a biomarker is significant.
  • a significance value of below 0.5, preferably below 0.4, or below 0.3, or 0.2 or 0.1 may be used. In a preferred embodiment of the present invention a significance value of less than 0.1, preferably below 0.05, may be used.
  • P-values of below 0.05 may also include p-values of below 0.04, 0.03, 0.02 or 0.01, or below.
  • P-values may be calculated in any of the ways known in the art.
  • a Bonferroni-adjusted p-value may be calculated when assessing whether a particular covariate has a significant effect for a specific biomarker.
  • a covariate might be considered significant for a specific biomarker if its Bonferroni-adjusted p-value is below 0.05.
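The Bonferroni criterion described above can be illustrated with a short sketch. The raw p-values and covariate names are hypothetical, and the choice of which tests make up the correction family (here, the covariates tested for one biomarker) is an assumption; the source does not specify it.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def significant_covariates(p_by_covariate, alpha=0.05):
    """Return covariates whose Bonferroni-adjusted p-value is below alpha."""
    names = list(p_by_covariate)
    adjusted = bonferroni_adjust([p_by_covariate[name] for name in names])
    return {name: adj for name, adj in zip(names, adjusted) if adj < alpha}


# Hypothetical raw p-values for four covariates tested against one biomarker.
raw_p = {"age": 0.001, "bmi": 0.02, "smoking": 0.30, "sample_round": 0.004}
hits = significant_covariates(raw_p)
print(sorted(hits))
```

In this example "bmi" passes an unadjusted 0.05 threshold but not the adjusted one, which is the situation the correction is meant to guard against.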
  • the correlation between two biomarkers may also, if desired, be calculated in the method of the present invention in which an individualised normal value of a biomarker is determined. However, this step is not essential or important for generating the model.
  • Abundance levels of a biomarker in a population may be rank-normalised and correlations between pairs of biomarker abundance levels may be calculated on the adjusted rank-transformed values by applying Spearman's Rho statistics on pairwise complete
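A minimal sketch of Spearman's rho computed on pairwise complete observations is given below. It assumes that missing measurements are encoded as None and that the inputs would in practice be the adjusted values; the function names and example data are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_pairwise_complete(x, y):
    """Spearman's rho using only subjects with both measurements present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical abundance values for two biomarkers; None marks a missing value.
protein_a = [1.2, 2.5, None, 3.1, 4.8, 5.0]
protein_b = [0.9, 2.0, 7.7, None, 4.1, 6.3]
rho = spearman_pairwise_complete(protein_a, protein_b)
print(rho)
```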
  • step (b) may result in the determination of the residual values for the control abundance levels.
  • the residual control abundance values may be used in their raw state, as determined in step (b), or they may optionally be normalised in an optional step (c).
  • the residual control biomarker abundance level values from step (b) which have been adjusted for the effect of the phenotypic covariates may be transformed in order to obtain a normal distribution.
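One way the residual values of step (b) might be obtained is sketched below, assuming an ordinary least-squares multiple linear regression of the control abundance levels on the phenotypic covariates. The function names, the two covariates (age and BMI) and the six control subjects are hypothetical; a real analysis would normally use a statistics package rather than this hand-written solver.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_covariate_model(covariates, levels):
    """Least-squares fit of levels ~ intercept + covariates (normal equations)."""
    design = [[1.0] + list(row) for row in covariates]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, levels)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def residuals(covariates, levels, coefficients):
    """Abundance levels with the fitted covariate effects removed."""
    fitted = [
        coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))
        for row in covariates
    ]
    return [y - f for y, f in zip(levels, fitted)]


def variance_explained(levels, resid):
    """Proportion of the total variance accounted for by the covariates."""
    mean = sum(levels) / len(levels)
    ss_total = sum((y - mean) ** 2 for y in levels)
    ss_resid = sum(e ** 2 for e in resid)
    return 1.0 - ss_resid / ss_total


# Hypothetical control population: (age, BMI) and measured abundance level.
control_covariates = [(34, 22.1), (51, 27.4), (45, 24.9), (62, 30.2), (29, 20.5), (57, 26.0)]
control_levels = [5.9, 7.6, 6.8, 8.3, 5.4, 7.5]

coef = fit_covariate_model(control_covariates, control_levels)
resid = residuals(control_covariates, control_levels, coef)
print(coef, variance_explained(control_levels, resid))
```

The residuals are what would be carried forward to the optional normalisation and the genetic analysis; the fitted coefficients are the part of the model later applied to a test subject.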
  • the residual values may be rank-normally transformed, e.g. using the "rntransform" function available from the R-package GenABEL.
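A rank-based transformation to normality can be sketched as follows. This is not GenABEL's implementation: the offset used to map ranks to quantiles, (rank - 0.5) / n, is an assumption chosen for illustration, and the function name and example residuals are hypothetical.

```python
from statistics import NormalDist


def rank_normal_transform(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # tied values share the mean of their ranks
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    standard_normal = NormalDist()
    # Assumed offset: (rank - 0.5) / n keeps every quantile strictly inside (0, 1).
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical residual abundance values from step (b).
residual_values = [3.2, 1.1, 9.9, 4.0, 2.5]
transformed = rank_normal_transform(residual_values)
print(transformed)
```

The transformation keeps the ordering of the subjects but discards the spacing between values, so a skewed or heavy-tailed distribution is mapped onto an approximately normal one.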
  • the adjusted or corrected abundance level values from step (b) or the normalised values from step (c) are used in the step of genetic analysis in step (d) to identify genetic covariates that significantly affect the abundance level of a biomarker in a subject.
  • the abundance level values can then also be adjusted and corrected for effect of any genetic covariates.
  • the model comprises the statistical analyses of steps (b) and (d), which are used to assess the effect of the covariates (e.g. the extent of their contribution to the variance).
  • the statistical analyses of the phenotypic factors and/or genetic variants identified for or in and/or determined for the test subject are performed, essentially in the same or analogous or similar way as for the control population.
  • Methods, including software packages, for performing the analysis of genetic data to identify or to detect covariates are well known in the art and any number of different statistical analysis methods and packages may be used, for example plink, emmax, snptest. Basically, any method of determining the statistical significance of a genetic variant may be used.
  • GenomeStudio (Illumina Inc)
  • the analysis of the genetic data may comprise a genome-wide association study (GWAS), according to techniques and principles well known in the art.
  • This analysis step may also comprise a step of imputation of the genetic data. Again methods and software for this are known and available, for example plink, emmax, snptest, Impute2 or Shapeit.
  • Statistical analysis may be performed on the imputed genetic data using any of the above-referenced statistical software packages, which may include functions for estimating heritability (h²) and performing genetic association analysis by adjusting for pedigree structure, e.g. GenABEL and ProABEL.
  • the model generated thereby may be used to calculate the individualised normal level of a biomarker for a test subject.
  • the abundance level of said biomarker in said test subject is determined, and the phenotypic and genetic covariates of said test subject are assessed.
  • the determination of the abundance level of said biomarker and phenotypic and genetic analyses will be performed using the same method as for each of the members of the control population.
  • the abundance level of said biomarker determined for the test subject may be adjusted according to the model generated in the method of the present invention to calculate an adjusted value for the abundance level of said biomarker, based on said test subject's individual phenotypic and genetic covariates. This may then be compared to the prior-determined individualised cut-off value.
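The comparison of a test subject's level with an individualised normal level might be sketched as follows. The source does not state how the cut-off is derived, so the rule used here (a deviation of more than two standard deviations of the control residuals) is purely an assumption, as are the model coefficients, covariate names and values.

```python
from statistics import stdev


def expected_normal_level(model, subject):
    """Individualised normal level predicted from the subject's covariates."""
    return model["intercept"] + sum(
        coef * subject[name] for name, coef in model["coefficients"].items()
    )


def is_deviant(measured, model, subject, control_residuals, n_sd=2.0):
    """Return (flag, deviation); the n_sd rule is an assumed cut-off, not the source's."""
    deviation = measured - expected_normal_level(model, subject)
    return abs(deviation) > n_sd * stdev(control_residuals), deviation


# Hypothetical model from the control population and a hypothetical test subject.
model = {"intercept": 2.0, "coefficients": {"age": 0.05, "genotype_dosage": 0.5}}
test_subject = {"age": 40, "genotype_dosage": 1}
control_residuals = [-0.3, 0.1, 0.2, -0.1, 0.4, -0.2, 0.0, -0.1]

flag, deviation = is_deviant(5.4, model, test_subject, control_residuals)
print(expected_normal_level(model, test_subject), flag, deviation)
```

Two subjects with the same measured level can therefore be classified differently if their covariates predict different normal levels, which is the point of an individualised cut-off.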
  • Figure 1 shows the characteristics of the PEA-measurements.
  • A Intensities of PEA values and proportion of proteins and individuals above detection limit. In the heatmap, individuals are in columns and proteins in rows. Heatmap colors represent ddCq-values ranging from low (blue) to high (yellow) with measurements below detection limit coded white.
  • B Significant covariates in relation to each protein. Covariates are listed from the upper right part of the circle (12 o'clock to 4) and connections illustrate significant (p-value < 0.05, Bonferroni adjusted) contributions to PEA variance.
  • C PEA to PEA correlations; colored connections represent a squared correlation coefficient (R²) greater than 0.5. The width of the connection reflects the magnitude of the squared correlation coefficient. All correlation coefficients (R) were positive.
  • Figure 2 shows Manhattan plots of GWAS results.
  • A IL-6RA
  • B CXCL5
  • C CCL24
  • D E-selectin
  • X-axis labels refer to human chromosomes listed 1-22 and X. P-values were calculated from 1 df Wald statistics chi-square values using 971 individuals.
  • Figure 3 shows covariates and protein biomarkers.
  • A Variance explained by each of the covariates for the set of 77 biomarkers with measurable variability with the 11 most important covariates colored. The combined effect of the remaining covariates is shown in grey, assuming independence in effect between covariates.
  • B The percent of the variance explained by the full set of covariates studied for the 77 proteins, using a combined model.
  • C Abundance of CXCL10, expressed as ddCq-values, in relation to age when stratified by genotype at rs11548618; AA (grey) AB (red) and BB (blue). Shadowed areas represent the 95% confidence interval in a linear model predicting ddCq from age.
  • Figure 4 shows the number of significant epidemiological associations in proteins with significant case-control difference using PNPPP and unadjusted (Raw) abundance levels.
  • E Stroke (PNPPP/unadjusted) 6 and 51.
  • Figure 5 shows the case-control differences using unadjusted (Raw) and the PNPPP method. Absolute differences in mean value (case - control) in A) Cataract, B) Diabetes, C) Hypertension, D) Myocardial Infarction, E) Stroke.
  • X-scale is in log2 PEA values, all 125 proteins are stacked, sorted by mean difference in Raw values (left side). Corresponding PNPPP-differences are drawn on the right side. Values below the dashed grey lines have negative sign, i.e. control values are higher than in cases. Black colour indicates significant difference (two-sided Wilcoxon rank-sum test, p-value < 8 x 10⁻⁵).
  • Figure 6 shows examples of using PNPPP for determining biomarker cutoffs.
  • A-D) Solid lines represent PNPPP-values and dashed raw values. Blue values are controls and red represent cases. Grey bars (right y-axis) depict disease incidence in %.
  • PEA Proximity Extension Assay
  • the Northern Sweden Population Health Study was initiated in 2006 to provide a health survey of the population in the parish of Karesuando, county of Norrbotten, Sweden, and to study the medical consequences of lifestyle and genetics.
  • This parish has about 1,500 inhabitants who meet the eligibility criteria in terms of age (≥15 y), of whom 719 individuals participated in the study (KA06 cohort).
  • As a second phase of the NSPHS another 350 individuals from a neighboring village (Soppero) were recruited in 2009 (KA09 cohort).
  • blood samples were taken (serum and plasma) and stored at -70°C on site. Both the 2006 and 2009 samples used in this study have undergone 2 freeze-thaw cycles prior to the measurements carried out here.
  • Uppsala (Regionala Etikprovningsnamnden, Uppsala, 2005:325) in compliance with the Declaration of Helsinki. All participants gave their written informed consent to the study including the examination of environmental and genetic causes of disease. In cases where the participant was not of age, a legal guardian signed additionally. The procedure that was used to obtain informed consent and the respective informed consent form has recently been discussed in light of present ethical guidelines.
  • Protein levels in plasma were analyzed using the Olink Proseek Multiplex Oncology I 96x96 kit and quantified by real-time PCR using the Fluidigm BioMark™ HD real-time PCR platform as described earlier (Assarsson, E. et al. 2014. PLoS One 9, e95192).
  • a pair of oligonucleotide-labelled antibody probes binds to the targeted protein, and if the two probes are in close proximity a PCR target sequence is formed by a proximity-dependent DNA polymerization event; the resulting sequence is subsequently detected and quantified using standard real-time PCR.
  • dCq Ma is a per-assay value defined by the manufacturer to give a positive log2-scale.
  • a list of the 92 proteins quantified by the PEA is shown below in Table 4. The ddCq values were then log2-transformed for subsequent analysis.
  • Each PEA (proximity extension assay) measurement has a specified lower detection limit calculated based on negative controls that are included in each run and measurements below this limit were removed from further analysis.
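The detection-limit filter described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name, the example ddCq values and the detection limit of 1.0 are all hypothetical, and it assumes that a value exactly at the limit is kept.

```python
from __future__ import annotations

from typing import Optional


def mask_below_detection(values: list[float], lod: float) -> list[Optional[float]]:
    """Replace measurements below the assay's lower detection limit with None.

    Assumption: a value exactly equal to the limit is retained.
    """
    return [v if v >= lod else None for v in values]


# Hypothetical ddCq values for one assay and a hypothetical detection limit.
ddcq = [2.1, 0.4, 3.7, 0.9, 5.2]
kept = mask_below_detection(ddcq, lod=1.0)

# Fraction of individuals above the detection limit for this assay.
fraction_detected = sum(v is not None for v in kept) / len(kept)
```

Keeping the masked positions as None (instead of dropping them) preserves the alignment between individuals and measurements for later covariate adjustment.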
  • the KA06 and KA09 cohorts have previously been genotyped on the Illumina Infinium HapMap300v2 BeadChip (308,531 markers) and Illumina Human OmniExpress BeadChip (731,442 markers) arrays respectively as described earlier (Johansson, A. et al. 2013. Proc. Natl. Acad. Sci. USA 110, 4673-4678).
  • the specific KA06 and KA09 data were quality checked separately, leaving 691 individuals with 306,086 SNPs at 99.50% genotyping rate and 346 individuals with 631,503 SNPs at 99.88% genotyping rate, respectively. 4 individuals were present in both cohorts and these were removed from the KA06 data.
  • Reference calls for SNPs were made if there were at least 3 reference sequence reads with unique start points and a maximum of 5% reads with non-reference at that position.
  • Reference calls for INDELs were made if there were no reads at all without the reference call. All other calls were set to missing. We then required at most 5% missing call rate per SNP or INDEL. This resulted in 83'568 SNPs with non-zero MAF at 98.74% total genotyping rate and 38'290 INDELs with a total genotyping rate at 99.45 % and an additional 350k positions with reference calls only.
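The reference-call rules stated above can be expressed as a small sketch. The `PileupSite` structure and all function names are hypothetical; only the thresholds (at least 3 reference reads with unique start points, at most 5% non-reference reads, at most 5% missing calls per marker) come from the text. How non-reference calls were made is not described here, so it is not modelled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PileupSite:
    """Hypothetical per-position read summary for one individual."""
    ref_reads_unique_starts: int  # reference reads with distinct start points
    total_reads: int
    nonref_reads: int


def is_snp_reference_call(site: PileupSite, min_ref: int = 3,
                          max_nonref_frac: float = 0.05) -> bool:
    """SNP rule: >= 3 unique-start reference reads and <= 5% non-reference reads."""
    if site.total_reads == 0:
        return False
    nonref_frac = site.nonref_reads / site.total_reads
    return site.ref_reads_unique_starts >= min_ref and nonref_frac <= max_nonref_frac


def is_indel_reference_call(site: PileupSite) -> bool:
    """INDEL rule: no reads at all without the reference call."""
    return site.total_reads > 0 and site.nonref_reads == 0


def passes_missingness(calls: list, max_missing: float = 0.05) -> bool:
    """Marker-level filter: at most 5% missing (None) calls across individuals."""
    missing = sum(c is None for c in calls)
    return missing / len(calls) <= max_missing
```

A site that fails the rule would be set to missing (None) before the marker-level missingness filter is applied.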
  • the pseudoautosomal and non-pseudoautosomal regions on chromosome X were handled separately.
  • the resulting data was filtered on marker level by requiring IMPUTE's 'info' score >0.3 in both the KA06 and KA09 cohorts before merging.
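As a sketch of the marker-level filter, assuming the per-marker 'info' scores of each cohort have already been read into dictionaries keyed by marker ID (the function name, the dictionary layout and the example IDs are hypothetical), a marker is kept only if it exceeds the threshold in both cohorts:

```python
from __future__ import annotations


def markers_passing_info(info_ka06: dict[str, float],
                         info_ka09: dict[str, float],
                         threshold: float = 0.3) -> set[str]:
    """Return markers whose info score is > threshold in BOTH cohorts."""
    shared = info_ka06.keys() & info_ka09.keys()
    return {m for m in shared
            if info_ka06[m] > threshold and info_ka09[m] > threshold}


# Hypothetical info scores for three markers in each cohort.
ka06 = {"snpA": 0.95, "snpB": 0.25, "snpC": 0.80}
ka09 = {"snpA": 0.90, "snpB": 0.70, "snpC": 0.30}
keep = markers_passing_info(ka06, ka09)
```

Filtering before merging means that only markers imputed with acceptable quality in both cohorts reach the merged data set.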
  • Merging of the imputed data was done using GTOOL (v0.7.5) (Freeman, C, Marchini, J. 2013.
  • GTOOL http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html)
  • the resulting merged data was further filtered using QCTOOL (v1.3) (Band, G., Marchini J.
  • Covariates were considered significant for a specific protein if their Bonferroni-adjusted p-values were below 0.05 (p-value < 3.16 x 10⁻⁴, 0.05/158).
  • Each PEA measurement was individually adjusted for significant covariates and rank-transformed to normality by using the 'rntransform' function available from the R-package GenABEL (v1.6.7) (Aulchenko, Y. S. et al. 2007. Bioinformatics 23, 1294-1296). Correlations between pairs of PEA measurements were calculated, on the adjusted and rank-transformed values, using the 'cor' function applying Spearman's Rho statistics on pairwise complete observations.
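The adjust-then-rank-transform step can be illustrated with a standard-library sketch. This is not GenABEL's 'rntransform': it adjusts for a single covariate by ordinary least squares and then applies a rank-based inverse normal transform with the quantile offset (rank - 0.5)/n, which is an assumption made for illustration. The function names and the example data are hypothetical.

```python
from __future__ import annotations

from statistics import NormalDist, mean


def residuals_one_covariate(y: list[float], x: list[float]) -> list[float]:
    """Residuals of a least-squares fit y ~ a + b*x (one covariate only)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def rank_inverse_normal(values: list[float]) -> list[float]:
    """Map values to standard-normal quantiles via ranks (ties get the mean rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical data: protein level (ddCq) and one significant covariate (age).
age = [25.0, 40.0, 55.0, 63.0, 71.0]
ddcq = [3.1, 3.9, 4.2, 5.5, 5.1]
adjusted = rank_inverse_normal(residuals_one_covariate(ddcq, age))
```

With several significant covariates the residuals would come from a multiple regression instead; the transform applied to the residuals is unchanged.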
  • the NSPHS is a population-based study that includes many relatives, and special care has to be taken to avoid biases due to relatedness. Therefore, all genetic association calculations were carried out using the GenABEL or ProbABEL (Aulchenko, supra) software suites, which have been developed to enable statistical analyses of genetic data of related individuals. These packages include functions for estimating the narrow-sense heritability (h²) and performing genetic association analyses (Chen, W. M., Abecasis, G. R. 2007. American Journal of Human Genetics 81, 913-926) by adjusting for pedigree structure. In brief, the heritability of each trait (protein abundance) is estimated using a polygenic model as implemented by the 'polygenic' method in the GenABEL R-package.
  • This heritability estimate represents the variance in the phenotype that is explained by genetic factors and is estimated by maximizing the likelihood of the trait-data under a polygenic model including fixed effects such as covariates and relatedness among individuals (kinship).
  • the KA06 cohort was used as discovery cohort in the genome-wide association studies (GWAS) and KA09 as replication cohort. Since we cannot rule out protein degradation effects due to differences in storage time between the two cohorts, this split is preferable to a random split, where degradation effects could affect the association analysis.
  • Strict Bonferroni-adjusted p-values (p-value < 1.03 x 10⁻⁸, 0.05/4,840,842) were used to report significance in the discovery cohort and the replication cohort (p-value < 0.05 / number of significant SNPs in the discovery cohort).
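The Bonferroni cut-offs quoted in the text follow directly from dividing the significance level by the number of tests. The sketch below reproduces the covariate cut-off (0.05/158) and the discovery cut-off (0.05/4,840,842); the function name and the SNP count used for the replication cut-off are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Cut-offs stated in the text.
covariate_cutoff = bonferroni_threshold(0.05, 158)        # about 3.16e-4
discovery_cutoff = bonferroni_threshold(0.05, 4_840_842)  # about 1.03e-8

# Replication cut-off: 0.05 divided by the number of SNPs that were
# significant in the discovery cohort (100 is a hypothetical count).
replication_cutoff = bonferroni_threshold(0.05, 100)
```

Because far fewer SNPs are carried into replication than are tested in discovery, the replication cut-off is much less stringent than the genome-wide one.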
  • SL-1 Stromelysin-1
  • GM-CSF Granulocyte-macrophage colony-stimulating factor
  • ER Estrogen receptor
  • CA242 Cancer Antigen 242
  • IL-2 Interleukin-2
  • EPR Epiregulin
  • BTC Betacellulin
  • IL-4 Interleukin-4
  • IFN-gamma Interferon gamma
  • IL-7 (Interleukin-7), TNF (Tumor necrosis factor)
  • CEA Carcinoembryonic antigen-related cell adhesion molecule 5
  • MYD88 Myeloid differentiation primary response protein MyD88
  • MUC-16 Mucin-16
  • REG-4 Regenerating islet-derived protein 4
  • WFDC2 WAP four-disulfide core domain protein 2
  • IL-12 Interleukin-12
  • the Illumina Body Map suggests that CD69 and Caspase-3 both are expressed in leukocytes, lymph nodes and adrenal glands (i.e. 3 of 16 investigated tissues).
  • Several of the 12 pairs that were highly correlated were proteins with similar functions, such as CXCL9, 10, 11, and TNF-R1 and TNF-R2, while in other cases apparently unrelated proteins were highly correlated. These correlations may reflect as yet unknown patterns of co-regulation, and bring into question their value as independent biomarkers.
  • each of the 77 proteins was adjusted for the significant clinical and lifestyle variables (Table 1) and the samples were split into a discovery and a replication cohort based on sample collection round (see Methods for details).
  • In the discovery phase we identified 15 proteins with genome-wide significant hits (nominal p-value down to 1.1 x 10⁻⁴⁰, Table 2), employing a Bonferroni corrected p-value cut-off of 0.05. Of these, 14 had at least one replicated association (nominal p-value down to 1.1 x 10⁻²⁰, Table 2). In all, 175 genome-wide significant hits were detected in the discovery phase, out of which 101 replicated.
  • the fraction of variance explained by the second-ranking SNPs was small compared to the top-ranking SNP.
  • the top SNPs were located in cis with the gene encoding the protein.
  • CCL19 is a chemokine implicated in inflammatory and immunological responses, but also in normal lymphocyte recirculation and homing. Higher serum levels of CCL19 have been associated with poor prognosis in AIDS patients. For E-selectin, the circulating level is known to be affected by ABO blood group.
  • Basigin expression has been associated with shorter survival and proposed as a biomarker for adjuvant therapy in colorectal cancer.
  • Our analysis did not show any significant association of Basigin levels with covariates such as anthropometrics, age, sex or smoking.
  • glucocorticoids commonly found in inhalators used to treat asthma-related conditions, decreased circulating levels of Basigin thereby possibly masking the need for adjuvant treatment.
  • Our results indicate that when using Basigin as a biomarker in an ageing population, medication history and dosage should be taken into account in order to establish an appropriate clinical cut-off.
  • medications used to treat e.g. hypertension, such as dihydropyridine derivatives, but not ACE-inhibitors or selective beta-blocking agents, cause or maintain an increase in the inflammatory response cascade via high IL-6 levels.
  • the IL-6 signaling is important in the pathogenesis of several autoimmune and chronic inflammatory diseases and antibody based drugs are used to target the IL-6 receptor in patients with rheumatoid arthritis (RA) in order to dampen the inflammatory response.
  • RA rheumatoid arthritis
  • circulating levels of CXCL10 have been estimated to 120 ± 83 pg/ml in patients diagnosed with Graves' Disease as compared to 72 ± 32 pg/ml in controls; an average increase of 67% not taking genetic and non-genetic covariates into account.
  • the average increase in individuals in our study carrying the reference genotype for rs11548618 was 178% (linearized ddCq) of the level in heterozygous individuals, clearly illustrating the relative importance of carrier genotype versus the disease state on biomarker levels.
  • the worldwide minor allele frequency of rs6946822 is listed in dbSNP as 0.46, implying that every 5th individual will be homozygous, similar in frequency to the individuals who smoke in the U.S. today, demonstrating the large, common genetic effects on biomarker variation found in the population today.
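The "every 5th individual" figure can be checked with a short calculation, assuming Hardy-Weinberg equilibrium (an assumption not stated in the text): with a minor allele frequency q = 0.46, the expected frequency of minor-allele homozygotes is q², roughly 0.21. The function name is hypothetical.

```python
def genotype_frequencies(maf: float) -> tuple:
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    Returns (minor homozygote, heterozygote, major homozygote).
    """
    q = maf
    p = 1.0 - maf
    return (q * q, 2.0 * p * q, p * p)


minor_hom, het, major_hom = genotype_frequencies(0.46)

# Expressed as "one in N individuals" for the minor-allele homozygotes.
one_in_n = 1.0 / minor_hom
```

Under the same assumption, about half of all individuals would be heterozygous carriers, which is why a common variant of this kind can affect biomarker levels in a large share of the population.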
  • biomarkers that are not significantly affected by any of the variables examined, rendering them less susceptible to variability induced by non-disease related factors. Although we have investigated a large number of genetic, clinical and lifestyle factors, they altogether explain at most 56% of the variation in biomarker levels between individuals. The remaining variance must reflect other factors, or non-additive interaction between some of the factors studied, and their identification could further increase the utility of biomarkers by reducing sources of variation unrelated to disease state. For example, CCL24 had a heritability of 0.78, indicating that additional genetic loci might affect protein levels. For 15 of the biomarkers the vast majority of abundances were below the detection limits in our cohort. Several of these could represent ideal biomarkers without major presence in normal plasma and thus with no influencing genetic or lifestyle factors.
  • MUC-16 (or CA125) that is used clinically as a test for ovarian cancer
  • potential biomarkers such as REG-4 that has been proposed as a biomarker for pancreatic ductal adenocarcinoma.
  • CML chronic myelogenous leukemia
  • NSCLC non-small-cell lung cancer
  • a gene-fusion mutation has higher drug response rates than those lacking this gene-fusion.
  • the number of cancer biomarkers in clinical use is still limited. In the set of biomarkers studied here, we identified a surprisingly strong genetic effect on some biomarkers after correcting for clinical (medication) and lifestyle variables.
  • biomarkers were strongly affected by environmental lifestyle or clinical factors. Genotyping of selected polymorphisms with a strong effect on abundance appears to be crucial for about 20% of the biomarkers in our study, while lifestyle and medication are important covariates for the majority.
  • analysis of broad-spectrum biomarkers could be used as a follow-up analysis for patients, or for screening of risk groups. Our analyses indicate that such tests would be accompanied by collecting additional relevant information such as anthropometrics, medication and genotyping of specific polymorphisms known to affect the baseline of these biomarkers.
  • the clinical laboratory that performs the biomarker analysis would have documentation on which cofactors that significantly influence the baseline levels, and could advise the physician on how to interpret the outcome of the test.
  • biomarker-specific covariate profiles will make it possible to determine more precise, individualized clinical cut-off levels. This in turn could lead to a more efficient use of protein biomarkers for early detection of abnormal levels and for increased sensitivity and specificity in disease diagnosis.
  • biomarker-specific profiles of covariates it will be possible to fully harness the potential of existing and novel biomarkers for disease diagnosis and management.
  • Example 2 Assessment of the effect of phenotypic and genetic factors on the abundance levels of proteins associated with cardiovascular disease.
  • Example 3 Example calculations of an individualized model.
  • a model which is capable of adjusting the abundance level of a biomarker in a sample for the effect of the phenotypic and/or genetic covariates.
  • This model may subsequently be used to determine a value for the normal level for the abundance of a biomarker in a sample taken from a test subject.
  • An example of such a model (for FABP4) is shown below.
  • a 'new' value can be calculated from the measured (or 'observed') value for a biomarker in a test subject, adjusted to take into account phenotypic and genetic covariates.
  • partOfSignalExplainedByNonDiseaseModel = -0.03638 * [length in cm] + 0.03613 * [weight in kg] + 0.36965 * [sex, female yes/no]
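A worked sketch of applying this model to a test subject is given below. The three coefficients are those listed in the model above; everything else is an assumption made for illustration: the function names, the example subject, the observed value, and in particular the choice to obtain the adjusted ('new') value by subtracting the explained part from the observed value.

```python
def explained_by_non_disease_model(length_cm: float, weight_kg: float,
                                   is_female: bool) -> float:
    """Part of the signal explained by the covariates (coefficients from the text)."""
    return (-0.03638 * length_cm
            + 0.03613 * weight_kg
            + 0.36965 * (1.0 if is_female else 0.0))


def adjusted_value(observed: float, length_cm: float, weight_kg: float,
                   is_female: bool) -> float:
    """Assumed adjustment: observed value minus the covariate-explained part."""
    return observed - explained_by_non_disease_model(length_cm, weight_kg, is_female)


# Hypothetical test subject: 170 cm, 70 kg, female, observed level 5.0.
explained = explained_by_non_disease_model(170.0, 70.0, True)
new_value = adjusted_value(5.0, 170.0, 70.0, True)

# The adjusted value would then be compared with a pre-determined cut-off
# (the cut-off below is hypothetical).
is_above_cutoff = new_value > 9.0
```

A model that also includes genetic covariates would add one term per genotype (for example, the number of copies of an effect allele multiplied by its coefficient) to the explained part.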
  • PNPPP personally normalized plasma protein profiles
  • the NSPHS cohort represents a cross-section of the inhabitants in the north rural areas of Sweden and thus participants suffering from any non-communicable disease were not excluded from the study.
  • the endpoints in this study are the self-reported diseases/conditions of Cataract, Diabetes (both type I and II), Myocardial infarction, High blood pressure and Stroke.
  • the frequencies of these and baseline anthropometrics are reported in Table 8.
  • the overlap between the individuals self-reporting multiple diseases are shown in Table 10.
  • PTX3 Pentraxin-related protein PTX3
  • ITGB1BP2 Integrin beta-1-binding protein 2
  • PAPPA Pappalysin-1
  • NT-pro-BNP N-terminal pro-B-type natriuretic peptide
  • BNP Natriuretic peptides B
  • the KA06 and KA09 cohorts have been genotyped as described in Example 1. The two cohorts were imputed separately as outlined above.
  • the input data was phased chromosome-wise using SHAPEIT (v2.r727).
  • the reference panel used was the autosomal 1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) (National Center for Biotechnology Information build b37, Dec 2013) accessed from the IMPUTE Web resource.
  • the resulting data was filtered as outlined above.
  • the final dataset included 8'506'190 SNPs and INDELs.
  • the heritability of protein abundance levels was estimated by evaluating the co-segregation of protein levels with the relatedness among individuals, using a polygenic model (see Methods for details). For 98 of the proteins the levels were significantly heritable (Bonferroni adjusted p-value ⁇ 0.05), with heritability estimates ranging from 0.19 to 0.74.
  • the GWASs yielded genome-wide significant hits for 10 of the proteins unique to this study; 9 of these represent previously unreported associations whilst we replicate the IL1R1 locus previously reported for ST2.
  • the top genetic associations for the proteins unique to this study are listed in Table 7. Including the genetic hits described in Example 1 and shown in Table 3, 24 of 125 proteins with detectable levels had genome-wide significant hits. 2970 hits for all 24 proteins were identified, which included annotations and overlaps with previously reported associations according to the NHGRI's GWAS catalogue.
  • Cataract is defined as either congenital or age-related with the latter being the most common with over 75% of the cases. Apart from age, prolonged steroid treatment or exposure to sunlight contribute to disease progress. Using unadjusted levels, 37 proteins showed significant differences between patients and controls, while PNPPP identified 8 proteins with increased levels compared to controls, and one protein with lower levels (Table 9). Six of the 9 proteins were adjusted for one or more significant covariate. MMP-12 (matrix metallopeptidase 12) was associated with higher levels in the cases and affected by 5 factors; age, systolic blood pressure (SBP), waist circumference, genetic factors (Table 7) and length.
  • SBP systolic blood pressure
  • EGFR Epidermal growth factor receptor
  • Diabetes type II is associated with age and lifestyle factors, such as weight (obesity), diet and exercise. This is consistent with the diabetes cases in our cohort, which are older and have a higher BMI than the controls (Table 8).
  • 22 proteins showed significant differences between patients and controls, while use of PNPPP identified 4 proteins with higher abundance levels in cases than controls (Table 9).
  • MMP-10 matrix metallopeptidase 10
  • GDF-15 also showed higher levels in cases compared to controls.
  • GDF-15 levels were significantly associated with length, weight, waist circumference, SBP, age, sample round, pregnancy and bile acid preparations (A05AA).
  • GDF-15 has been proposed as a biomarker for cardiovascular disorders and has been shown to be correlated with a multitude of metabolic and anthropometric parameters including age, waist circumference, mean arterial pressure, fasting glucose, and fasting insulin in an obese cohort. The same study also found significantly increased levels of plasma GDF-15 in obese individuals compared to a control group and specifically in obese individuals with type II diabetes compared to obese individuals with normal glucose tolerance.
  • Blood pressure can be lowered through medication or by changes in lifestyle: exercise, reduction of tobacco use and alcohol consumption, losing weight, changing food habits and reducing stress. Other factors such as sex, age, as well as genetic variants also impact blood pressure.
  • 63 proteins showed significant differences between patients and controls, while use of PNPPP identified 32 proteins with higher abundance levels in cases than controls and 2 with lower levels (Table 9). The strongest association with hypertension was found for SPON1 (Spondin 1) (p-value < 8.7 x 10⁻¹⁸), with no additional significant covariates.
  • SPON1 has been proposed as a candidate gene for hypertension in hypertensive rats based on gene expression studies. Renin (REN) also showed significantly higher levels in cases after correcting for sex and medication (proton pump inhibitors (ATC:A02BC) and Digitalis glycosides (ATC:C01AA)). For six proteins there was a significantly higher number of observations above the detection limit in the cases as compared to controls (Table 9). Among these proteins were BNP (p-value 2.8 x 10⁻²⁰) and NT-pro-BNP (p-value 4.9 x 10⁻¹¹), both previously shown to be associated with risk of hypertension. Kidney failure may also result in elevated BNP and NT-pro-BNP levels and a strong indicator of this condition is blood creatinine levels. However, in our cohort there was no difference (p > 0.05, Wilcoxon rank sum test) in the frequency of elevated levels between individuals with BNP or NT-pro-BNP above, or those below detection level.
  • AMI Acute myocardial infarction
  • GDF-15 Growth Differentiation Factor 15
  • FGF-23 Fibroblast growth factor 23
  • Higher levels of FGF-23 have been associated with mortality and cardiovascular events in patients with chronic kidney disease.
  • FGF-23 regulates serum phosphate levels, where high levels of serum phosphate trigger FGF-23 production; elevated serum phosphate levels are common in patients after AMI and are associated with poorer prognosis.
  • BNP p-value 1.9 x 10⁻²¹
  • NT-pro-BNP p-value 3.3 x 10⁻⁷
  • REG-4 p-value 1.9 x 10⁻⁷
  • Biomarkers for stroke are scarce and often focus on the diagnosis, prediction of severity and therapy selection of ischemic stroke, although some efforts have been made to search for biomarkers differentiating between ischemic and haemorrhagic stroke.
  • 20 proteins showed significant differences between patients and controls, while use of PNPPP identified 6 proteins, all with higher abundance levels in cases than controls (Table 9).
  • Two proteins were found only using the PNPPP: PIGF (Placental growth factor) and CXCL13 (C-X-C motif chemokine 13).
  • IL-6 Interleukin 6
  • mRNA-levels of Amphiregulin (coded by the AREG gene) and CXCL13 (CXCL13) have shown associations with ischemic stroke. Increased AREG expression has also been observed in ischemic stroke with hemorrhagic transformation and increased transcription of CXCL13 has been seen in response to ischemia.
  • For BNP (Table 9), there was a significantly (p-value 2.1 x 10⁻⁸) higher fraction of observations above detection limit in the cases relative to the controls. BNP has previously been shown to have elevated levels after ischemic stroke.
  • a second example is the abundance level of TIM-1 (hepatitis A virus cellular receptor 1), which differed significantly between Cataract cases and controls, but did not differ after normalization for weight, SBP, age, waist, length, usage of insulins and analogues for injection, fast-acting (ATC: A10AB) and genetic factors (Table 7) (Figure 6B).
  • TIM-1 is currently used as a biomarker for proximal tubular injury in renal diseases but has not been linked to cataract, suggesting that the associations seen here using the raw values were due primarily to differences in age, SBP and weight between cataract cases and controls. Similar patterns were seen for GDF-15 (Figure 6C) in relation to Diabetes and SBP and for Growth Hormone in relation to Hypertension and weight (Figure 6D).
  • the diseases examined are all relatively common and a number of the individuals carry diagnoses for several of the diseases.
  • the fraction of cases only diagnosed with one disease varies from 16% for Stroke to 60% for Hypertension (Table 10).
  • the substantial fraction of individuals with multiple diagnoses implies that some biomarkers could be shared between disease groups. Indeed, among the proteins identified by PNPPP analysis there are examples of such cross-sharing biomarkers.
  • the small number of individuals remaining when requiring no overlaps between end-points reduces the statistical power to detect proteins with case-control differences; nevertheless, for Cataract and Hypertension, 4 and 23 proteins out of the 9 and 33 proteins originally found with significant case-control differences retain their significance when restricting to single end-points (Table 9).
  • the PNPPP procedure provides advantages by limiting the number of covariates included in the analysis and providing a set of protein candidate biomarkers for further validation whose variability is less affected by factors unrelated to disease.
  • An inherent complication to the study of common diseases is that individuals may belong to several of the endpoint categories, reflecting the fact that especially elderly individuals are diagnosed with multiple diseases. This is partly addressed by incorporating the use of medications in the models where any effect of a medication for a partially overlapping disease would be accounted for in both the cases and the controls.
  • the PNPPP procedure can aid in the clinical application of protein biomarkers.
  • anthropometrics and lifestyle related variables are strong risk factors.
  • One such example is age or SBP for cataract.
  • SBP has a p-value (7.7 x 10⁻¹³) on par with the best discriminating protein using raw values (TIM-1, 1.6 x 10⁻¹⁵), and much lower than the best protein using the PNPPP values (EGFR, 5.4 x 10⁻⁸). From Figure 6B (grey bars) it is clear that the cataract frequency increases with SBP up to a certain point, but also that none of the SBP-groups have more than 20% cataract incidence.
  • age and gender matched controls and allows for a more efficient use, and re-use, of control cohorts.
  • the results will also impact on how the biomarkers are used clinically. Either the physician will set a cut-off depending on a predefined set of prerequisites, such as age, gender or ethnicity, or use a computer aid to recalculate the value based on models generated from non-affected individuals. The former system quickly becomes unfeasible when several factors need to be accounted for or when non-categorical variables such as age or weight are used.
  • Table 2 GWAS results. 1 Heritability estimate. 2 Fraction of variance explained in the adjusted and transformed phenotype by the top-ranking SNP (SNP with lowest p-value in the combined analysis). 3 Estimation of the inflation factor for the resulting distribution of p-values. P-values 5 were calculated from 1 df Wald statistics chi-square values.
  • Table 4 A list of the 92 proteins quantified by PEA in Example 1.
  • CA242 CA242 tumor marker Cancer Antigen
  • ErbB2 Receptor tyrosine-protein kinase ErbB-2 P04626 ERBB2_HUMAN ERBB2 ENSG00000141736
  • ErbB3 Receptor tyrosine-protein kinase ErbB-3 P21860 ERBB3_HUMAN ERBB3 ENSG00000065361
  • ErbB4 Receptor tyrosine-protein kinase ErbB-4 Q15303 ERBB4_HUMAN ERBB4 ENSG00000178568
  • GDF-15 Growth/differentiation factor 15 Q99988 GDF15_HUMAN GDF15 ENSG00000130513
  • HGF Hepatocyte growth factor P14210 HGF_HUMAN HGF ENSG00000019991
  • HGF receptor Hepatocyte growth factor receptor P08581 MET_HUMAN MET ENSG00000105976
  • hK11 Kallikrein-11 Q9UBX7 KLK11_HUMAN KLK11 ENSG00000167757
  • MIA Melanoma-derived growth regulatory protein Q16674 MIA_HUMAN MIA ENSG00000261857
  • PECAM-1 Platelet endothelial cell adhesion molecule P16284 PECA1_HUMAN PECAM1 ENSG00000261371
  • PRSS8 Prostasin Q16651 PRSS8_HUMAN PRSS8 ENSG00000052344
  • TGF-alpha Transforming growth factor alpha P01135 TGFA_HUMAN TGFA ENSG00000163235
  • TNF-R1 Tumor necrosis factor receptor 1 P19438 TNR1A_HUMAN TNFRSF1A ENSG00000067182
  • TNF-R2 Tumor necrosis factor receptor 2 P20333 TNR1B_HUMAN TNFRSF1B ENSG00000028137
  • TNFRSF4 Tumor necrosis factor receptor superfamily member 4 P43489 TNR4_HUMAN TNFRSF4 ENSG00000186827
  • VEGF-A Vascular endothelial growth factor A P15692 VEGFA_HUMAN VEGFA ENSG00000112715
  • VEGFR-2 Vascular endothelial growth factor receptor 2 P35968 VGFR2_HUMAN KDR ENSG00000128052
  • AM Adrenomedullin
  • AGRP Agouti-related protein
  • TIE2 Angiopoietin-1 receptor
  • Beta-nerve growth factor Beta-nerve growth factor
  • CSD Cathepsin D
  • CSP-8 Caspase-8
  • Cathepsin L1 (CTSL1)
  • C-C motif chemokine 20 (CCL20)
  • CCL3 C-C motif chemokine 3
  • CCL4 C-C motif chemokine 4
  • CD40 ligand CD40L
  • CHI3L1 Chitinase-3-like protein 1
  • CXCL1 C-X-C motif chemokine 1
  • CXCL6 C-X-C motif chemokine 6
  • CXCL16 C-X-C motif chemokine 16
  • CSTB Cystatin-B
  • Dkk-1 Dickkopf-related protein 1
  • ESM-1 Endothelial cell-specific molecule 1
  • ECP Eosinophil cationic protein
  • EGF Epidermal growth factor
  • Fibroblast growth factor 23 (FGF-23)
  • Follistatin (FS)
  • Galectin-3 (Gal-3)
  • Growth hormone (GH)
  • GDF-15 Growth/differentiation factor 15
  • HSP 27 Heat shock 27 kDa protein
  • HGF Hepatocyte growth factor
  • IL-1 receptor antagonist protein IL- Interleukin-18 (IL-18)
  • IL27-A Interleukin-27 subunit alpha
  • IL-4 IL-4
  • IL-6 receptor subunit alpha IL- Interleukin-6 (IL-6)
  • IL-8 Interleukin-8
  • hK11 Kallikrein-11
  • Kallikrein-6 Lectin-like oxidized LDL receptor 1 (LOX-1)
  • MMP-1 Matrix metalloproteinase-1
  • MMP-10 Matrix metalloproteinase-10
  • MMP-12 Matrix metalloproteinase-12
  • MMP-3 Matrix metalloproteinase-3
  • MMP-7 Matrix metalloproteinase-7 (MMP-7) Melusin (ITGB1BP2)
  • MCP-1 Monocyte chemotactic protein 1
  • MPO Myeloperoxidase
  • MB Myoglobin
  • BNP Natriuretic peptides B
  • NEMO NF-kappa-B essential modulator
  • N-terminal pro-B-type natriuretic peptide (NT-1)
  • pro-BNP Ovarian cancer-related tumor marker CA 125
  • PAPPA Pappalysin-1
  • Pentraxin-related protein PTX3 (PTX3) Placenta growth factor (PIGF)
  • PECAM-1 Platelet endothelial cell adhesion molecule Platelet-derived growth factor subunit B (PDGF subunit B)
  • IL16 Interleukin-16
  • PRL Prolactin
  • Protein S100-A12 (EN-RAGE) Proteinase-activated receptor 1 (PAR-1)
  • Resistin SIR2-like protein (SIRT2)
  • SCF Stem cell factor
  • TM Thrombomodulin
  • TIM-1 Tissue factor (TF)
  • Tissue-type plasminogen activator (t-PA)
  • TNF-related apoptosis-inducing ligand TRAIL
  • TRAIL-2 receptor 2
  • TNF-R1 Tumor necrosis factor receptor 1 (TNF-R1) member 14 (TNFSF14)
  • TNF-R2 Tumor necrosis factor receptor 2
  • FAS Urokinase plasminogen activator surface member 6
  • VEGF-A Vascular endothelial growth factor A
  • Table 8 Baseline data for the five non-communicable diseases in the study cohort Example 4.
  • Table 9 Significant associations of protein abundance levels with disease status identified using the personally normalized plasma protein profiles (PNPPP) methodology in Example 4. Bold faced proteins indicate association only seen using PNPPP.
  • PNPPP personally normalized plasma protein profiles

Abstract

The present invention provides a method for determining an individualised normal level of a biomarker for a test subject for use in analysis of said biomarker in the diagnosis or monitoring of a disease or its treatment in said subject, said method comprising: a) determining the level of a biomarker in samples of a body fluid or tissue in a control population free from said disease, to obtain a set of control abundance levels for said biomarker in a said sample; b) analysing the control biomarker abundance levels of step (a) with respect to one or more non-disease related phenotypic factors to determine which phenotypic factors have a statistically significant effect on the biomarker abundance levels in said control population thereby to identify phenotypic covariates for said biomarker, and performing a statistical analysis to determine the effect of any such phenotypic covariate(s) identified on the variance of the control abundance levels; c) optionally transforming the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, to obtain a normal distribution; d) using the normalised residual control abundance level values from step (c) or the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, in a step of statistical analysis of genetic data comprising genetic variants identified in said control population to determine whether any one or more non-disease related genetic covariate(s) have an effect on the abundance levels of said biomarker in said control population; e) generating a model which is capable of adjusting an abundance level of a said biomarker in a said sample for the effect of the phenotypic and/or genetic covariates identified in steps (b) and (d); f) assessing the phenotype and/or genotype of the test subject with respect to the phenotypic and/or genetic covariates for said biomarker identified in steps (b) and (d) to determine the individual phenotypic and/or genetic covariates for said test subject; and g) using the model of step (e) to determine a value for a normal level for the abundance of the biomarker in a said sample from the said test subject having said individual phenotype and genotype as determined in step (f), thereby to determine an individualised normal level of the biomarker for said test subject.

Description

Determination and analysis of Biomarkers in clinical samples
The present invention relates to methods for determining the level of a biomarker in a subject. More specifically, the present invention relates to a method for deriving individualised normal values for a biomarker based on a variety of different phenotypic factors, e.g. specific lifestyle, clinical, anthropomorphic and genetic factors that may affect the level of a biomarker in a subject. Such an individualised normal value for the level of a biomarker, which is adjusted to take account of such phenotypic and/or genetic factors, may be used as a personalised clinical cut-off value which may increase the specificity of the biomarker and/or the sensitivity of a biomarker-based diagnostic or prognostic method. A method of identifying a biomarker, e.g. for use in the diagnosis or monitoring of a disease or its treatment in a subject, is also provided.
Biomarkers, typically protein biomarkers, are used for diagnosis and management of cancers and other diseases. Examples include prostate-specific antigen (PSA) used to screen for prostate cancer, the ovarian cancer-related tumour marker CA125 and IL-6, which is a drug target in rheumatoid arthritis. Many other biomarkers are in clinical use or have been proposed or are being investigated, either as markers for use in disease detection or diagnosis, to predict responsiveness to a medicament or other therapy, and/or to monitor the progress of a disease and/or its treatment. As well as protein biomarkers, genetic markers have also been identified and investigated. The discovery of putative biomarkers for the early identification and management of cancer and other diseases has been greatly facilitated in recent years by high throughput, genome-wide assays. Gene expression analyses have discovered numerous genes that are differentially expressed between cancerous or diseased tissue and healthy tissue, but few have proven suitable for use as biomarkers, mainly because mRNA levels do not correlate well with protein abundance.
Biomarkers used for disease diagnosis or monitoring should ideally be uniquely present or overexpressed in the diseased tissue or blood and not influenced by confounding factors; that is, they should display deviating levels in affected individuals only and be robust to factors unrelated to disease. However, most current biomarkers have a function in a normal cell, taking part in e.g. signalling pathways, controlling growth, apoptosis and/or inflammation. They are not uniquely expressed in cancerous or diseased tissue. Additionally, the level of these biomarkers may be affected by a number of factors, such as an individual's genetic and physical constitution, lifestyle and medication. Whilst it has been recognised that certain biomarkers may in certain conditions be affected by various factors such as medications taken, smoking or age, and that others may be affected by genetic variations present in an individual subject, there has not so far been a detailed systematic study of biomarker variation in normal, non-diseased subjects, or of the effects that different non-disease related factors, such as lifestyle, environmental, anthropomorphic and clinical factors, may have on biomarker abundance levels.
The present inventors have undertaken such a study, to investigate the causes of variation in the abundance levels in a clinical sample of a set of diverse established or putative biomarkers for different diseases, including cancer, autoimmune diseases and inflammatory conditions. We determined the effect of a wide range of clinical and anthropomorphic variables and lifestyle or environmental factors on biomarker levels in a population of subjects. Further, a genome-wide analysis was performed to study the possible effects of genetic variations in the population on the levels of biomarkers in the samples. This study is the first to measure biomarker abundance on a large scale in a general population, using the same technology for all the biomarkers and for all the subjects in the population, to assess contributing factors for normal variation. We have found that, to a surprising degree, a number of factors, unrelated to the disease, have dramatic effects on the level of the biomarkers. More particularly, our models have shown that such factors can account for around 20-60% of the variation seen in levels of biomarker in non-diseased subjects. We propose that information on the specific factors, genetic and/or otherwise, that affect each marker, and the way (direction) that they affect the level, can be used significantly to reduce the variance and to derive individualised normal levels and thereby contribute to increased specificity of the biomarkers as diagnostic or prognostic tools.
More particularly, we propose that a detailed understanding of the factors that affect the level of a biomarker will be beneficial for biomarker use in a clinical setting and particularly before any of the growing number of candidate biomarkers are used in a clinical setting. The present invention thus aims to understand the factors that influence normal variation in levels of a biomarker in a clinical sample, with the goal of determining an individualised normal level of a biomarker for a test subject, to establish a personalised clinical cut-off value that would increase the sensitivity of using biomarkers in clinical practice.
The present invention provides a method of determining the effect that any of a number of lifestyle, anthropomorphic, clinical and genetic factors have on the level of a biomarker within a subject. Based on this information an individualised normal level of a biomarker may be derived for an individual subject given that individual's lifestyle, anthropomorphic, clinical and genetic factors, and used to determine an individualised clinical cut-off value for a biomarker in that subject, thereby to enable a more efficient use of a biomarker in personalised disease management.
Thus in one aspect the present invention provides a method for determining an individualised normal level of a biomarker for a test subject for use in analysis of said biomarker in the diagnosis or monitoring of a disease or its treatment in said subject, said method comprising:
(a) determining the level of a biomarker in samples of a body fluid or tissue in a control population free from said disease, to obtain a set of control abundance levels for said biomarker in a said sample;
(b) analysing the control biomarker abundance levels of step (a) with respect to one or more non-disease related phenotypic factors to determine which phenotypic factors have a statistically significant effect on the biomarker abundance levels in said control population thereby to identify phenotypic covariates for said biomarker, and performing a statistical analysis to determine the effect of any such phenotypic covariate(s) identified on the variance of the control abundance levels;
(c) optionally transforming the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, to obtain a normal distribution;
(d) using the normalised residual control abundance level values from step (c), or if normalisation step (c) is not performed the residual control abundance levels from step (b) (which have been adjusted for the effect of the phenotypic covariates), in a step of statistical analysis of genetic data comprising genetic variants identified in said control population to determine whether any one or more non-disease related genetic covariate(s) have an effect on the abundance levels of said biomarker in said control population;
(e) generating a model which is capable of adjusting an abundance level of a said biomarker in a said sample for the effect of the phenotypic and/or genetic covariates identified in steps (b) and (d);
(f) assessing the phenotype and/or genotype of the test subject with respect to the phenotypic and/or genetic covariates for said biomarker identified in steps (b) and (d) to determine the individual phenotypic and/or genetic covariates for said test subject; and
(g) using the model of step (e) to determine a value for a normal level for the abundance of the biomarker in a said sample from the said test subject having said individual phenotype and genotype as determined in step (f), thereby to determine an individualised normal level of the biomarker for said test subject.
It will be seen that the method of the invention relies on identifying non-disease related phenotypic and genetic factors which have a statistically significant effect on the abundance level of a biomarker. The method provides a model which is capable of adjusting an abundance level of a biomarker in a test subject, thereby to determine an individualised normal level of a biomarker for a test subject once the identified phenotypic and genetic factors identified in steps (b) and (d) have been assessed for the test subject. The model generated by the method of the invention integrates information on the relevant non-disease related phenotypic and/or genetic factors for a biomarker and uses this to determine, or to calculate, an individualised normal level. By taking account of the non-disease related factors which may affect the biomarker level in the clinical sample under test, disease- related deviations from such an individualised normal level can be more specifically identified, or more precisely assessed. Thus the precision, or accuracy, of the biomarker may be improved. In some cases, the method may allow the use of biomarkers for use in the diagnosis or monitoring of a disease or its treatment that were previously not suitable for these purposes, thereby increasing the number of candidate biomarkers which could be used in a clinical setting.
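As a purely illustrative sketch, not taken from the specification, steps (b), (e) and (g) could be realised with an ordinary least-squares linear model. The choice of a linear model, the covariates (age, smoking status and the dosage of a single genetic variant), the function names and all numerical values below are assumptions made for illustration only; the control data are synthetic and noiseless.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_covariate_model(covariates, levels):
    """Least-squares fit of: level ~ intercept + covariates (hypothetical model form)."""
    x = [[1.0] + list(row) for row in covariates]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, levels)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def individualised_normal_level(coefficients, subject_covariates):
    """Expected biomarker level for a subject with the given covariate values."""
    return coefficients[0] + sum(
        c * v for c, v in zip(coefficients[1:], subject_covariates)
    )


# Synthetic disease-free controls: (age, smoker 0/1, variant dosage 0/1/2).
control_covariates = [
    (30, 0, 0), (40, 1, 1), (50, 0, 2), (60, 1, 0), (35, 1, 2), (55, 0, 1),
]
control_levels = [3.5, 4.5, 3.5, 6.0, 3.75, 4.25]  # invented abundance levels

coefficients = fit_covariate_model(control_covariates, control_levels)

# Test subject assessed for the same covariates: age 45, smoker, one variant copy.
normal_level = individualised_normal_level(coefficients, (45, 1, 1))
print(round(normal_level, 3))
```

In this sketch the fitted coefficients play the role of the model of step (e), and the value returned for the test subject plays the role of the individualised normal level of step (g).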
A further aspect of the invention is directed to the generation of the model. In this aspect the present invention provides a method of generating a model which is capable of adjusting an abundance level of a biomarker in a sample of a body tissue or fluid for the effect of phenotypic and/or genetic covariates which affect the level of said biomarker in said sample, said method comprising:
(a) determining the level of a biomarker in samples of a body fluid or tissue in a control population free from said disease, to obtain a set of control abundance levels for said biomarker in a said sample;
(b) analysing the control biomarker abundance levels of step (a) with respect to one or more non-disease related phenotypic factors to determine which phenotypic factors have a statistically significant effect on the biomarker abundance levels in said control population thereby to identify phenotypic covariates for said biomarker, and performing a statistical analysis step to determine the effect of any such phenotypic covariate(s) identified on the variance of the control abundance levels;
(c) optionally transforming the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, to obtain a normal distribution;
(d) using the normalised residual control abundance level values from step (c), or if normalisation step (c) is not performed the residual control abundance levels from step (b) (which have been adjusted for the effect of the phenotypic covariates), in a step of statistical analysis of genetic data comprising genetic variants identified in said control population to determine whether any one or more non-disease related genetic covariate(s) have an effect on the abundance levels of said biomarker in said control population; and
(e) generating a model which is capable of adjusting an abundance level of a said biomarker in a said sample for the effect of the phenotypic and/or genetic covariates identified in steps (b) and (d).
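By way of a hedged illustration of steps (c) and (d), the residuals left after phenotypic adjustment might be normalised with a rank-based inverse normal transform and then regressed on the dosage of a single genetic variant. Both the particular transform and the single-variant regression are assumptions chosen for this sketch, as are the function names and the invented residual and dosage values.

```python
from statistics import NormalDist, mean


def rank_inverse_normal(values):
    """Map values to standard normal quantiles by rank (ties are not handled)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        transformed[index] = NormalDist().inv_cdf((rank - 0.5) / n)
    return transformed


def variant_effect(dosages, residuals):
    """Slope and t-statistic of the simple regression: residual ~ dosage."""
    n = len(dosages)
    mean_d, mean_r = mean(dosages), mean(residuals)
    sxx = sum((d - mean_d) ** 2 for d in dosages)
    sxy = sum((d - mean_d) * (r - mean_r) for d, r in zip(dosages, residuals))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_d
    sse = sum((r - intercept - slope * d) ** 2 for d, r in zip(dosages, residuals))
    standard_error = (sse / (n - 2) / sxx) ** 0.5
    return slope, slope / standard_error


# Invented residuals from the phenotypic adjustment of step (b), one per control.
residuals = [-1.2, 0.4, 2.5, -0.3, 0.9, 3.1, -2.0, 0.1]
# Invented dosages (0, 1 or 2 copies) of one hypothetical variant per control.
dosages = [0, 1, 2, 0, 1, 2, 0, 1]

normalised = rank_inverse_normal(residuals)
slope, t_statistic = variant_effect(dosages, normalised)
print(round(slope, 3), round(t_statistic, 2))
```

A genome-wide analysis would repeat the second function over many variants and apply a multiple-testing threshold; that step is omitted here.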
It will be appreciated that for any given biomarker the analysis of a control population need be performed only once, to identify the phenotypic and/or genetic covariates, and to analyse their effect. Thus, in the practice of the method in the context of an individual test subject, a predetermined model may be used, to analyse the phenotypic and/or genotypic data for that test subject, to determine the individualised normal level. In such a case only steps (f) and (g) of the method above for determining the individualised normal level would be performed.
Accordingly, in a further aspect, the present invention also provides a method for determining an individualised normal level of a biomarker for a test subject for use in analysis of said biomarker in the diagnosis or monitoring of a disease or its treatment in said subject, said method comprising:
(i) assessing the phenotype and/or genotype of the test subject with respect to phenotypic factors and/or genetic variants which have been identified as being phenotypic and/or genetic covariates for the abundance level of the biomarker in a said sample (more particularly, as defined in steps (b) and (d) of the method above), to determine the individual phenotypic and/or genetic covariates for said test subject;
(ii) using a model obtained according to the model generation method above to determine a value for a normal level for the abundance of the biomarker in a said sample from the said test subject having said individual phenotype and genotype as determined in step (i), thereby to determine an individualised normal level of the biomarker for said test subject.
However, in other embodiments, even though covariates may have already been identified and a model generated, the step of generating a model may be repeated in the context of an individual test subject, or group of test subjects, for example a group of subjects with a disease. In such a case, the same panel of covariates may be used, or a smaller subset of the covariates or indeed a selected subset or panel of covariates may be used, for example based on covariates common to the control population and the disease group.
In a still further aspect the present invention provides a method of detecting a biomarker in a test subject, said method comprising:
(a) determining the level of the biomarker in a body fluid or tissue sample of said test subject;
(b) using a model obtained according to the invention as hereinbefore defined to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and/or genetic covariates on the abundance level of the biomarker in a said sample; and
(c) comparing said adjusted value to an individualised normal level for the marker in a said sample obtained according to the method of the invention as hereinbefore defined.
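One possible reading of steps (b) and (c), offered only as a sketch, is to subtract the modelled covariate effects from the measured level and to compare the result with the level expected in the absence of those effects. The linear coefficients, the residual standard deviation of the controls and the cut-off of two standard deviations are all hypothetical values introduced for this illustration.

```python
def covariate_effect(coefficients, subject_covariates):
    """Modelled contribution of the subject's covariates (intercept excluded)."""
    return sum(c * v for c, v in zip(coefficients[1:], subject_covariates))


def adjusted_level(observed, coefficients, subject_covariates):
    """Measured level with the modelled covariate effects removed."""
    return observed - covariate_effect(coefficients, subject_covariates)


def exceeds_individual_cutoff(observed, coefficients, subject_covariates,
                              residual_sd, n_sd=2.0):
    """True if the adjusted level deviates from the reference by more than n_sd SDs.

    The comparison is equivalent to comparing the measured level with the
    individualised normal level (intercept plus covariate effects).
    """
    reference = coefficients[0]
    adjusted = adjusted_level(observed, coefficients, subject_covariates)
    return abs(adjusted - reference) > n_sd * residual_sd


# Hypothetical model: intercept, then effects of age, smoking and variant dosage.
model_coefficients = [2.0, 0.05, 1.0, -0.5]
control_residual_sd = 0.4          # assumed spread of adjusted control values
subject = (45, 1, 1)               # age 45, smoker, one variant copy

within_normal = exceeds_individual_cutoff(5.0, model_coefficients, subject,
                                          control_residual_sd)
deviating = exceeds_individual_cutoff(6.0, model_coefficients, subject,
                                      control_residual_sd)
print(within_normal, deviating)
```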
A yet further aspect of the invention provides a method of diagnosing or monitoring a disease, or the treatment thereof, in a subject, said method comprising detecting the presence of a biomarker in said subject using the hereinbefore defined detection method.
It will be seen that the model provided by the invention may be used to identify new biomarkers (e.g. to determine whether a new or known biomarker (e.g. a new or known protein) is useful as a biomarker for a particular disease) and/or confirm or establish the utility of putative or candidate biomarkers for a particular disease. In this respect, it is possible to determine whether a candidate or putative biomarker would be useful as a biomarker for a particular disease by comparing the adjusted value(s) of the abundance level of the candidate or putative biomarker derived from a subject (or population of subjects) with a disease to the adjusted value(s) of the abundance level of the candidate or putative biomarker derived from a subject (or population of subjects) free from the disease, i.e. a control subject or population. A difference between the adjusted levels (i.e. an increase or decrease), particularly a statistically significant difference as defined below, may be indicative that the biomarker would find utility in the diagnosis or monitoring of the disease or its treatment in a subject. As noted above, the analysis of a control subject or population need be performed only once, to identify the phenotypic and/or genetic covariates, and to analyse their effect, i.e. to determine the adjusted value for the abundance level of the biomarker in a control subject or population. However, in other embodiments, the step of adjusting the value(s) for the abundance level of a biomarker in a control subject or population may be repeated, e.g. in the context of an individual disease or candidate biomarker.
Thus, a further aspect of the invention provides a method of identifying a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject, said method comprising:
(a) determining the level of a candidate biomarker in a body fluid or tissue sample of a subject with said disease;
(b) using the model obtained according to the invention as hereinbefore defined to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and/or genetic covariates on the abundance level of the biomarker in a said sample;
(c) comparing the adjusted value from (b) to an adjusted value for the abundance level of said biomarker in a sample from a subject free from said disease; and
(d) determining whether there is a difference between said adjusted values (i.e. the adjusted value for the subject with said disease and the adjusted value for the subject free from said disease);
wherein the presence of a difference between said adjusted values identifies the candidate biomarker as a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject.
More particularly, the invention may be seen to provide a method of identifying a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject, said method comprising:
(a') determining the level of a candidate biomarker in samples of a body fluid or tissue in a control population free from said disease to obtain a set of control abundance levels for said biomarker in said samples;
(b') using the model obtained according to the invention as hereinbefore defined to calculate a set of adjusted abundance values for said biomarker, wherein said values are adjusted for the effects of phenotypic and/or genetic covariates on the abundance level of a biomarker in said samples;
(c') determining the level of a candidate biomarker in a body fluid or tissue sample of a subject with said disease;
(d') using the model obtained according to the invention as hereinbefore defined to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and genotypic covariates identified in step (b');
(e') comparing said adjusted value of step (d') to the set of adjusted values of step (b') for the biomarker in a sample from a subject free from said disease; and
(f') determining whether there is a difference between said adjusted value from the subject with said disease, and the adjusted values from the control population;
wherein the presence of a difference between said adjusted values identifies the candidate biomarker as a biomarker for use in the diagnosis or monitoring of said disease or its treatment in a subject.
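The comparison of adjusted values in the final steps could, for example, be carried out with an exact two-sided permutation test on the difference in group means. This test is an assumption of the present sketch and is not stated to be the statistical test used by the inventors; the function name and the adjusted values are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(disease_values, control_values):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(disease_values) + list(control_values)
    k = len(disease_values)
    rest = len(pooled) - k
    total = sum(pooled)
    observed = abs(mean(disease_values) - mean(control_values))
    extreme = 0
    count = 0
    # Enumerate every way of relabelling k of the pooled values as "disease".
    for indices in combinations(range(len(pooled)), k):
        group_sum = sum(pooled[i] for i in indices)
        difference = abs(group_sum / k - (total - group_sum) / rest)
        count += 1
        if difference >= observed - 1e-12:  # tolerance for rounding error
            extreme += 1
    return extreme / count


# Invented covariate-adjusted abundance values for a candidate biomarker.
control_adjusted = [-0.3, 0.1, 0.2, -0.1, 0.0, 0.4, -0.2, 0.1]
disease_adjusted = [0.9, 1.2, 0.8, 1.5, 1.1]

p_value = exact_permutation_p_value(disease_adjusted, control_adjusted)
print(p_value < 0.05)
```

Exhaustive enumeration is practical only for small groups; for larger cohorts a random sample of relabellings or a parametric test would be the usual substitute.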
It is noted that the use of the biomarker in the diagnosis or monitoring of a disease or its treatment in a subject preferably does not form part of (i.e. does not form a step in) the method of identifying a biomarker described above. It will be appreciated that for any given biomarker different phenotypic and genetic covariates may be identified and will be used. Although we have found that in many cases both phenotypic and genetic covariates are identified and therefore both types of covariate are used in the methods of the invention, in some cases only phenotypic or only genetic covariate(s) will be identified for any given biomarker. Thus, in such a case only one or more phenotypic or only one or more genetic covariates are used in the model generation and individual test subject assessment steps.
It will also be appreciated that the methods of the invention are not limited to analysing single biomarkers, or one biomarker at a time, and one or more biomarkers may be analysed or assessed or identified. Thus, the methods may be performed using a combination of two or more biomarkers. It is known in this regard that in some cases combinations of markers may be used together, and that such combinations may improve biomarker-based predictions. A model may be generated for each biomarker in such a combination to correct for the effects of the covariates identified for that biomarker, and individualised levels determined for each biomarker and used in combination. By way of representative example, a combination may comprise 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 40, or 50 or more biomarkers, or alternatively up to any one of the aforementioned integers.
The method of the invention may be used to determine an individualised normal level of any biomarker for a test subject that could be used in the diagnosis or monitoring of any disease or its treatment. Thus it is contemplated that biomarkers of the present invention may include any type of molecule that can be detected in any clinical sample and used as a biomarker. Thus the biomarker can be any molecule that occurs in the body. It may for example be a protein or peptide or any molecule comprising a protein or peptide (hereinafter termed a proteinaceous molecule), a lipid or lipid-containing molecule, e.g. a fatty acid, steroid, lipoprotein, nucleic acid, carbohydrate, e.g. glycan or sugar.
In a preferred aspect of the present invention the biomarker is a proteinaceous molecule, for instance any protein complex, soluble or insoluble protein, polypeptide or peptide; the terms "protein" and "proteinaceous" are used broadly herein to include proteins, polypeptides and peptides, i.e. any molecule comprising amino acids linked by amide bonds, regardless of size. The protein biomarker can play any functional or structural role in the body. Thus it may include a signalling peptide, pro-peptide, proteolysis product or hormone, blood protein, cytokine, antibody, lectin, selectin, connective tissue protein or indeed any structural protein, cell receptor, membrane protein, enzyme, e.g. kinase, phosphatase, protease, prion protein, apoptosis factor, or a protein involved in DNA replication or repair or regulation of gene expression, e.g. a transcription factor. Examples of blood proteins include albumins, globulins, fibrinogen, regulatory proteins and clotting factors. Globulins may include Alpha 1 globulins, Alpha 2 globulins, Beta globulins (such as beta-2-microglobulin, plasminogen, angiostatins, properdin, sex hormone binding globulin and transferrin) and Gamma globulins, which may include immunoglobulins. Immunoglobulins may be IgA, IgD, IgE, IgG and IgM antibodies, or immunoglobulin heavy chain, immunoglobulin light chain, portions or fragments thereof or immunoglobulin domains. Antibodies directed to specific antigens may be used as biomarkers.
Examples of cytokines include chemokines, Tumour Necrosis Factors (TNFs), or interleukins. Classes of chemokines include CCL proteins (for example CCL1, CCL2/MCP-1, CCL3/MIP-1α, CCL4/MIP-1β, CCL5/RANTES, CCL6, CCL7, CCL8, CCL9, CCL11, CCL12, CCL13, CCL14, CCL15, CCL16, CCL17, CCL18/PARC/DC-CK1/AMAC-1/MIP-4, CCL19, CCL20, CCL21, CCL22, CCL23, CCL24, CCL25, CCL26, CCL27, CCL28), CXCL proteins (for example CXCL1/KC, CXCL2, CXCL3, CXCL4, CXCL5, CXCL6, CXCL7, CXCL8/IL8, CXCL9, CXCL10, CXCL11, CXCL12, CXCL13, CXCL14, CXCL15, CXCL16, CXCL17), CX3CL proteins, such as CX3CL1, or XCL proteins such as XCL1, XCL2. TNF proteins may include TNF (formerly TNF-α), Lymphokines (TNFB/LTA or TNFC/LTB), TNFSF4, TNFSF5/CD40LG, TNFSF6, TNFSF7, TNFSF8 (also known as CD30-L), TNFSF9, TNFSF10, TNFSF11, TNFSF13, TNFSF13B (also known as BAFF), EDA. Interleukins may include both type I and type II interleukins. Type I interleukins may include ST2, IL2, IL15, IL4, IL13, IL7, IL9, IL21, IL3, IL5, GM-CSF, IL6, IL11, IL27, IL30, IL31, IL12, IL-12B, IL23, IL27A, IL35, IL14, IL16, IL32, or IL34. Type II interleukins may include IL10 family interleukins (IL-10, IL-19, IL-20, IL-22, IL-24 and IL-26), and interferons, including IFNA1, IFNA2, IFNA4, IFNA5, IFNA6, IFNA7, IFNA8, IFNA10, IFNA13, IFNA14, IFNA16, IFNA17, IFNA21, IFNB1, IFNK, IFNW1 and IFN-γ. Cytokines may also include Macrophage colony stimulating factor 1 (CSF-1), IL-1ra, TNFSF14, Kit ligand (SCF), Fms-related tyrosine kinase 3 ligand (FLT3LG) and TNF-related apoptosis-inducing ligand (TRAIL).
Examples of enzymes include carbonic anhydrase 9 (CAIX), Thiopurine methyltransferase, UDP-glucuronosyltransferase 1-1 (UGT-1A), myeloperoxidase, NAD-dependent deacetylase sirtuin-2 (SIRT2), Eosinophil Cationic Protein (ECP), Triosephosphate isomerase (TIM) and Chitinase-3-like protein 1 (CHI3L1). Enzymes may also include proteases, such as stromelysin-1 (MMP-3), Matrix metalloproteinase-1 (MMP-1), Matrix metalloprotease-7 (MMP-7), Matrix metalloproteinase-10 (MMP-10), Matrix metalloproteinase-12 (MMP-12), caspase-3 (CASP-3), caspase-8 (CASP-8), Kallikrein-6 (KLK6), Kallikrein-11 (hK11), Cathepsin-D (CTSD), Cathepsin L1, prostasin (PRSS8), Renin, Tissue plasminogen activator (tPA or PLAT), Pappalysin-1 (PAPPA), prostate-specific antigen (PSA), Membrane-bound aminopeptidase P and tartrate-resistant acid phosphatase type 5 (TR-AP). Protease inhibitors, such as WAP four-disulphide core domain protein 2 (WFDC2), metallopeptidase inhibitor 1 (TIMP1), and Cystatin-B (CPI-B) may also be detected in the method of the present invention. Enzymes may also include kinases, for example B-Raf, mitogen-activated protein kinases and FIP1L1-PDGFR alpha kinase.
Examples of cell surface proteins include CD40, CD40-L (also known as CD154), Tumor necrosis factor ligand superfamily member 6 (FasL), Fms-related tyrosine kinase 3 (FLT-3) (also known as CD135), Tissue Factor (TF). Cell surface proteins may also include receptors, such as Estrogen Receptor (ER), progesterone receptor (PR), HER2, Angiopoietin receptors TIE1 and TIE2, Basigin, Receptor for Advanced Glycation Endproducts (RAGE), Proto-oncogene tyrosine-protein kinase Src, LOX-1, Protease activated receptors (PAR-1, PAR-2 and PAR-3), Hepatocyte Growth Factor Receptor (HGF-R), TNF-R1, TNF-R2, Interleukin-6 receptor subunit alpha (IL-6RA), MHC class I polypeptide related sequence A (MIC-A), Interleukin-17 receptor B (IL-17RB), Interleukin-2 receptor subunit A (IL-2RA), Interleukin-6 receptor subunit A (IL-6RA), Epidermal growth factor receptor (EGF-R), Receptor tyrosine-protein kinase erbB-2 (ErbB2), Receptor tyrosine-protein kinase erbB-3 (ErbB3), Receptor tyrosine-protein kinase erbB-4 (ErbB4), Platelet-derived growth factor receptors (PDGF-R), Retinoic acid receptor alpha (RAR-α), Tumor necrosis factor receptor superfamily member 5 (FAS), TRAIL-R2, Osteoprotegerin (OPG), Folate receptor alpha (FR-alpha), Urokinase plasminogen activator surface receptor (U-PAR) and Vascular endothelial growth factor receptor 2 (VEGFR-2). Alternatively, cell surface proteins may be a cell adhesion molecule, for instance carcinoembryonic antigen-related cell adhesion molecule 5 (CEA), E-selectin (also known as CD62E, ELAM-1 or LECAM2), Selectin P ligand (PSGL-1), Platelet endothelial cell adhesion molecule (PECAM-1) (also known as CD31) and Epithelial cell adhesion molecule (Ep-CAM).
Certain cell surface proteins are also known to be antigens that can be detected as markers for cancer, such as CA242, CD30 and mucin-16 (MUC-16/CA125). Cell surface proteins can also be cleaved from the cell membrane, and thus be detected as soluble proteins in a blood or tissue sample. Other soluble proteins that can act as markers for cancer include Human epididymis protein 4 (HE4),
The biomarker may also be a lectin. These may include Regenerating islet-derived protein 4 (REG-4), CD69 and galectin-3 (Gal-3).
Examples of connective tissue proteins include collagens (including collagen I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI, XXII, XXIII, XXIV, XXV, XXVI, XXVII and XXVIII), elastin, fibrillins (including fibrillin 1, 2, 3 and 4), fibulins (including fibulin 1, 2, 3, 4, 5, and 7, and HMCN1), latent transforming growth factor binding proteins (LTBPs) (including LTBP 1, 2, 3 and 4), perlecans, elaunin and oxytalan.
Hormones that may be biomarkers include Adrenomedullin (ADM), Agouti-related peptide (AgRP), Erythropoietin (EPO), follistatin (FS), prolactin (PRL), Amylin, Anti-Müllerian hormone, adiponectin, adrenocorticotropic hormone, angiotensin, antidiuretic hormone (ADH), atrial-natriuretic peptide, brain natriuretic peptide, calcitonin, cholecystokinin, corticotropin-releasing hormone, enkephalin, endothelin, Follicle-stimulating hormone (FSH), galanin, gastrin, ghrelin, gonadotropin-releasing hormone, growth-hormone releasing hormone, hepcidin, human chorionic gonadotropin, human placental lactogen, inhibin, insulin-like factor, insulin, leptin, lipotropin, luteinising hormone, melanocyte stimulating hormone, motilin, orexin, oxytocin, pancreatic polypeptide, parathyroid hormone, prolactin-releasing hormone, relaxin, renin, resistin, secretin, somatostatin, thrombopoietin, thyroid-stimulating hormone, thyrotropin-releasing hormone, vasoactive intestinal peptide and glucagon.
Growth factors may also be biomarkers in the methods of the present invention. Examples of growth factors include epiregulin (EPR), betacellulin (BTC), Vascular endothelial growth factor A (VEGF-A), Vascular endothelial growth factor D (VEGF-D), Epidermal growth factor (EGF), Melanoma-derived growth regulatory protein (also known as Melanoma inhibitory activity protein (MIA)), amphiregulin (AR), probetacellulin, betacellulin, growth differentiation factor 15 (GDF-15), Growth hormone (GH) (also known as somatropin), proheparin-binding EGF-like growth factor (HB-EGF), Hepatocyte growth factor (HGF), Nerve growth factor (NGF), Beta-nerve growth factor (Beta-NGF), Midkine, connective tissue growth factor (CTGF), Platelet-derived growth factor subunit B (PDGF subunit B), Placenta growth factor (PlGF), Transforming growth factor beta-1 (TGF-β-1), Protransforming growth factor alpha (TGF-α) and Fibroblast growth factor 23 (FGF23).
Various other proteins may also be biomarkers in the methods of the present invention. For example, intracellular proteins involved in cell signalling pathways such as Myeloid differentiation primary response protein MyD88 (MYD88) and Fatty Acid binding protein (adipocyte) (FABP4) may be tested. Heat shock proteins (such as HSP-27), Dickkopf-related protein 1 (DKK1), latency-associated peptide (LAP), Endothelial cell-specific molecule 1 (ESM-1), myoglobin, haemoglobin, UGT1A1, KRAS, p53, BRCA1, BRCA1, p16, CDKN2B, p14ARF, MYOD1, CDH1, CDH13, S100 proteins (such as Protein S100-A12 (EN-RAGE)), Thrombomodulin (TM), Pentraxin-related protein PTX3 (PTX3), cytochrome c, nucleosomes, F-spondin (also known as SPON-1) and NF-kappa-B essential modulator (NEMO) may also be detected.
The method of the present invention may also be used in the detection of plaque proteins, including amyloid protein and tau protein. It may also be desirable to determine the abundance levels of isoforms of apolipoprotein in a test subject according to the present invention. Peptides such as galanin may also be detected. In one representative example, the biomarker may be selected from the list of biomarkers investigated in Example 1 below (see Table 4). In another representative example, the biomarker may be selected from the list of biomarkers investigated in Example 2 (see Table 6). In yet another representative example, the biomarker may be selected from the list of biomarkers investigated in Example 4 (see Table 9).
According to the present invention a subject or test subject, and hence a control subject (that is a subject in a control population), may be any human or non-human animal subject, but particularly will be any mammalian organism. Preferably the subject will be a human, but other subjects or test subjects may be domestic or livestock animals, zoo animals, horses etc. The subjects in the control population will be the same as the subject or test subject (in the sense of same species etc.).
A sample obtained from any bodily fluid or tissue may be used in the methods of the present invention. The sample may thus be any clinical sample. It may thus be any sample of body tissue, cells or fluid, e.g. a biopsy sample, or any sample derived from the body, e.g. a swab, washing, aspirate or rinsate etc. Suitable clinical samples include, but are not limited to, blood, serum, plasma, blood fractions, joint fluid, urine, semen, saliva, faeces, cerebrospinal fluid, gastric contents, vaginal secretions, mucus, a tissue biopsy sample, tissue homogenates, bone marrow aspirates, bone homogenates, sputum, aspirates, wound exudate, swabs and swab rinsates e.g. a nasopharyngeal swab, other bodily fluids and the like. In a preferred embodiment, the clinical sample is blood or a blood-derived sample, e.g. serum or plasma or a blood fraction.
The present invention may be used to determine an individualised normal level of a biomarker for a test subject, or identify a biomarker, that is associated with the diagnosis or monitoring of any known disease or its treatment. The disease may include any known clinical condition, syndrome or disorder, including clinical conditions or states which precede or presage overt or symptomatic disease, including notably for example cancer, autoimmune disease, neurological disorders e.g. neurodegenerative diseases, infectious disease, inflammation or any inflammatory disease or condition, connective tissue diseases, cardiovascular diseases or conditions or endocrine disorders. In some embodiments, the disease is a non-communicable disease.
Representative cancers include Acute Lymphoblastic Leukaemia (ALL), Acute Myeloid Leukaemia (AML), Adrenocortical Carcinoma, AIDS-Related Cancer (e.g. Kaposi Sarcoma and Lymphoma), Anal Cancer, Appendix Cancer, Astrocytomas, Atypical Teratoid/Rhabdoid Tumour, Basal Cell Carcinoma, Bile Duct Cancer, Extrahepatic, Bladder Cancer, Bone Cancer (e.g. Ewing Sarcoma, Osteosarcoma and Malignant Fibrous Histiocytoma), Brain Stem Glioma, Brain Cancer, Breast Cancer, Bronchial Tumours, Burkitt Lymphoma, Carcinoid Tumour, Cardiac (Heart) Tumours, Cancer of the Central Nervous System (including Atypical Teratoid/Rhabdoid Tumour, Embryonal Tumours, Germ Cell Tumour, Lymphoma), Cervical Cancer, Chordoma, Chronic Lymphocytic Leukaemia (CLL), Chronic Myelogenous Leukaemia (CML), Chronic Myeloproliferative Disorder, Colon Cancer, Colorectal Cancer, Craniopharyngioma, Cutaneous T-Cell Lymphoma, Bile Duct Cancer, Extrahepatic, Ductal Carcinoma In Situ (DCIS), Embryonal Tumours, Endometrial Cancer, Ependymoma, Esophageal Cancer, Esthesioneuroblastoma, Ewing Sarcoma, Extracranial Germ Cell Tumour, Extragonadal Germ Cell Tumour, Extrahepatic Bile Duct Cancer, Eye Cancer (including Intraocular Melanoma and Retinoblastoma), Fibrous Histiocytoma of Bone, Gallbladder Cancer, Gastric (Stomach) Cancer, Gastrointestinal Carcinoid Tumour, Gastrointestinal Stromal Tumours (GIST), Germ Cell Tumour, Gestational Trophoblastic Disease, Glioma, Hairy Cell Leukaemia, Head and Neck Cancer, Heart Cancer, Hepatocellular (Liver) Cancer, Histiocytosis, Langerhans Cell, Hodgkin Lymphoma, Hypopharyngeal Cancer, Intraocular Melanoma, Islet Cell Tumours, Pancreatic Neuroendocrine Tumours, Kaposi Sarcoma, Kidney Cancer (including Renal Cell and Wilms Tumour), Langerhans Cell Histiocytosis, Laryngeal Cancer, Leukaemia (including Acute Lymphoblastic (ALL), Acute Myeloid (AML), Chronic Lymphocytic (CLL) and Chronic Myelogenous (CML)), Lip and Oral Cavity Cancer, Liver Cancer (Primary), Lobular Carcinoma In Situ (LCIS), Lung Cancer, Lymphoma, Macroglobulinemia, Waldenstrom, Melanoma, Merkel Cell Carcinoma, Mesothelioma, Metastatic Squamous Neck Cancer with Occult Primary, Midline Tract Carcinoma Involving NUT Gene, Mouth Cancer, Multiple Endocrine Neoplasia Syndromes, Childhood, Multiple Myeloma/Plasma Cell Neoplasm, Mycosis Fungoides, Myelodysplastic Syndromes, Myelodysplastic/Myeloproliferative Neoplasms, Multiple Myeloma, Myeloproliferative Disorders, Nasal Cavity and Paranasal Sinus Cancer, Nasopharyngeal Cancer, Neuroblastoma, Non-Hodgkin Lymphoma, Non-Small Cell Lung Cancer, Oral Cancer, Oral Cavity Cancer, Oropharyngeal Cancer, Osteosarcoma, Ovarian Cancer, Pancreatic Cancer, Pancreatic Neuroendocrine Tumours (Islet Cell Tumours), Papillomatosis, Paraganglioma, Paranasal Sinus and Nasal Cavity Cancer, Parathyroid Cancer, Penile Cancer, Pharyngeal Cancer, Pheochromocytoma, Pituitary Tumour, Plasma Cell Neoplasm/Multiple Myeloma, Pleuropulmonary Blastoma, Pregnancy and Breast Cancer, Primary Central Nervous System (CNS) Lymphoma, Prostate Cancer, Rectal Cancer, Renal Cell (Kidney) Cancer, Renal Pelvis and Ureter, Transitional Cell Cancer, Retinoblastoma, Rhabdomyosarcoma, Salivary Gland Cancer, Sarcoma, Sezary Syndrome, Skin Cancer, Small Cell Lung Cancer, Small Intestine Cancer, Soft Tissue Sarcoma, Squamous Cell Carcinoma, Squamous Neck Cancer with Occult Primary, Metastatic, Stomach (Gastric) Cancer, T-Cell Lymphoma, Testicular Cancer, Throat Cancer, Thymoma and Thymic Carcinoma, Thyroid Cancer, Transitional Cell Cancer of the Renal Pelvis and Ureter, Urethral Cancer, Uterine Cancer, Endometrial, Uterine Sarcoma, Vaginal Cancer, Vulvar Cancer, Waldenstrom Macroglobulinemia, and Wilms Tumour.
Examples of autoimmune disease include rheumatoid arthritis, Graves' disease, Crohn's disease, autoimmune asthma, Addison's disease, motor neurone disease, multiple sclerosis, diabetes mellitus type 1, lupus, eczema, rheumatic fever, thrombocytopenia, urticarial vasculitis and vasculitis.
Examples of neurological disorders include dementia, e.g. Alzheimer's disease, Parkinson's disease, Creutzfeldt-Jakob disease (CJD), cerebral palsy, motor neurone disease, aneurysm, stroke (e.g. ischemic stroke), Ataxia telangiectasia (A-T), leukodystrophy, Huntington's disease, Pick's disease, Dawson disease, Guillain-Barré syndrome (GBS) and Wilson's disease.
Examples of infectious disease include any kind of infection in any tissue or region of the body, but notably sepsis, pneumonia, meningitis, typhus, tuberculosis, gastroenteritis, cellulitis and urinary tract infections. Diseases can also include viral infections, for example infections with HIV, HBV, measles, influenza, and viral meningitis.
Examples of connective tissue disease include Marfan syndrome, osteogenesis imperfecta, osteoarthritis, osteoporosis, rickets and scurvy.
Examples of cardiovascular disease include cardiac conditions e.g. angina, myocardial infarction, heart failure, cardiomyopathy, atherosclerosis, coronary heart disease, hypertension, cardiac dysrhythmias, endocarditis, myocarditis and rheumatic heart disease.
Examples of endocrine disorders include Addison's disease, Adrenocortical carcinoma, Type 1 and Type 2 diabetes, gestational diabetes, hyperthyroidism, hypothyroidism, thyroidosis, diabetes insipidus, hypopituitarism and hypogonadism.
The present invention may be used in any aspect of the diagnosis or monitoring of a disease and/or its treatment. Diagnosis may be viewed as the process of identifying a subject's medical condition that allows decisions or choices to be made about the subject's treatment. Thus diagnosis may include identifying the disease in a subject. A biomarker may also be used in prognosis, for example to predict the progress or development of a disease, or to predict its response to treatment e.g. to a given therapy, for example to determine which therapeutic intervention (e.g. of a number of possible options) may be effective, or may work best (or be expected to work best) or be most appropriate. Responsiveness to treatment may also be monitored.
The aim of the invention is to determine an individualised normal level of a biomarker for a test subject. A "normal" level is a level that may be expected, or may occur, or be determined in a test or control subject in the absence of the disease in question. "Normal" thus indicates absence of disease (specifically the disease under investigation). The normal level (e.g. from a control subject(s)) is adjusted, or corrected, for the effect of phenotypic and/or genetic covariates that may affect the normal level, that is that may have an effect on the level in a normal control population of non-diseased subjects ("non-diseased" in this context means absence of the disease in question). The adjusted or corrected normal level may thus be viewed as a clinical cut-off value which may be used to distinguish between the presence or absence of disease, for example to distinguish a test subject having the disease from a control subject not having the disease. An individualised normal level is adjusted or corrected for the phenotypic and/or genetic factors particular to that test subject, and thus represents a personalised clinical cut-off value. A level for the biomarker that is then determined for a test subject, may thus be adjusted, or corrected, for the phenotypic and/or genetic covariates and compared to the individualised normal value for that test subject (i.e. to the personalised cut-off), to determine whether that biomarker level (as determined for the test subject) is deviant from the individualised normal level, and hence indicative of disease, disease status or prognosis etc. (e.g. treatment responsiveness or non-responsiveness, or disease progression etc.).
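By way of illustration only, the idea of a personalised clinical cut-off can be sketched as follows. The sketch assumes a single continuous phenotypic covariate (age), a simple linear model fitted to the control population, and a cut-off placed a fixed number of residual standard deviations above the level expected for the test subject. The function names, the toy data and the choice of two standard deviations are hypothetical and are not taken from the method described above.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def individualised_cutoff(ages, levels, test_age, n_sd=2.0):
    """Return (expected level, cut-off) for a subject of the given age.

    The cut-off is the expected level plus n_sd times the sample standard
    deviation of the control residuals (an assumed, illustrative rule).
    """
    intercept, slope = fit_line(ages, levels)
    residuals = [lv - (intercept + slope * a) for a, lv in zip(ages, levels)]
    expected = intercept + slope * test_age
    return expected, expected + n_sd * stdev(residuals)

# Hypothetical control population: age in years and biomarker abundance level.
control_ages = [25, 32, 41, 48, 55, 63, 70, 78]
control_levels = [4.1, 4.4, 4.6, 5.1, 5.2, 5.8, 5.9, 6.4]

expected, cutoff = individualised_cutoff(control_ages, control_levels, test_age=60)

# A measured level for a 60-year-old test subject is compared to that cut-off.
measured_level = 6.9
is_deviant = measured_level > cutoff
```

In this toy data the biomarker level rises with age, so the same measured value may fall below the cut-off for an older subject and above it for a younger one.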
It will be appreciated that in some cases a biomarker may be detected in more than one type of clinical sample from a subject. In such a case the biomarker levels may vary depending on the sample. Accordingly the clinical sample from the control subjects should be the same sample (i.e. same type of sample) as for the test subject, e.g. the subject suspected of having the disease or a subject with the disease.
The step of determining the abundance level of a biomarker in a sample may be performed by any means in the art. The abundance level is simply a measure or indication of the level or amount of a biomarker in a given clinical sample, and may be assessed or determined in different ways, for example as an absolute or total amount of biomarker present, or as a concentration or ratio etc., or any other indication of level or amount. It will be understood that the same measure or indicator or assessment of level or amount as is determined in step (a) of the method for determining the individualised level (that is the step of determining the biomarker level in samples from the control population) will also be used to determine the biomarker level in a test subject, to enable a direct comparison to be made. It is accordingly preferred that the same means of determining abundance level be used for the control population and for the test subject, e.g. the subject suspected of having the disease or a subject with the disease.
A variety of different quantitative or semi-quantitative detection assays are known in the art for detecting different molecules, including any biomolecules that may be biomarkers. Any number of separation techniques may be used to determine the abundance level of a biomarker in a sample, for instance high performance liquid chromatography (HPLC), liquid chromatography, gel electrophoresis, or blotting techniques. Mass spectrometry techniques, such as MALDI-TOF, ESI-MS or Tandem-MS may also be used, and may be combined with chromatographic techniques for sample analysis.
Where a biomarker has enzymatic activity or is a substrate for an enzyme-catalysed reaction, the abundance level of a biomarker may also be determined directly using an enzyme-based assay. For example, spectrophotometric, fluorometric, calorimetric, chemiluminescent or radiometric assays may be used in conjunction with suitable cofactors or substrates known in the art.
Similar functional tests may be used to determine the amount or level of any biomarker having a functional effect (e.g. a biological effect) that can be determined in a functional assay.
However, in many cases it may be most practical to determine a biomarker level by using an affinity binding partner to bind to the biomarker (i.e. an affinity reagent for the biomarker). This may allow a biomarker to be separated from a sample, although such a separation step is not always necessary, and the biomarker may be detected and quantified (i.e. measured or the level determined) by determining the amount of affinity reagent bound. Typically such assays are performed using antibodies (the term "antibody" is used broadly herein to include any type of antibody, antibody fragments and antibody derivatives, including synthetic antibodies such as single chain antibodies, CDR-grafted antibodies, chimeric antibodies etc.) and an immunoassay using an antibody is a preferred means of performing the biomarker level determination steps. However, other affinity reagents are known and used, including notably other proteinaceous affinity reagents such as lectins, receptors, including immunological molecules such as T-cell receptors or antigen-binding molecules derived therefrom, and synthetic molecules such as affibodies, and proteins or peptides which may be identified by screening methods known in the art, e.g. by phage or other peptide display techniques. Other affinity binding molecules include nucleic acids, e.g. aptamers, or other oligonucleotides capable of binding to a target molecule. Procedures for obtaining and identifying such nucleic acid based affinity reagents are known, e.g. Selex procedures etc.
Such immunoassay or analogous affinity reagent-based assays may be performed in various formats according to procedures and principles well known in the art, e.g. in solution (homogenous formats) or in solid phase-based formats, e.g. sandwich assays, competitive assays etc. By way of representative example may be mentioned ELISA, immunoPCR and immunoRCA assays. The antibody or other affinity reagent may be detected in various ways, typically by labelling the reagent, directly or indirectly, with a signal-giving label (which may be detected directly or indirectly) or reporter molecule. A wide variety of such labels and reporters and assays based on them are known in the art and any of these could be used. Thus, such labels may include simple colorimetrically or otherwise spectrophotometrically detectable labels (e.g. fluorescent labels), or any label which can be detected by any other means, e.g. detectable isotopes (e.g. radiolabels), colloidal materials, particles, quantum dots etc. The reagents may be labelled with enzymes or with enzyme substrates, and the products of enzymic reactions may be detected. The label may comprise a nucleic acid molecule which may be detected, most typically amplified and detected e.g. as in immunoPCR or immunoRCA.
In one embodiment a proximity assay may be used in the biomarker detection step(s). A proximity assay relies on the principle of "proximity probing", wherein an analyte is detected by the binding of multiple (i.e. two or more, generally two or three) probes, which when brought into proximity by binding to the analyte allow a signal to be generated.
Typically at least one of the proximity probes comprises a nucleic acid domain (or moiety) linked to the analyte-binding domain (or moiety) of the probe, and generation of the signal involves an interaction between the nucleic acid moieties and/or a further functional moiety which is carried by the other probe(s). The signal generation is dependent on an interaction between the probes (more particularly between the nucleic acid or other functional moieties/domains carried by them) and hence only occurs when both (or more) of the necessary probes have bound to the analyte, thereby lending improved specificity to the detection system.
Proximity probes of the art are generally used in pairs, and individually consist of an analyte-binding domain with specificity to the target analyte, and a functional domain, e.g. a nucleic acid domain coupled thereto. The analyte-binding domain can be for example a nucleic acid "aptamer" (Fredriksson et al (2002) Nat Biotech 20:473-477) or can be proteinaceous, such as an antibody (Gullberg et al (2004) Proc Natl Acad Sci USA 101:8420-8424). The respective analyte-binding domains of each proximity probe pair may have specificity for different binding sites on the analyte, which analyte may consist of a single molecule or a complex of interacting molecules, or may have identical specificities, for example in the event that the target analyte exists as a multimer.
The concept of proximity probing has been developed in recent years and many assays based on this principle are now well known in the art, and many variations of proximity probe based assays exist, any of which can be used in the method of the present invention to determine the abundance level of a biomarker in a sample. These include proximity ligation assays (PLA) in which the nucleic acid domains of two or more probes interact by ligation, either to each other, or by templating the ligation of one or more added oligonucleotides (including the ligation of one or more oligonucleotides into a circularised nucleic acid molecule which may be detected by rolling circle amplification (RCA)) and proximity extension assays (PEA), in which the nucleic acid domains, which may be single- or partially double-stranded, may interact by hybridisation to each other, and one or more domains, or strands thereof, may be extended. The ligation or extension products generated in such assays may be detected, e.g. amplified and detected, for example by RCA where a circularised ligation product is generated or by PCR. Quantitative assays based on such techniques have been described. Any such assay may be used in the method of the present invention to determine the abundance level of a biomarker in a sample. In a preferred embodiment of the present invention a PEA is used, especially preferably a PEA wherein the extension product is determined by a quantifiable PCR method, thereby to quantify the biomarker in the sample.
For example, proximity assays are described in WO 01/61037, US 6,511,809 and WO 2006/137932 and both heterogeneous (e.g. where the analyte is first immobilised to a solid substrate by means of a specific analyte-binding reagent) and homogeneous (i.e. in solution) formats for proximity probe based assays have been disclosed, e.g. WO 01/61037, WO 03/044231, WO 2005/123963, Fredriksson et al (2002) Nat Biotech 20:473-477 and Gullberg et al (2004) Proc Natl Acad Sci USA 101:8420-8424. Although pairs of proximity probes are generally used, modifications of the proximity-probe detection assay have been described, in e.g. WO 01/61037, WO 2005/123963 and WO 2007/107743, where three proximity probes are used to detect a single analyte molecule. PEA assay formats which may be used in the present invention are described in WO 2012/104261.
Where two or more biomarkers are assessed, for example when biomarker levels in a control population are determined or when a combination of biomarkers in an individual test subject is assessed, the abundance level of a biomarker may conveniently be determined by any of the above detection assays in a multiplexed format. In a preferred embodiment of the present invention, the multiplexed detection assay is a solid phase detection assay, and in a most preferred embodiment of the present invention is a multiplexed proximity extension assay. For example multiplexed sample detection may take place on an array. In a preferred embodiment of the present invention microarray detection of a biomarker in a sample may be utilised, for example using the Olink Proseek Multiplex Oncology I 96x96 kit or the Olink Proseek Multiplex CVD I 96x96 kit.
In step (a) of the method of the present invention the abundance level of a biomarker is determined in a control population in order to obtain a set of control abundance levels for said biomarker in a given or selected clinical sample in said control population. The control population is typically a population of healthy subjects, more particularly subjects from the same species as the test subject. The control population is free from the disease for which the test subject is being diagnosed, monitored or treated (or for which the candidate or putative biomarker is being tested or identified), in order for non-disease-related phenotypic and genetic covariates to be identified. The size of the control population may vary depending on the biomarker, disease and available samples etc. However, it will typically be large enough for a statistical analysis of the variance in abundance levels to be performed. This may depend on the biomarker and the error-levels (uncertainties) in the measurements of the biomarker levels and the covariates. For a very strong correlation only a very small number may be required. A minimum population size is usually in the order of 10 control subjects, and may in some cases be less, e.g. at least 5, 6, 7, 8, or 9, but typically will be at least 20, 30, 40, 50, 60, 70, 80, 90 or 100 control subjects, more typically at least 200, 300, 400 or 500 control subjects. A minimum population size may also be at least 600, 700, 800, 900 or 1,000 control subjects, or at least 2,000, 5,000, 10,000, 20,000, 50,000 or 100,000 control subjects. A control population may also be more than 100,000 control subjects.
In some embodiments of the method of identifying a biomarker, the abundance level of a candidate biomarker is determined in a population of subjects with the disease in order to determine the adjusted value of the level of the candidate biomarker in subjects with the disease. In these embodiments, the size of the population may be defined as described above.
In step (b) the control abundance levels are analysed to determine which phenotypic factors if any have an effect on the normal biomarker level, that is the level of biomarker in a normal (in the sense of non-diseased) subject i.e. which factors if any contribute to any variance observed, and the extent of such a contribution. This analysis is performed using standard statistical analysis techniques, as described in more detail below.
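As a rough illustration of the kind of analysis meant in step (b), the contribution of a categorical phenotypic factor to the observed variance can be expressed as the between-group sum of squares divided by the total sum of squares (eta squared, as in a one-way analysis of variance). This is only one of many standard statistics that could be used; the function name and the toy data below are hypothetical.

```python
from statistics import mean

def variance_explained(levels, groups):
    """Fraction of the total variance in `levels` attributable to `groups`.

    Computed as between-group sum of squares / total sum of squares.
    """
    grand_mean = mean(levels)
    ss_total = sum((v - grand_mean) ** 2 for v in levels)
    by_group = {}
    for value, group in zip(levels, groups):
        by_group.setdefault(group, []).append(value)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in by_group.values()
    )
    return ss_between / ss_total

# Hypothetical control abundance levels with two candidate phenotypic factors.
levels = [3.0, 3.2, 2.9, 3.1, 4.0, 4.2, 3.9, 4.1]
smoker = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
sample_round = ["r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2"]

eta_smoking = variance_explained(levels, smoker)
eta_round = variance_explained(levels, sample_round)
```

In this toy data smoking status accounts for most of the variance while sample round accounts for very little, so only the former would be a candidate covariate worth adjusting for.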
One or more non-disease related phenotypic factors may be assessed with respect to biomarker abundance levels in the method of the present invention. A phenotypic factor is any possible variable which may affect a subject, whether related to the individual subject(s) themselves, or to the study. Generally phenotypic factors may include anthropomorphic characteristics, clinical parameters including medication, lifestyle factors and sample round.
Anthropomorphic characteristics that may be assessed in the method of the present invention include age, gender, and size-related characteristics, including height, weight, hip size and waist size. Combinations of anthropomorphic characteristics or ratios thereof, such as hip-to-waist ratio or body-mass index (BMI) may be assessed in order to generate further phenotypic factors for use in the statistical analysis steps of the present invention.
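For instance, derived factors of this kind could be computed from the raw measurements before the statistical analysis. The sketch below uses the usual definition of BMI (weight in kilograms divided by the square of height in metres); the dictionary keys and the example values are hypothetical.

```python
def body_mass_index(weight_kg, height_m):
    """BMI = weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

def hip_to_waist_ratio(hip_cm, waist_cm):
    """Ratio of hip circumference to waist circumference (same units)."""
    return hip_cm / waist_cm

# Hypothetical measurements for one subject.
subject = {"weight_kg": 72.0, "height_m": 1.80, "hip_cm": 100.0, "waist_cm": 80.0}

# Add the derived factors alongside the raw measurements.
subject["bmi"] = body_mass_index(subject["weight_kg"], subject["height_m"])
subject["hip_to_waist"] = hip_to_waist_ratio(subject["hip_cm"], subject["waist_cm"])
```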
Clinical parameters that may be assessed in the method of the present invention include any parameter of clinical status or health of a subject, for example blood pressure (including systolic and/or diastolic blood pressure), blood group, one or more organ function tests, e.g. lung function, liver function, kidney function, heart function, neurological function, bone density, levels of test analytes, e.g. blood lipids (including cholesterol (i.e. VLDL, IDL, LDL and/or HDL levels) and blood fatty acids), metabolite levels e.g. in various samples, or allergens, or any combination thereof. Clinical parameters may also include any medications that an individual is receiving, for example statins, insulin, chemotherapeutic agents, antibiotics, antivirals, antibody therapy, asthma medication, immunosuppressants (including steroids), blood pressure medication or medications for the treatment of any disease and painkillers. It may also be assessed whether or not a subject is pregnant.
In a particular embodiment of the present invention, it may be desirable to assign test subjects to a particular blood group, according to the ABO system. Thus the ABO status of a test subject may be determined. It may also be desirable to subtype a test subject of the A group into A1 and A2 and the O group into O01 and O02. The blood group of a test subject may be determined by conventional (i.e. serological) testing, or by genetic testing, as many different alleles associated with ABO status are known in the art. A genetic test for identifying the blood group of a test subject may identify SNPs, insertions or deletions within the ABO gene. SNPs known in the art include rs505922, rs8176746, rs8176704 and rs574347.
Lifestyle factors that may be assessed in the method of the present invention include smoking, use of recreational drugs, alcohol consumption, diet and exercise, presence of household pets (e.g. applicable to an allergy etc.) and occupation.
The methods of the invention thus may include a step of determining or assessing one or more phenotypic factors for the subjects of a control population (for use in step (b)) and also for a test subject in step (f). This may include steps of performing assays or measurements to determine or assess the control or test subjects for the presence, absence, or amount or level of a phenotypic factor. For example, one or more blood tests or analyses of other clinical samples from the subject may be performed, organ function tests or other clinical assessments may be performed, anthropomorphic measurements may be taken, clinical parameters e.g. blood group or blood pressure may be assessed, as well as gathering or assembling information on lifestyle factors.
As well as assessing for the contribution of phenotypic factors to the variance of biomarker abundance levels, genetic factors are also assessed, to identify any genetic covariates for a given biomarker and/or the extent of their contribution to the variance (step (d) together with optional step (c) of the method). This analysis step is also performed using standard statistical methods, as described in more detail below. Such standard methods lead to the generation of a model which can be used to adjust or correct a normal (e.g. control) value or a test value for a biomarker level for the effect of any phenotypic and/or genetic covariates identified.
Genetic data identifying genetic variants in a control population may be available, e.g. may have been pre-determined, depending on the control population used, for example a control population or other panel of subjects from a previous study. Such available data may be used directly in the analysis of step (c). Alternatively the method may include a step of performing one or more genetic tests on the control population in order to identify any genetic variants present in the control population. Genetic variants thereby identified are then analysed in step (d) to determine whether they have an effect on the level of a biomarker in a given clinical sample i.e. whether they contribute to, or explain any variance observed. Data obtained in such a genetic testing step may be used in combination with prior-determined genetic data.
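One simple way to test whether a genetic variant contributes to the variance is sketched below. It assumes an additive model in which each control subject's genotype at a single SNP is coded as the number of copies of one allele (0, 1 or 2), and regresses the biomarker level on that dosage. The coding, the function name and the data are illustrative assumptions and are not prescribed by the method described above.

```python
from statistics import mean

def additive_effect(dosages, levels):
    """Least-squares effect per allele copy and the R^2 of level on dosage."""
    mean_d, mean_l = mean(dosages), mean(levels)
    s_dd = sum((d - mean_d) ** 2 for d in dosages)
    s_ll = sum((lv - mean_l) ** 2 for lv in levels)
    s_dl = sum((d - mean_d) * (lv - mean_l) for d, lv in zip(dosages, levels))
    beta = s_dl / s_dd                    # change in level per allele copy
    r_squared = s_dl ** 2 / (s_dd * s_ll)  # fraction of variance explained
    return beta, r_squared

# Hypothetical control subjects: allele dosage at one SNP and biomarker level.
dosages = [0, 0, 0, 1, 1, 1, 2, 2, 2]
levels = [5.0, 5.2, 4.8, 5.6, 5.4, 5.5, 6.1, 5.9, 6.0]

beta, r_squared = additive_effect(dosages, levels)
```

A variant with a large R^2 in the control population would be a candidate genetic covariate; in practice such a test would be repeated over many variants with an appropriate correction for multiple testing.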
Genetic testing methods used to detect genetic variants are well known in the art. Examples of genetic testing methods that may be used to detect genetic variants in the method of the present invention include whole-genome SNP analysis, whole exome sequencing, and whole genome sequencing. A wide range of sequencing technologies and platforms are now available, as are various techniques for detecting a particular genetic variant e.g. for detecting predetermined or known variants and any of these may be used. In particular genetic testing may be performed on a microarray, including the Illumina Infinium HapMap300v2 BeadChip, Illumina Human Omni Express BeadChip, and Illumina Human Exome BeadChip. Exome sequencing may be performed by using Agilent's SureSelect system for exome capture and the SOLiD 5500xl instrumentation for sequencing.
If genetic covariates are identified for a biomarker, then step (f) may comprise a step of performing a genetic test on the test subject to determine the presence or absence of any one or more genetic variants identified as genetic covariates. As mentioned above, tests for detecting pre-determined genetic variants are well known, e.g. using probes or primers designed to identify a particular genetic sequence or variant. Thus such genotype assessment tests may include the use of specific PCR primers or other variant-specific amplification technologies, e.g. LCR, NASBA etc., or hybridisation probes, e.g. padlock probes, molecular inversion probes, molecular beacons etc. In one embodiment a step of sequencing a genomic sequence from the test subject may be performed.
Different types of genetic variant are known and any or all of these may be identified or analysed according to the invention. Genetic variants may include single nucleotide polymorphisms (SNPs), deletions or insertions, copy number variations (CNVs), and structural variations (e.g. recombinations etc.). In one embodiment of the present invention combinations of genetic variants may be identified, and thus genetic variants may also comprise haplotypes, that is to say more than one genetic variant may be present within a particular chromosome, portion of a chromosome or locus, which together are found to affect the level of a biomarker in a test subject. Genetic variants may be found in genic or intergenic regions of DNA. Variants found in genic regions may be found in promoter or terminator sequences, or in exonic or intronic DNA. Variants may also be found in regions of non-coding DNA transcribed into non-translated RNA molecules, for instance rRNA, tRNA, miRNA or piRNA.
The genetic testing and analysis performed in the course of the present studies has identified a number of novel genetic variants which may be used in determining the individualised normal level of a biomarker for a test subject. The nature and position of these variants, and the biomarker affected by each polymorphism is indicated below in Table 5. Further variants, and the biomarkers affected by each polymorphism are indicated below in Table 7.
In a separate and independent aspect, the present invention encompasses a method for detecting a biomarker in a sample of body fluid or tissue from a subject, which method comprises a step of determining the presence or absence of a genetic variant selected from one or more of the genetic variants listed in Table 5 and Table 7.
The biomarker may be any one or more of the biomarkers listed in Table 5 or 7. Thus in this aspect, the present invention may provide a method of testing a subject for the presence or absence of any one or more genetic variants of Table 5 or Table 7. Such testing may be carried out in the context of determining the level of a biomarker (particularly a biomarker of Table 5 or 7) in a sample of body tissue or fluid from said subject. More particularly the method may include a step of assessing the effect of the genetic variant on the level of the biomarker in the sample and/or adjusting or correcting the biomarker level for the effect of the genetic variant.
Statistical analyses are performed in order to determine the effect of any phenotypic and/or genotypic factors on the abundance level of a biomarker. The analyses may identify the contribution made by the covariates to the observed variance, or in other words identify the variables (covariates) that explain a proportion of the variance. Statistical analysis may be performed using any of the known commercially or publicly available software packages, including R, SAS (Statistical Analysis Software) or Statistica. Software suites associated with R include GenABEL and ProbABEL.
In particular the methods include a step of identification of phenotypic factors which have an effect, particularly a significant effect, on the abundance level of a biomarker in a sample from a control subject. Phenotypic factors may be identified by detecting statistical correlations between a given phenotypic factor and the abundance level of a biomarker in samples from one or more subjects, e.g. control subjects. A multiple linear regression analysis may be performed in order to determine the correlation between a phenotypic covariate and the abundance level of a biomarker. Any statistical test may be performed which can identify the covariates that explain a significant proportion of the variance seen in the measured biomarker level. For instance, the significance of each phenotypic covariate's contribution to the total variance can be estimated using an ANOVA approach as implemented by the 'anova.glm' function on the resulting generalised linear model. An F-test may also be used to assess statistical significance in the linear regression.
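The regression-based assessment described above can be illustrated with a minimal sketch. It fits a full model and a nested model without the covariate of interest by ordinary least squares, then forms the partial F statistic from the two residual sums of squares. The function names, the toy data and the use of plain Python instead of R's 'glm' and 'anova.glm' are assumptions made for illustration; the conversion of the F statistic to a p-value is omitted.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def residual_sum_of_squares(y: Sequence[float],
                            covariates: Sequence[Sequence[float]]) -> float:
    """Fit y ~ intercept + covariates by least squares and return the RSS."""
    rows = [[1.0] + list(cov) for cov in covariates]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def partial_f_statistic(y, full, reduced, n_dropped: int) -> float:
    """F statistic comparing the full model with a nested reduced model."""
    rss_full = residual_sum_of_squares(y, full)
    rss_reduced = residual_sum_of_squares(y, reduced)
    df_resid = len(y) - (len(full[0]) + 1)  # observations minus fitted parameters
    return ((rss_reduced - rss_full) / n_dropped) / (rss_full / df_resid)


# Hypothetical toy data: abundance rises with age; weight is unrelated.
ages = [20, 30, 40, 50, 60, 70, 80, 35]
weights = [70, 82, 65, 90, 75, 68, 85, 77]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.0, 0.05]
abundance = [0.05 * a + e for a, e in zip(ages, noise)]

full_model = list(zip(ages, weights))
f_age = partial_f_statistic(abundance, full_model, [(w,) for w in weights], 1)
f_weight = partial_f_statistic(abundance, full_model, [(a,) for a in ages], 1)
print(f_age > f_weight)
```

In this toy data the F statistic for age is far larger than that for weight, which is the pattern a significant covariate would show.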
Any significance value may be used to judge whether a specific covariate has a significant effect for a specific biomarker or whether the difference (i.e. increase or decrease) between the adjusted values in the method of identifying a biomarker is significant. A significance value of below 0.5, preferably below 0.4, or below 0.3, 0.2 or 0.1 may be used. In a preferred embodiment of the present invention a significance value of less than 0.1, preferably below 0.05, may be used. P-values of below 0.05 may also include p-values of below 0.04, 0.03, 0.02 or 0.01, or below.
P-values may be calculated in any of the ways known in the art. For example, a Bonferroni-adjusted p-value may be calculated when assessing whether a particular covariate has a significant effect for a specific biomarker. Thus in a preferred embodiment of the present invention, a covariate might be considered significant for a specific biomarker if its Bonferroni-adjusted p-value is below 0.05.
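As a minimal illustration of the Bonferroni adjustment, the sketch below multiplies each nominal p-value by the number of tests (capped at 1) and, equivalently, derives the per-test threshold. The function names are hypothetical; the figure of 158 tests is the number of phenotypic covariates used in Example 1 of this document.

```python
def bonferroni_adjust(p_values, n_tests=None):
    """Return Bonferroni-adjusted p-values: p * number of tests, capped at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


def bonferroni_threshold(alpha, n_tests):
    """Nominal p-value a single test must fall below to be significant."""
    return alpha / n_tests


# 158 covariates tested per protein, as in Example 1.
threshold = bonferroni_threshold(0.05, 158)
adjusted = bonferroni_adjust([0.0001, 0.01], n_tests=158)
print(threshold, adjusted)
```

Comparing an adjusted p-value with 0.05 and comparing the nominal p-value with 0.05/158 give the same decision.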
The correlation between two biomarkers may also, if desired, be calculated in the method of the present invention in which an individualised normal value of a biomarker is determined. However, this step is not essential or important for generating the model.
Abundance levels of a biomarker in a population may be rank-normalised and correlations between pairs of biomarker abundance levels may be calculated on the adjusted rank-transformed values by applying Spearman's Rho statistics on pairwise complete observations.
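A minimal sketch of Spearman's rho computed on pairwise complete observations follows. Missing measurements are represented here by None, which is an assumption about the data layout; pairs with a missing member are dropped, the remaining values are ranked (ties receive average ranks), and the Pearson correlation of the ranks is returned.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_pairwise_complete(x, y):
    """Spearman's rho using only the pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical abundance values for two biomarkers; None marks a missing value.
marker_a = [1.0, 2.0, None, 4.0, 5.0]
marker_b = [10.0, 20.0, 30.0, None, 50.0]
rho = spearman_pairwise_complete(marker_a, marker_b)
print(rho)
```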
Once phenotypic factors which have a significant effect on the abundance level of a biomarker within a test subject have been identified, and their contribution to the variance of abundance levels is assessed, the statistical analysis may result in the determination of a set of residual values, that is the parts of the abundance levels that cannot be explained by the phenotypic covariates. Thus step (b) may result in the determination of the residual values for the control abundance levels. The residual control abundance values may be used in their raw state, as determined in step (b), or they may optionally be normalised in an optional step (c). Thus the residual control biomarker abundance level values from step (b) which have been adjusted for the effect of the phenotypic covariates may be transformed in order to obtain a normal distribution. This is a standard statistical technique to minimise the effect of any outliers by giving the data unit variance, and whether or not it is performed may depend on the distribution of the raw values, e.g. how wide the distribution is, or how many outliers there are. Various methods and packages for performing a normalisation step are known and available in the art. For example, the residual values may be rank-transformed to normality, e.g. using the 'rntransform' function available from the R-package GenABEL.
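The rank-based transformation to normality can be sketched as follows. Each residual is replaced by the standard normal quantile of (rank - 0.5)/n. The offset of 0.5 is one common convention and is an assumption here; it is not claimed to reproduce the GenABEL implementation exactly, and the function names are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_inverse_normal(residuals):
    """Map residuals to standard normal quantiles of their ranks."""
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - 0.5) / n) for r in average_ranks(residuals)]


# Hypothetical residuals; 5.0 is an outlier whose influence is reduced.
residuals = [0.3, -1.2, 5.0, 0.1, -0.4]
transformed = rank_inverse_normal(residuals)
print(transformed)
```

The ordering of the values is preserved, while the outlier is pulled in to the same distance from zero as the smallest value.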
The adjusted or corrected abundance level values from step (b) or the normalised values from step (c) are used in the step of genetic analysis in step (d) to identify genetic covariates that significantly affect the abundance level of a biomarker in a subject. The abundance level values can then also be adjusted and corrected for effect of any genetic covariates. This leads to the generation of the model in step (e). In effect the model comprises the statistical analyses of steps (b) and (d), which are used to assess the effect of the covariates (e.g. the extent of their contribution to the variance). Thus in the step of using the model, the statistical analyses of the phenotypic factors and/or genetic variants identified for or in and/or determined for the test subject are performed, essentially in the same or analogous or similar way as for the control population.
Methods, including software packages, for performing the analysis of genetic data to identify or to detect covariates are well known in the art and any number of different statistical analysis methods and packages may be used, for example plink, emmax or snptest. In principle, any method of determining the statistical significance of a genetic variant may be used.
Software for analysing raw genotyping data is available and this may be used to provide the genetic data for the statistical analysis step, e.g. GenomeStudio (Illumina Inc.). The analysis of the genetic data may comprise a genome-wide association study (GWAS), according to techniques and principles well known in the art. This analysis step may also comprise a step of imputation of the genetic data. Again methods and software for this are known and available, for example plink, emmax, snptest, Impute2 or Shapeit.
Statistical analysis may be performed on the imputed genetic data using any of the above-referenced statistical software packages, which may include functions for estimating heritability (h²) and performing genetic association analysis by adjusting for pedigree structure, e.g. GenABEL and ProbABEL. Thus such packages may conveniently be used to take account of heritability of a biomarker. However, a step of determining heritability of a biomarker and taking it into account in the genetic data analysis step may be performed in any other known or desired way.
Significance values may be calculated and used as discussed above.
Once the effect of each of the statistically significant phenotypic and genetic covariates on the abundance level of a biomarker has been identified, and a model generated which is capable of adjusting an abundance level of said biomarker, the model may be used to calculate the individualised normal level of a biomarker for a test subject. Thus, as discussed above, in performing the methods of the invention the abundance level of said biomarker in said test subject is determined, and the phenotypic and genetic covariates of said test subject are assessed. Preferably, the determination of the abundance level of said biomarker and the phenotypic and genetic analyses will be performed using the same method as for each of the members of the control population. The abundance level of said biomarker determined for the test subject may be adjusted according to the model generated in the method of the present invention to calculate an adjusted value for the abundance level of said biomarker, based on said test subject's individual phenotypic and genetic covariates. This may then be compared to the prior-determined individualised cut-off value.
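By way of illustration only, the adjustment of a test subject's abundance level might be sketched as below, assuming a purely additive linear model. The covariate names, coefficients, reference values and cut-off are all hypothetical and are not taken from the tables of this document.

```python
def adjusted_abundance(raw_value, subject, coefficients, reference):
    """Remove the modelled covariate effects from a raw abundance level.

    Each covariate contributes coefficient * (subject value - reference value);
    the additive form is an assumption made for this sketch.
    """
    adjustment = sum(
        coefficients[name] * (subject[name] - reference[name])
        for name in coefficients
    )
    return raw_value - adjustment


# Hypothetical model: abundance rises with age and with each copy of a variant.
coefficients = {"age": 0.02, "variant_dosage": 0.5}
reference = {"age": 50, "variant_dosage": 0}
subject = {"age": 70, "variant_dosage": 2}

raw = 6.0
cut_off = 5.0  # hypothetical cut-off on the adjusted scale
adjusted = adjusted_abundance(raw, subject, coefficients, reference)
print(raw > cut_off, adjusted > cut_off)
```

In this hypothetical case the raw value exceeds the cut-off, whereas the adjusted value does not, because the elevation is accounted for by the subject's age and genotype.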
The invention will now be described in more detail in the Examples below with reference to the following drawings in which:
Figure 1 shows the characteristics of the PEA-measurements. (A) Intensities of PEA values and proportion of proteins and individuals above detection limit. In the heatmap, individuals are in columns and proteins in rows. Heatmap colors represent ddCq-values ranging from low (blue) to high (yellow) with measurements below detection limit coded white. (B) Significant covariates in relation to each protein. Covariates are listed from the upper right part of the circle (12 o'clock to 4) and connections illustrate significant (p-value <0.05, Bonferroni adjusted) contributions to PEA variance. (C) PEA to PEA correlations; colored connections represent a squared correlation coefficient (R²) greater than 0.5. The width of the connection reflects the magnitude of the squared correlation coefficient. All correlation coefficients (R) were positive.
Figure 2 shows Manhattan plots of GWAS results. (A) IL-6RA, (B) CXCL5, (C) CCL24 and (D) E-selectin. X-axis labels refer to human chromosomes listed 1-22 and X. P-values were calculated from 1 df Wald statistics chi-square values using 971 individuals.
Figure 3 shows covariates and protein biomarkers. (A) Variance explained by each of the covariates for the set of 77 biomarkers with measurable variability with the 11 most important covariates colored. The combined effect of the remaining covariates is shown in grey, assuming independence in effect between covariates. (B) The percent of the variance explained by the full set of covariates studied for the 77 proteins, using a combined model. (C) Abundance of CXCL10, expressed as ddCq-values, in relation to age when stratified by genotype at rs11548618; AA (grey), AB (red) and BB (blue). Shadowed areas represent the 95% confidence interval in a linear model predicting ddCq from age. (D) Fitted normal distribution densities based on mean and standard deviation in ddCq-values for CXCL10, split by the rs11548618 genotype. (E) Fitted normal distribution densities based on mean and standard deviation in ddCq-values for CCL24 split by the rs6946822 genotype. (F) Fitted normal distribution densities based on mean and standard deviation in ddCq-values for IL-6 split by use of hypertension medications. Only groups where there are at least 10 individuals are shown. C07AB: Beta blocking agents, selective. C08CA: dihydropyridine derivatives. C09AA: ACE inhibitors, plain. (D) - (F): Interquartile ranges indicated with colored boxes above the curves.
Figure 4 shows the number of significant epidemiological associations in proteins with significant case-control difference using PNPPP and unadjusted (Raw) abundance levels. A) Cataract (PNPPP/unadjusted), 5 and 78 associations for PNPPP and unadjusted protein levels respectively. B) Diabetes (PNPPP/unadjusted), 7 and 57. C) Hypertension (PNPPP/unadjusted), 30 and 95. D) Myocardial Infarction (PNPPP/unadjusted), 15 and 60. E) Stroke (PNPPP/unadjusted), 6 and 51.
Figure 5 shows the case-control differences using unadjusted (Raw) values and the PNPPP method. Absolute differences in mean value (case - control) in A) Cataract, B) Diabetes, C) Hypertension, D) Myocardial Infarction, E) Stroke. X-scale is in log2 PEA values; all 125 proteins are stacked, sorted by mean difference in Raw values (left side). Corresponding PNPPP-differences are drawn on the right side. Values below the dashed grey lines have negative sign, i.e. control values are higher than in cases. Black colour indicates significant difference (two-sided Wilcoxon rank test, p-value < 8 x 10⁻⁵).
Figure 6 shows examples of using PNPPP for determining biomarker cut-offs. A) t-PA in Myocardial Infarction and Controls (left y-axis) against weight (x-axis). B) TIM-1 (left y-axis) in Cataract and Controls against SBP (x-axis). C) GDF-15 in Diabetes and Controls (left y-axis) against SBP (x-axis). D) Growth Hormone in Hypertension and Controls (left y-axis) against weight (x-axis). A-D) Solid lines represent PNPPP-values and dashed lines raw values. Blue values are controls and red represent cases. Grey bars (right y-axis) depict disease incidence in %.
Examples
Example 1.
Introduction
We used the highly sensitive and specific Proximity Extension Assay (PEA) (Lundberg et al. 2011. Nucleic acids research 39, e102) to estimate the abundance of 92 established or potential biomarkers in plasma from 1005 individuals from a longitudinal cross-sectional population-based study in Sweden. The biomarkers we analyze here constitute a research panel directed against multiple cancers and also contain proteins implicated in autoimmune diseases such as rheumatoid arthritis and Graves' disease. PEA combines two dedicated antibodies with a real-time qPCR reaction to achieve high specificity and a wide dynamic range. This technology can be multiplexed without introducing crosstalk, while still maintaining its high specificity and sensitivity. We first determine the effect of a wide range of clinical variables and lifestyle factors including age, sex, blood pressure, blood group or BMI, medication and smoking, on biomarker levels. Then we study the heritability of each biomarker, and using high-resolution genetic SNP array data and whole exome sequencing we perform a genome-wide association study (GWAS) for each biomarker. Through integration of genetic, clinical and lifestyle data we identify the set of biomarker-specific factors that can be used to determine appropriate individual clinical cut-offs, and thereby enable a more efficient use of each biomarker in personalized cancer management.
Methods
Samples
The Northern Sweden Population Health Study (NSPHS) was initiated in 2006 to provide a health survey of the population in the parish of Karesuando, county of Norrbotten, Sweden, and to study the medical consequences of lifestyle and genetics. This parish has about 1,500 inhabitants who meet the eligibility criteria in terms of age (≥15 y), of whom 719 individuals participated in the study (KA06 cohort). As a second phase of the NSPHS, another 350 individuals from a neighboring village (Soppero) were recruited in 2009 (KA09 cohort). For each participant in the NSPHS, blood samples were taken (serum and plasma) and stored at -70°C on site. Both the 2006 and 2009 samples used in this study have undergone 2 freeze-thaw cycles prior to the measurements carried out here. DNA has been extracted for genetic analyses and detailed descriptions of this study have been published elsewhere (Johansson, A. et al. 2009. Hum. Mol. Genet. 18, 373-380; Igl, W. et al. 2010. Rural and remote health 10, 1363; Enroth, S. et al. 2013. Int. J. Circumpolar Health 72). A questionnaire was used to collect data on medications and lifestyle. The questionnaire was filled in at the local health care center in the presence of the local district nurse. Notably, around 15% of the participants of the study adhere to a traditional lifestyle (TLS) based on reindeer herding and crafts. Differences in e.g. diet in this group compared to the group with a lifestyle typical of more industrialized regions have been shown to increase levels of circulating blood lipids, which motivates including TLS adherence as a covariate.
Ethical considerations
The NSPHS study was approved by the local ethics committee at the University of Uppsala (Regionala Etikprövningsnämnden, Uppsala, 2005:325) in compliance with the Declaration of Helsinki. All participants gave their written informed consent to the study including the examination of environmental and genetic causes of disease. In cases where the participant was not of age, a legal guardian also signed. The procedure that was used to obtain informed consent and the respective informed consent form has recently been discussed in light of present ethical guidelines.
Multiplexed proximity extension assay
Protein levels in plasma were analyzed using the Olink Proseek Multiplex Oncology 1 96x96 kit and quantified by real-time PCR using the Fluidigm BioMark™ HD real-time PCR platform as described earlier (Assarsson, E. et al. 2014. PLoS One 9, e95192). In brief, for each measured protein, a pair of oligonucleotide-labelled antibody probes binds to the targeted protein and, if the two probes are in close proximity, a PCR target sequence is formed by a proximity-dependent DNA polymerization event and the resulting sequence is subsequently detected and quantified using standard real-time PCR. Each plate contains 96 wells whereof 92 are samples, 1 is a negative control and 3 are positive controls (spiked in IL-6, IL-8 and VEGFA). Each sample is also spiked with 2 incubation controls (green fluorescent protein (GFP) and phycoerythrin (PE)), one extension control and one detection control. These controls are used to determine the lower detection limit (negative control) and to normalize the measurements into delta delta Cq (ddCq) values according to the following formulae:

$$\mathrm{ddCq} = \mathrm{dCq}_{\max} - \widetilde{\mathrm{dCq}}_{\mathrm{analyte}} \qquad (1)$$

where

$$\widetilde{\mathrm{dCq}}_{\mathrm{analyte}} = \mathrm{Cq}_{\mathrm{analyte}} - \mathrm{Cq}_{\mathrm{Extension\ Control}} \qquad (2)$$

and dCq_max is a per-assay value defined by the manufacturer to give a positive log2-scale. A list of the 92 proteins quantified by the PEA is shown below in Table 4. The ddCq values were then log2-transformed for subsequent analysis. Each PEA (proximity extension assay) measurement has a specified lower detection limit calculated based on negative controls that are included in each run, and measurements below this limit were removed from further analysis. Individual samples where at least one of the internal controls contained an outlier value (n=35) or where too many (> 75%) measurements were below detection limits in any PEA (n=1) were also excluded from further analyses (total n=35 of 1005, 3.5%). We wanted at least 200 observations per protein above detection limit in order to conduct the downstream statistical analyses and therefore proteins with fewer observations were excluded from further analyses. After individual and protein quality control, 77 proteins measured in 970 individuals remained. Out of the removed proteins, 7 (BTC, EPR, IL-2, CA242, ER, G-CSF and SL-1) had 100% of measurements below detection limit. UniProt recommended short names have been used throughout when these are available; otherwise, the assay manufacturer's abbreviations have been used. All assay characteristics including detection limits and measurements of assay performance and validations are available from the manufacturer's webpage.
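The sample-level and protein-level exclusions described above can be sketched as a small filter. The data layout (a dictionary per sample, with None for a measurement below the detection limit), the function name and the order of the two steps are assumptions made for illustration; the internal-control outlier check is not modelled.

```python
def quality_control(measurements, max_missing_fraction=0.75, min_observations=200):
    """Return (kept sample ids, kept protein names) after two QC filters.

    measurements maps sample id -> {protein name: value or None}, where None
    marks a measurement below the detection limit.
    """
    proteins = sorted({p for values in measurements.values() for p in values})

    # Step 1: drop samples with too many measurements below the detection limit.
    kept_samples = {}
    for sample, values in measurements.items():
        missing = sum(1 for p in proteins if values.get(p) is None)
        if missing / len(proteins) <= max_missing_fraction:
            kept_samples[sample] = values

    # Step 2: keep proteins with enough detectable observations among kept samples.
    kept_proteins = [
        p for p in proteins
        if sum(1 for v in kept_samples.values() if v.get(p) is not None)
        >= min_observations
    ]
    return sorted(kept_samples), kept_proteins


# Hypothetical toy data; the observation threshold is lowered from 200 to 2
# so that the effect is visible with three samples.
data = {
    "s1": {"IL-6": 3.1, "CXCL5": 5.0, "BTC": None, "IL-8": 2.0},
    "s2": {"IL-6": 2.9, "CXCL5": None, "BTC": None, "IL-8": 1.5},
    "s3": {"IL-6": None, "CXCL5": None, "BTC": None, "IL-8": None},
}
samples, proteins = quality_control(data, min_observations=2)
print(samples, proteins)
```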
Genotype data
The KA06 and KA09 cohorts have previously been genotyped on the Illumina Infinium HapMap300v2 BeadChip (308,531 markers) and Illumina Human OmniExpress BeadChip (731,442 markers) arrays respectively as described earlier (Johansson, A. et al. 2013. Proc. Natl. Acad. Sci. USA 110, 4673-4678). In brief, the specific KA06 and KA09 data were quality checked separately, leaving 691 individuals with 306,086 SNPs at 99.50% genotyping rate and 346 individuals with 631,503 SNPs at 99.88% genotyping rate respectively. 4 individuals were present in both cohorts and these were removed from the KA06 data. Here, we also genotyped the individuals from both cohorts (n = 1059) on the Illumina Human Exome BeadChip containing 247,901 SNPs, insertions and deletions primarily selected to have coding changes. The genotype calling was done with the software GenomeStudio 2011.1 (Illumina Inc.) using a Project Sample generated Cluster File (PCF) as recommended by the manufacturer. The Exome chip data was quality controlled requiring 95 and 98% genotyping rate on marker and individual levels respectively and a Bonferroni-corrected Hardy-Weinberg cut-off of 0.05, leaving 242,519 markers at a total genotyping rate of 99.94% in the 1033 unique individuals previously genotyped. This analysis was carried out using custom R-scripts and PLINK (v1.07) (Purcell, S. et al. 2007. American journal of human genetics 81, 559-575).
Exome sequencing
We selected 100 individuals, 68 from KA06 and 32 from KA09, for Whole Exome Sequencing using Agilent's SureSelect system for exome capture and the SOLiD 5500xl instrumentation for sequencing. Each sample was sequenced to at least 30X coverage. The individuals were selected to represent as much genetic variation of the cohort as possible. Alignment was done using the LifeScope software and SNPs and INDELs were called using diBayes. For each position (n>1.5M) where any individual so far sequenced at the Uppsala Genome Centre had a called SNP or INDEL, we then checked our 100 individuals for coverage in order to differentiate between missing and reference calls. These positions were included to maximize the overlap with the 1000 Genomes reference panels to ensure proper imputation using two reference panels. Reference calls for SNPs were made if there were at least 3 reference sequence reads with unique start points and a maximum of 5% reads with non-reference at that position. Reference calls for INDELs were made if there were no reads at all without the reference call. All other calls were set to missing. We then required at most 5% missing call rate per SNP or INDEL. This resulted in 83,568 SNPs with non-zero MAF at 98.74% total genotyping rate and 38,290 INDELs with a total genotyping rate of 99.45% and an additional 350k positions with reference calls only. We then required a genotyping rate of 95% at both individual and marker level and a Bonferroni-corrected Hardy-Weinberg cut-off of 0.05, which resulted in 468,630 markers at a total genotyping rate of 98.79%.
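The rule for making a reference call at a SNP position can be sketched as follows. The input representation (a list of start coordinates for reads supporting the reference, plus a count of non-reference reads) and the reading of "5% reads" as a fraction of all reads covering the position are assumptions made for this sketch.

```python
def call_snp_reference(ref_read_starts, n_nonref_reads,
                       min_unique_starts=3, max_nonref_fraction=0.05):
    """Return "reference" if the position qualifies, otherwise "missing".

    ref_read_starts holds the start coordinate of every read that carries the
    reference allele; duplicates share a start point and count only once.
    """
    total_reads = len(ref_read_starts) + n_nonref_reads
    if total_reads == 0:
        return "missing"
    unique_starts = len(set(ref_read_starts))
    nonref_fraction = n_nonref_reads / total_reads
    if unique_starts >= min_unique_starts and nonref_fraction <= max_nonref_fraction:
        return "reference"
    return "missing"


# Hypothetical positions.
print(call_snp_reference([100, 101, 105, 105], 0))   # three unique start points
print(call_snp_reference([100, 100, 100], 0))        # one unique start point
print(call_snp_reference(list(range(20)), 2))        # 2 of 22 reads non-reference
```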
Imputation of genotype data
We created an in-house reference panel to be used simultaneously with the 1000 Genomes reference panel (1000 Genomes Project Consortium. 2010. Nature 467, 1061-1073; IMPUTE2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference) 2012). The in-house panel was based on the 100 exome-sequenced individuals by merging the SNPs and INDELs called from the exomes with the SNPs common between the Illumina Human HapMap300v2 (used in the KA06 cohort) and the Illumina Human OmniExpress (used in the KA09 cohort), n = 182,916, and all the markers from the Illumina Human Exome chip. In this step there was no additional filtering done on minor allele frequency in order to maximize the overlap with the SNPs in the 1000 Genomes panel. The total number of markers in the in-house reference panel was 847,855. The reference haplotypes were created using in-house R-scripts and PLINK (v1.07) and phased using SHAPEIT (v2.r) (Delaneau, O. et al. 2013. Nature methods 10, 5-6). Data was then imputed for the two cohorts separately using IMPUTE2 (v2.3.0) with a pre-phasing approach (Howie et al. 2012. Nature genetics 44, 955-959). The input data was phased chromosome-wise using SHAPEIT (v2.r). In addition to our in-house panel we also utilized the 1000 Genomes Phase I integrated variant set (National Center for Biotechnology Information build b37, March 2012) accessed from the IMPUTE Web resource (IMPUTE2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference) 2012). IMPUTE2 was run with the default parameters with the following changes: "--merge-ref-panels" and "-k_hap 500 200", the latter instructing IMPUTE2 to use 500 haplotypes from the 1000G reference panel and all 200 from our in-house panel. Data was imputed in chunks of around 5M bases ensuring at least 200 genotyped SNPs in each chunk. No chunks spanned across the centromeres. The pseudoautosomal and non-pseudoautosomal regions on chromosome X were handled separately.
The resulting data was filtered on marker level by requiring IMPUTE's 'info' score >0.3 in both the KA06 and KA09 cohorts before merging. Merging of the imputed data was done using GTOOL (v0.7.5) (Freeman, C., Marchini, J. 2013. GTOOL (http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html)) requiring a dosage threshold above 0.9 in at least 95% of the individuals. The resulting merged data was further filtered using QCTOOL (v1.3) (Band, G., Marchini, J. 2013. QCTOOL (http://www.well.ox.ac.uk/~gav/qctool/)) requiring a Bonferroni-corrected Hardy-Weinberg cut-off of 0.05 and a minor-allele frequency corresponding to at least one chromosome in the whole material. The final dataset included 4,840,842 SNPs and INDELs.
ABO blood group assignment
We assigned blood groups according to the ABO-system to our samples based on their genetic status at 4 genotyped SNPs (rs505922, rs8176746, rs8176704 and rs574347) in the region of the ABO gene. These four SNPs allow for accurate assignment of the A/B/O groups, subtyping of A into A1 and A2, and subtyping of O into O01 and O02. Using this approach we successfully assigned blood groups to 97.9% of our samples.
Statistical analyses
All statistical analysis was conducted in R (R Development Core Team 2012. R Foundation for Statistical Computing) and illustrations were produced using R and the Circos software (Krzywinski, M. et al. 2009. Genome Res. 19, 1639-1645). Correlation between proteins and relevant variables was calculated separately for each measured protein by fitting a generalized linear model using the 'glm' function including all covariates simultaneously. The significance of each covariate's contribution to the total variance was estimated using an ANOVA approach as implemented by the 'anova.glm' function on the resulting generalized linear model. Covariates were considered significant for a specific protein if their Bonferroni-adjusted p-values were below 0.05 (p-value < 3.16 x 10⁻⁴, 0.05/158). Each PEA measurement was individually adjusted for significant covariates and rank-transformed to normality by using the 'rntransform' function available from the R-package GenABEL (v1.6.7) (Aulchenko, Y. S. et al. 2007. Bioinformatics 23, 1294-1296). Correlations between pairs of PEA measurements were calculated, on the adjusted and rank-transformed values, using the 'cor' function applying Spearman's Rho statistics on pairwise complete observations.
The NSPHS is a population-based study and includes many relatives, and special care has to be taken to avoid relational biases. Therefore, all genetic association calculations were carried out using the GenABEL or ProbABEL (Aulchenko, supra) software suites, which have been developed to enable statistical analyses of genetic data of related individuals. These packages include functions for estimating the narrow-sense heritability (h²) and performing genetic association analyses (Chen, W. M., Abecasis, G. R. 2007. American journal of human genetics 81, 913-926) by adjusting for pedigree structure. In brief, the heritability of each trait (protein abundance) is estimated using a polygenic model as implemented by the 'polygenic' method in the GenABEL R-package. This heritability estimate represents the variance in the phenotype that is explained by genetic factors and is estimated by maximizing the likelihood of the trait-data under a polygenic model including fixed effects such as covariates and relatedness among individuals (kinship). The result of the 'polygenic'-call contains the inverse variance-covariance matrix of the estimates and trait residuals and is included in the downstream association calculations together with the posterior genotypic probabilities. Specifically, these calculations are performed using the ProbABEL program using the '--mmscore' option. Kinship matrix calculations were carried out using the autosomal markers shared (n = 182,916) between the two types of genotyping arrays used in the KA06 and KA09 cohorts. Contribution of single SNPs to phenotypic variation on the unadjusted ddCq-values was calculated in R by fitting a linear model (using 'lm') with ddCq values as response and the posterior genotypic probabilities as terms, and the fraction of variance explained was determined from the resulting model using 'summary.lm'.
Fraction of variance explained by a single SNP in the adjusted phenotypes including effects of relatedness was estimated by dividing the resulting chi-square test score (from ProbABEL) by the number of samples used.
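Written as a formula, with symbols introduced here only for illustration, the estimate described above is

$$\hat{f}_{\mathrm{SNP}} = \frac{\chi^2}{N}$$

where $\chi^2$ is the 1 df chi-square score for the SNP and $N$ is the number of samples used. As a hypothetical worked example, a score of $\chi^2 = 30$ in $N = 970$ individuals would correspond to $30/970 \approx 0.031$, i.e. about 3.1% of the variance in the adjusted phenotype.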
The KA06 cohort was used as discovery cohort in the genome-wide association studies (GWAS) and KA09 as replication cohort. Since we cannot rule out protein degradation effects due to differences in storage time between the two cohorts, this split is preferable to a random split where degradation effects could affect the association analysis. Strict Bonferroni-adjusted p-values (p-value < 1.03 x 10⁻⁸, 0.05/4,840,842) were used to report significance in the discovery cohort and the replication cohort (p-value < 0.05 / number of significant SNPs in the discovery cohort). We also ran a combined analysis with the same cut-off as used in the discovery phase. For all proteins with replicated hits a conditional analysis was carried out in which the genetic associations were re-calculated using the dosage values of the top-ranking SNP as covariate. This analysis was only run in the combined material and on chromosomes that had hits that replicated in the discovery-replication phase, and p-value < 5 x 10⁻⁸ was used as cut-off.
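The two significance thresholds of the discovery-replication design follow directly from the numbers given above and can be sketched as below. The function names and the example of 10 discovery hits are hypothetical.

```python
def gwas_thresholds(n_markers, n_discovery_hits, alpha=0.05):
    """Return (discovery threshold, replication threshold).

    The discovery threshold corrects for all tested markers; the replication
    threshold corrects only for the SNPs significant in the discovery cohort.
    """
    discovery = alpha / n_markers
    replication = alpha / n_discovery_hits if n_discovery_hits > 0 else None
    return discovery, replication


def is_replicated(p_discovery, p_replication, n_markers, n_discovery_hits):
    """True if a SNP passes both the discovery and replication thresholds."""
    discovery, replication = gwas_thresholds(n_markers, n_discovery_hits)
    if replication is None:
        return False
    return p_discovery < discovery and p_replication < replication


# 4,840,842 imputed markers as stated above; 10 discovery hits is hypothetical.
discovery_cutoff, replication_cutoff = gwas_thresholds(4_840_842, 10)
print(discovery_cutoff, replication_cutoff)
```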
Results
Biomarker measurements
The abundance of 92 proteins, representing a panel of established and potential biomarkers for cancer and inflammation, was measured in blood plasma of 1005 individuals from the Northern Sweden Population Health Study (NSPHS), using PEA and qPCR. A total of 77 of the proteins had levels above the detection limit in at least 80% of our samples, with 91.3% (70651 of 77385) of qPCR reactions being successful. For the remaining 15 proteins, 96.8% (14598 of 15075) of the protein levels were below the detection limit. Also, 96.5% (970 of 1005) of our samples passed quality control on an individual level. The abundance and distribution of the normalized measurements (ddCq-values) of all the proteins in all samples are illustrated in Figure 1A, with estimates under the detection limits colored white. Details on normalization and initial quality control are given below in the Methods section. The proteins with little or no measurable abundance in our samples were: SL-1 (Stromelysin-1), GM-CSF (Granulocyte-macrophage colony-stimulating factor), ER (Estrogen receptor), CA242 (Cancer Antigen 242), IL-2 (Interleukin-2), EPR (Epiregulin), BTC (Betacellulin), IL-4 (Interleukin-4), IFN-gamma (Interferon gamma), IL-7 (Interleukin-7), TNF (Tumor necrosis factor), CEA (Carcinoembryonic antigen-related cell adhesion molecule 5), MYD88 (Myeloid differentiation primary response protein MyD88), MUC-16 (Mucin-16) and REG-4 (Regenerating islet-derived protein 4). It cannot be ruled out that storage time and protein degradation could be an influencing factor for these 15 proteins, and previous studies have quantified this specifically for CEA.
Epidemiological associations
To study the effect of clinical and lifestyle factors, we selected 158 phenotypic covariates, including age, sex, blood pressure, BMI, tobacco use, medication, lifestyle (occupation) and sample collection round (2006 or 2009) from the comprehensive set of clinical data available for NSPHS. A multiple linear regression model showed a total of 18 phenotypic covariates to have a significant effect (p-value < 0.05, Bonferroni adjusted) on one or more of 52 of the 77 proteins (Table 1). Factors such as age or weight influenced a broad range of proteins, while medication affected specific proteins (Figure 1B). Notably, smoking affected two proteins, WFDC2 (WAP four-disulfide core domain protein 2) and IL-12 (Interleukin-12), while the traditional Swedish moist tobacco product, "snus", did not have any significant effects, in line with a previous study on effects of tobacco use on DNA methylation. We also found large effects (nominal p-value ranging from 1.8 x 10^-4 to 2.3 x 10^-7) of ABO blood group on 3 proteins: E-selectin, PECAM-1 (Platelet endothelial cell adhesion molecule) and TIE2 (Angiopoietin-1 receptor). The connection between E-selectin and blood groups is known, but the effect on PECAM-1 and TIE2 has not been described previously. Medication use in NSPHS had been investigated using a questionnaire and the reported medications were annotated using the Anatomical Therapeutic Chemical classification system (ATC). Among the commonly used medications, Dihydropyridine derivatives (ATC: C08CA, 54 users), often used to treat hypertension, were correlated with increased IL-6 (Interleukin-6) levels, while Glucocorticoids (ATC: R03BA, 26 users) lowered both Basigin and HGF receptor (Hepatocyte growth factor receptor) levels. Apart from C08CA, no other antihypertensive treatment was correlated with high IL-6 levels.
Interestingly, the use of Selective beta-2-adrenoreceptor agonists (ATC: R03AC, 13 users), which are commonly found in asthma inhalers, decreased the level of circulating VEGF-D (Vascular endothelial growth factor D), which is implicated in the metastasis of non-small cell lung cancer. The largest fraction of variance explained by a single clinical or environmental covariate was for age, which accounted for 27% of the variation seen for WFDC2. The influence of age and smoking on WFDC2 has previously been reported, but we found the fraction of variance explained by smoking in our data to be only 1.7%, which is much less than for systolic blood pressure (14.3%) or loop-diuretics (ATC: C03CA, plain Sulfonamides, 7.2%). However, these covariates are not necessarily independent, as blood pressure and use of medication are related to age.
Correlations between biomarkers
Inter-biomarker correlation was investigated using abundance levels adjusted for significant clinical and lifestyle covariates. These were then rank-transformed into normally distributed values and used to identify 12 pairs with a squared Spearman's rho (R2) greater than 0.5 (Figure 1C). The highest correlation was found between CASP-3 (Caspase-3) and CD69 (Early activation antigen CD69) (R2 = 0.85). CASP-3 was also highly correlated with EGF (Epidermal growth factor, R2 = 0.81), which in turn was highly correlated with CD69 (R2 = 0.78). The strong correlation between some of the biomarkers does not appear to be reflected at the transcription level. For instance, the Illumina Body Map suggests that CD69 and Caspase-3 are both expressed in leukocytes, lymph nodes and adrenal glands (i.e. 3 of 16 investigated tissues). In data from leukocytes of 80 controls there was only a weak correlation between the expression levels of CD69 and CASP-3 (R2 = 0.13), suggesting that the high correlation observed at the protein level is either due to post-transcriptional regulation, e.g. epigenetic regulation, or due to expression patterns in distinct cell types. Several of the 12 pairs that were highly correlated were proteins with similar functions, such as CXCL9, CXCL10 and CXCL11, and TNF-R1 and TNF-R2, while in other cases apparently unrelated proteins were highly correlated. These correlations may reflect as yet unknown patterns of co-regulation, and bring into question their value as independent biomarkers.
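The two operations named above, rank transformation to normality and Spearman correlation, can be sketched in plain Python as follows. This is an illustration under stated assumptions: the (rank - 0.5)/n convention for the normal quantiles is a common choice and is not claimed to match the software used in the study, and all function names and data are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_to_normal(values):
    """Map values to standard normal quantiles of (rank - 0.5) / n (assumed convention)."""
    n = len(values)
    return [NormalDist().inv_cdf((r - 0.5) / n) for r in average_ranks(values)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical adjusted abundance levels for two biomarkers in five samples.
marker_a = [1.0, 2.0, 3.0, 4.0, 5.0]
marker_b = [2.0, 4.0, 8.0, 16.0, 32.0]
rho = spearman_rho(marker_a, marker_b)
print(f"rho = {rho:.2f}, squared rho = {rho ** 2:.2f}")
```

Pairs would then be retained when the squared coefficient exceeds the chosen cut-off (0.5 in the text).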
Heritability and genetic association
All 970 individuals that passed the QC were used to estimate the heritability for the 77 proteins with measurable levels by evaluating the co-segregation of the protein levels with the relatedness among individuals using a polygenic model (see Methods for details). In 75% (58/77) of the proteins, the levels were found to be heritable (Bonferroni adjusted p-value < 0.05), with heritability ranging from 0.19 to 0.78 and the highest heritability for CCL24 (C-C motif chemokine 24). Thus, for a majority of the protein biomarkers, circulating levels are significantly affected by the individual's genetic constitution. To determine the nature of the genetic effects on protein abundance, we performed association analyses using over 4.8M SNPs and INDELs identified by direct genotyping and whole exome sequencing, followed by high-quality imputation. In this analysis, each of the 77 proteins was adjusted for the significant clinical and lifestyle variables (Table 1) and the samples were split into a discovery and a replication cohort based on sample collection round (see Methods for details). In the discovery phase, we identified 15 proteins with genome-wide significant hits (nominal p-value down to 1.1 x 10^-40, Table 2), employing a Bonferroni corrected p-value cut-off of 0.05. Of these, 14 had at least one replicated association (nominal p-value down to 1.1 x 10^-20, Table 2). In all, 175 genome-wide significant hits were detected in the discovery phase, out of which 101 replicated. A combined analysis of all individuals revealed a total of 226 genome-wide significant hits in 14 proteins, with p-values down to 4.4 x 10^-58, and a single marker explaining as much as 26.6% of the phenotypic variation seen after adjusting for the significant clinical and lifestyle factors (Table 2). A detailed description of each of the 226 hits, including overlaps with previous associations with any phenotype or trait, is given in Table 5.
IL-6RA (Interleukin-6 receptor subunit alpha) showed the strongest association, and the association was caused by one or very few SNPs located in the gene that encodes the respective protein, similar to the case for the majority of the biomarkers (Figure 2A). Conditioning on the top hit revealed that four of the proteins, CCL24, MIC-A, CXCL5 (C-X-C motif chemokine 5) and Ep-CAM (Epithelial cell adhesion molecule), had hits independent of the highest-ranking SNP (Table 3). For CXCL5 (Figure 2B) and Ep-CAM, the second SNP was located on a different chromosome, while for CCL24 (Figure 2C) and MIC-A the second SNPs were located close (< 40kb, Table 3) to the first hit. The second SNP for Ep-CAM explained 6.5% of the variance of the unadjusted phenotype, as compared to 4.9% for the top-ranking SNP. For the other 3 proteins the fraction of variance explained by the second-ranking SNPs was small compared to the top-ranking SNP. For 12 of the 14 biomarkers with a strong genetic association (CCL24, CD40-L (CD40 ligand), CXCL5, CXCL10, Ep-CAM, IL-12B, IL-17RB (Interleukin-17 receptor B), IL-6RA, hK11 (Kallikrein-11), MIA (Melanoma-derived growth regulatory protein), MIC-A (MHC class I polypeptide-related sequence A) and VEGF-D), the top SNPs were located in cis with the gene encoding the protein. We compared our 226 hits with eQTLs as reported by the NCBI's eQTL database and found overlapping SNPs in 11 cases. These were reported for IL-17RB (1 SNP) and CCL24 (1 SNP) in liver and for MIC-A (9 SNPs) in lymphoblastoids (Table 5). Since expression is cell type specific and eQTL studies only exist for a limited set of tissues, the number of SNPs found here to be eQTLs is likely to be an underestimate. For two of the proteins (CCL19 (C-C motif chemokine 19) and E-selectin), the genome-wide significant hits were located at loci other than the one coding for the protein (Table 3).
The top hits for CCL19 were located in the MHC (major histocompatibility complex) class II gene cluster, encoding molecules present on antigen-presenting cells and B lymphocytes. CCL19 is a chemokine implicated in inflammatory and immunological responses, but also in normal lymphocyte recirculation and homing. Higher serum levels of CCL19 have been associated with poor prognosis in AIDS patients. For E-selectin, the circulating level is known to be affected by ABO blood group. Here, even after correction for blood group at the A/B/O level, the top hits in the GWAS were located within the ABO gene, which determines the blood group (Figure 2D), with our top hit (rs507666) being a perfect tag SNP for the A1 subtype, suggesting that the specification of the A group into A1 and A2 is involved. The dependency of the E-selectin levels on ABO status that we observe is consistent with the pattern described earlier, where individuals with the O blood type have the highest levels. This is in contrast to the patterns for TIE2 and PECAM-1, where individuals carrying the B or AB blood group have the highest values. For the other proteins (Ep-CAM, CCL19 and CXCL5) we found no evidence, such as eQTLs or common pathways, linking the loci that did not code for the protein to the gene coding for the protein. In summary, for a large number of the biomarkers significant genetic effects on protein levels could be identified.
Personalized biomarker-specific covariate profiles
The relative importance of individual genetic, clinical and lifestyle factors on the abundance differed dramatically between the 77 biomarkers (Figure 3A). Some biomarkers were affected by strong genetic factors while others were affected mainly by environmental or clinical factors. These variables are not always independent, such as blood pressure and use of medication, which are both related to age. This can be seen in that the total fraction of observed variance explained, as determined by a combined model including all 158 covariates plus the top-ranking SNP and the top-ranking SNP from the conditional analysis, lies between 0.20 and 0.56 (Figure 3B), whilst the sum of the variance explained by individual covariates in some cases reached above 1 (Figure 3A). Figure 3 illustrates the main results of the study. Most of the 77 biomarkers showed large variation in abundance between individuals, but they differed considerably with regard to the specific genetic, clinical or lifestyle factors involved. At one extreme, IL-6RA levels were affected most strongly by the individual's genotype and only a very small fraction of the variance was explained by other covariates, with BMI being the strongest (1.0%). Even assuming that all 158 covariates, besides the top-ranking SNP, contributed independent effects, the sum of the fraction of the variance of IL-6RA levels explained by these factors was less (21.0%) than the single genetic effect (21.3%). At the other end of the spectrum, HGF (Hepatocyte growth factor) did not show a significant heritability, and none of the genetic markers reached genome-wide significance. However, 17 other covariates were nominally significant (p-value < 0.05) for HGF and four (weight, sample round, systolic blood pressure and age) remained significant after correction for multiple testing. These covariates accounted for 3.3, 2.9, 14.1 and 19.3%, respectively, of the variance.
In addition, the use of platelet aggregation inhibitors (ATC: B01AC) and loop-diuretics (ATC: C03CA) explained 5.7 and 7.1% of the variance observed in the unadjusted ddCq-values, respectively, while the top-ranking SNP only accounted for 1.6%. In the middle part of the distribution in Figure 3A, we find biomarkers that were less affected by the genetic, clinical or environmental factors studied, possibly reflecting limited non-disease related variability.
Information on the set of important variables for each biomarker can be used to reduce the non-disease related variation. For instance, soluble CXCL10, which shows elevated levels in patients with a number of autoimmune related diseases, has previously been shown to be associated with systolic blood pressure. Here, we confirm the correlation with systolic blood pressure, which explains 5.6% of the variability, but we also found a significant correlation with age (9.0% of variability) and a very strong effect of genetic variants (35.4% of variability). Stratifying individuals on age did not appreciably reduce the range of variability (Figure 3C). However, stratifying on the basis of the genotype at the top hit (rs11548618) had a considerable effect on reducing the variability (Figure 3D). In the case of CCL24, the carriers of the reference allele of rs6946822 had a level 209% (linearized ddCq) of the average value of the homozygous carriers of the alternative allele (Figure 3E). The effect of medication on the abundance can be demonstrated by IL-6 (Figure 3F), where the distribution of protein level was clearly shifted upwards with the use of Dihydropyridine derivatives (ATC: C08CA) found in drugs prescribed for treatment of hypertension or angina pectoris. Interestingly, this was the only hypertension medication that was correlated with higher IL-6 levels; neither ACE-inhibitors (ATC: C09AA), selective beta-blocking agents (ATC: C07AB) nor a combination of these mediated this effect (Figure 3F). This implies that detailed medication information may be needed for proper use of this biomarker.
Discussion
We have shown that for 72 of the 77 biomarkers studied, the circulating plasma levels are strongly associated with genetic, clinical or lifestyle factors. Most biomarkers are highly heritable, and for 14 we identified strong genetic associations, with the top SNP explaining as much as 36% of the variability in protein abundance between individuals. For these biomarkers, stratifying patients based on their genotype may dramatically enhance the ability to detect deviations from normal circulating levels. A number of non-genetic factors also show a strong effect on biomarker levels, with age, systolic blood pressure and weight affecting a large number of the biomarkers. As cancer incidence increases with age, so does the use of prescribed medications (Spearman's rho, R2 = 0.29). Interestingly, we identified medication as an important clinical variable that should be considered when using the biomarkers for diagnosis or risk prediction. For instance, Basigin expression has been associated with shorter survival and proposed as a biomarker for adjuvant therapy in colorectal cancer. Our analysis did not show any significant association of Basigin levels with covariates such as anthropometrics, age, sex or smoking. However, the use of glucocorticoids, commonly found in inhalers used to treat asthma-related conditions, decreased circulating levels of Basigin, thereby possibly masking the need for adjuvant treatment. Our results indicate that when using Basigin as a biomarker in an ageing population, medication history and dosage should be taken into account in order to establish an appropriate clinical cut-off. Another example is IL-6 and the IL-6 receptor (IL-6RA), where we confirm the strong effect of the genetic constitution on the circulating IL-6RA levels. We also show that medications used to treat e.g. hypertension, such as dihydropyridine derivatives, but not ACE-inhibitors or selective beta-blocking agents, cause or maintain an increase in the inflammatory response cascade via high IL-6 levels. IL-6 signaling is important in the pathogenesis of several autoimmune and chronic inflammatory diseases, and antibody-based drugs are used to target the IL-6 receptor in patients with rheumatoid arthritis (RA) in order to dampen the inflammatory response. In clinical practice only two thirds of the patients treated with these drugs respond to the treatment, and factors such as age and medical history have been shown to be predictors of remission and response in RA patients. Future investigations are clearly needed to specifically address the long-term effects of commonly used medications in this perspective.
In a clinical context, circulating levels of CXCL10 have been estimated at 120±83 pg/ml in patients diagnosed with Graves' disease as compared to 72±32 pg/ml in controls, an average increase of 67% not taking genetic and non-genetic covariates into account. By comparison, the average level in individuals in our study carrying the reference genotype for rs11548618 was 178% (linearized ddCq) of the level in heterozygous individuals, clearly illustrating the relative importance of carrier genotype versus the disease state on biomarker levels. Previous efforts have identified genetic susceptibility loci for Graves' disease but none of these overlap with the loci associated with the CXCL10 levels, suggesting that in this case the causal effects of the disease are not directly linked to the biomarker levels. Another strong genetic effect was observed for CCL24, where carriers of the reference allele of rs6946822 have a level 209% (linearized ddCq) of the average value of the homozygous carriers of the alternative allele (Figure 3E). The worldwide minor allele frequency of rs6946822 is listed in dbSNP as 0.46, implying that every 5th individual will be homozygous, similar in frequency to the individuals who smoke in the U.S. today, demonstrating the large, common genetic effects on biomarker variation found in the population today.
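The step from a minor allele frequency of 0.46 to roughly every 5th individual being homozygous can be checked with a short calculation. The sketch below assumes Hardy-Weinberg proportions, which the text does not state explicitly; the function name is hypothetical.

```python
def hardy_weinberg_genotype_frequencies(minor_allele_freq: float) -> dict:
    """Expected genotype frequencies under Hardy-Weinberg equilibrium (assumed)."""
    if not 0.0 <= minor_allele_freq <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = minor_allele_freq
    p = 1.0 - q
    return {
        "homozygous_major": p * p,
        "heterozygous": 2.0 * p * q,
        "homozygous_minor": q * q,
    }


# Minor allele frequency of rs6946822 as quoted in the text.
freqs = hardy_weinberg_genotype_frequencies(0.46)
print(f"homozygous for the minor allele: {freqs['homozygous_minor']:.4f}")
print(f"about one in {1 / freqs['homozygous_minor']:.1f} individuals")
```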
We also find biomarkers that are not significantly affected by any of the variables examined, rendering them less susceptible to variability induced by non-disease related factors. Although we have investigated a large number of genetic, clinical and lifestyle factors, they altogether explain at most 56% of the variation in biomarker levels between individuals. The remaining variance must reflect other factors, or non-additive interaction between some of the factors studied, and their identification could further increase the utility of biomarkers by reducing sources of variation unrelated to disease state. For example, CCL24 had a heritability of 0.78, indicating that additional genetic loci might affect protein levels. For 15 of the biomarkers the vast majority of abundances were below the detection limits in our cohort. Several of these could represent ideal biomarkers without major presence in normal plasma and thus with no influencing genetic or lifestyle factors. Among these was for instance MUC-16 (or CA125) that is used clinically as a test for ovarian cancer and also potential biomarkers such as REG-4 that has been proposed as a biomarker for pancreatic ductal adenocarcinoma.
This study identifies several previously unknown genetic and lifestyle factors influencing the circulating plasma levels of disease biomarkers in a population-based cohort, but has its limitations. First, we have a relatively small sample size (N = 1005) for genetic association studies. Despite this fact, we identify and replicate 12 novel associations of large effect on the disease biomarkers. While large GWAS consortia have identified hundreds of genetic variants associated with variation in disease related phenotypes, most of these SNPs are common and have such small effect sizes that they are not clinically useful.
Personalized cancer medicine is on a trajectory from long-awaited promise to existing reality, with clinical applications for a small number of cancers with directed treatments. In chronic myelogenous leukemia (CML), patients with a specific translocation respond well to treatment with a tyrosine-kinase inhibitor blocking an enzyme that in turn triggers signaling cascades. Also, patients with non-small-cell lung cancer (NSCLC) and a gene-fusion mutation have higher drug response rates than those lacking this gene fusion. However, the number of cancer biomarkers in clinical use is still limited. In the set of biomarkers studied here, we identified a surprisingly strong genetic effect on some biomarkers after correcting for clinical (medication) and lifestyle variables. Likewise, other biomarkers were strongly affected by environmental, lifestyle or clinical factors. Genotyping of selected polymorphisms with a strong effect on abundance appears to be crucial for about 20% of the biomarkers in our study, while lifestyle and medication are important covariates for the majority. In the daily clinical routine, we envision that analysis of broad-spectrum biomarkers could be used as a follow-up analysis for patients, or for screening of risk groups. Our analyses indicate that such tests would be accompanied by collecting additional relevant information such as anthropometrics, medication and genotyping of specific polymorphisms known to affect the baseline of these biomarkers. The clinical laboratory that performs the biomarker analysis would have documentation on which cofactors significantly influence the baseline levels, and could advise the physician on how to interpret the outcome of the test. Our results imply that using biomarker-specific covariate profiles will make it possible to determine more precise, individualized clinical cut-off levels.
This in turn could lead to a more efficient use of protein biomarkers for early detection of abnormal levels and for increased sensitivity and specificity in disease diagnosis. By employing biomarker-specific profiles of covariates it will be possible to fully harness the potential of existing and novel biomarkers for disease diagnosis and management.
Example 2. Assessment of the effect of phenotypic and genetic factors on the abundance levels of proteins associated with cardiovascular disease.
A similar study was performed to assess the effect of various phenotypic and genetic factors on the abundance level of proteins associated with cardiovascular disease. The abundance of established or potential biomarkers associated with cardiovascular disease was measured in plasma samples collected from the NSPHS control population using PEA, as in Example 1. A list of proteins assessed in this study is shown in Table 6. Phenotypic and genetic covariates that significantly affected the level of a biomarker were identified in the same way as in Example 1.
Example 3. Example calculations of an individualized model.
Once the effect of each covariate has been established for a given biomarker, it is possible to generate a model which is capable of adjusting the abundance level of a biomarker in a sample for the effect of the phenotypic and/or genetic covariates. This model may subsequently be used to determine a value for the normal level for the abundance of a biomarker in a sample taken from a test subject. An example of such a model (for FABP4) is shown below. A 'new' value can be calculated from the measured (or 'observed') value for a biomarker in a test subject, adjusted to take into account phenotypic and genetic covariates.
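The two-step adjustment of this example can be sketched in Python. The three coefficients are the FABP4 values stated in the model of this example; the function and variable names, the coding of sex as 1 for female and 0 otherwise, and the subject's measurements are illustrative assumptions.

```python
# FABP4 coefficients as given in the model of this example (no intercept is stated).
FABP4_COEFFICIENTS = {
    "length_cm": -0.03638,
    "weight_kg": 0.03613,
    "is_female": 0.36965,
}


def part_explained_by_non_disease_model(length_cm, weight_kg, is_female,
                                        coefficients=FABP4_COEFFICIENTS):
    """Step 1: part of the observed signal explained by the covariates."""
    sex_indicator = 1.0 if is_female else 0.0  # assumed coding: female = 1
    return (coefficients["length_cm"] * length_cm
            + coefficients["weight_kg"] * weight_kg
            + coefficients["is_female"] * sex_indicator)


def adjusted_value(raw_observed_value, length_cm, weight_kg, is_female):
    """Step 2: remove the explained part from the observed value."""
    explained = part_explained_by_non_disease_model(length_cm, weight_kg, is_female)
    return raw_observed_value - explained


# Hypothetical test subject; these numbers are not from the study.
new_value = adjusted_value(raw_observed_value=5.0, length_cm=170.0,
                           weight_kg=70.0, is_female=True)
print(f"adjusted value: {new_value:.5f}")
```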
1) Use the model generated from the control population to calculate the part of the observed signal that can be explained by, e.g., anthropometrics:
FABP4:
partOfSignalExplainedByNonDiseaseModel = -0.03638*[length in cm] + 0.03613*[weight in kg] + 0.36965*[sex, female yes/no]
2) Remove that part from the observed individual value, replacing e.g. the length or weight with the individual's characteristics:
newValue = rawObservedValue - partOfSignalExplainedByNonDiseaseModel
Example 4. Identification of biomarkers for disease.
Introduction
We have studied the contribution of genetics, lifestyle and medication on circulating levels of known and exploratory cardiovascular and inflammatory protein biomarkers to identify biomarkers associated with five non-communicable diseases: cataract, diabetes, hypertension, myocardial infarction and stroke. To this end, we used the Proximity Extension Assay (PEA) to estimate the abundance of 145 proteins in plasma of individuals from a longitudinal cross-sectional population-based study in Sweden. As described above, we determined the effect of a range of clinical variables and lifestyle factors, and also studied the heritability of each biomarker, and using high-resolution SNP array data coupled with imputation against the 1000 genomes reference panels, we performed a genome-wide association study (GWAS) for each biomarker. Through modelling of the relevant covariates for each of the biomarkers we were able to develop personally normalized plasma protein profiles (PNPPP), adjusted to take into account the individual genetic, clinical and lifestyle factors that affected the level of a biomarker within a subject. The adjusted values for each biomarker in the control population were compared with the values from individuals in the population cohort affected by each of the non-communicable diseases.
Material and Methods
Samples
Participants for the present study were selected from the NSPHS, and samples were taken and processed as described in Example 1.
Non-communicable diseases
The NSPHS cohort represents a cross-section of the inhabitants in the rural areas of northern Sweden and thus participants suffering from any non-communicable disease were not excluded from the study. The endpoints in this study are the self-reported diseases/conditions of Cataract, Diabetes (both type I and II), Myocardial infarction, High blood pressure and Stroke. The frequencies of these and baseline anthropometrics are reported in Table 8. The overlap between the individuals self-reporting multiple diseases is shown in Table 10.
Multiplexed proximity extension assay
Protein levels in plasma were analysed using the Olink Proseek Multiplex CVD 1 96x96 kit and quantified by real-time PCR using the Fluidigm BioMark™ HD real-time PCR platform as described above. Additional data from Example 1 was also included in this study, and the biomarkers studied are listed in Tables 4 and 6. Individual samples where at least one of the internal controls contained an outlier value were also excluded from further analyses (total n=7 of 976, 0.7%). We required at least 200 observations per protein above the detection limit in order to conduct the downstream statistical analyses, and therefore proteins with fewer observations were excluded from further analyses. Five proteins, PTX3 (Pentaxin-related protein PTX3), ITGB1BP2 (Integrin beta-1-binding protein 2), PAPPA (Pappalysin-1), NT-pro-BNP (N-terminal pro-B-type natriuretic peptide) and BNP (Natriuretic peptides B), had fewer than 200 observations above their detection limit. These five proteins were included neither in the analysis of factors influencing the variance nor in the genetic association analyses. Uniprot recommended short names have been used throughout when these are available; otherwise, the assay manufacturer's abbreviations have been used. All assay characteristics including detection limits and measurements of assay performance and validations are available from the manufacturer's webpage.
Genetic data
The KA06 and KA09 cohorts have been genotyped as described in Example 1. The two cohorts were imputed separately as outlined above. The input data was phased chromosome-wise using SHAPEIT (v2.r727). The reference panel used was the autosomal 1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) (National Center for Biotechnology Information build b37, Dec 2013) accessed from the IMPUTE Web resource. The resulting data was filtered as outlined above. The final dataset included 8,506,190 SNPs and INDELs.
ABO blood group assignment
We assigned blood groups as described in Example 1 .
Statistical analyses
All statistical analyses were conducted in R as outlined above. Significance levels for influencing variables were calculated separately for each measured protein by fitting a generalized linear model using the 'glm' function including all covariates simultaneously. The significance of each covariate's contribution to the total variance was estimated as described above. Covariates were considered significant for a specific protein if their Bonferroni-adjusted p-values were below 0.05 (p-value < 3.14 x 10^-4, 0.05/159). Each PEA measurement was individually adjusted for significant covariates and rank-transformed to normality by using the 'rntransform' function available from the R-package GenABEL (v1.8.0).
Strict Bonferroni-adjusted p-value thresholds were used to report significance in the discovery cohort (p-value < 5.88 x 10^-9, 0.05/8,506,190) and in the replication cohort (p-value < 0.05 / number of significant SNPs in the discovery cohort).
Normalized plasma profiles
For each of the non-communicable diseases we split the individuals into cases and controls. We then re-examined the significance levels for influencing variables for each measured protein in the controls alone by fitting a generalized linear model using the 'glm' function including all covariates simultaneously. In this step the posterior genotypic probabilities for the top-ranking marker from the combined protein-specific GWAS were also included as a possible confounder. The linear model generated for each protein was then applied to each individual in the cases as well as the controls and the resulting residuals were kept for further analysis.
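The normalization idea above, fitting a model in controls only and keeping residuals for everyone, can be illustrated with a deliberately simplified Python sketch. It uses a single covariate and closed-form least squares instead of the multi-covariate 'glm' fit described in the text; all names and data are hypothetical.

```python
def fit_simple_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x (one covariate only)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("covariate has no variation")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed value minus the value predicted by the control-derived model."""
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Hypothetical toy data: age as the only covariate, protein level as response.
control_age = [30.0, 40.0, 50.0, 60.0]
control_level = [1.0, 1.5, 2.0, 2.5]
case_age = [45.0, 55.0]
case_level = [2.75, 3.25]

# Fit in controls only, then apply the same model to controls and cases.
intercept, slope = fit_simple_linear(control_age, control_level)
control_residuals = residuals(control_age, control_level, intercept, slope)
case_residuals = residuals(case_age, case_level, intercept, slope)
print("case residuals:", [round(r, 3) for r in case_residuals])
```

Fitting on controls alone keeps any disease-related shift out of the covariate model, so that shift remains visible in the residuals of the cases.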
Results
Plasma protein profile normalization
The abundance of 145 individual proteins was measured in blood plasma from 976 individuals from the Northern Sweden Population Health Study (NSPHS) using PEA followed by quantification using qPCR. This represents data on the proteins studied in Examples 1 and 2. Of the 145 proteins, 125 had detectable levels in at least 80% of the individuals. Using information on a large set of covariates (n=159) measured in the NSPHS (individual clinical variables, lifestyle factors, medication and genetic variants), the covariates significantly influencing per-protein levels were determined as outlined in the above Examples. The significant covariates per protein were identified. For 94 of the 125 proteins we identified significant epidemiological associations. The heritability of protein abundance levels was estimated by evaluating the co-segregation of protein levels with the relatedness among individuals, using a polygenic model (see Methods for details). For 98 of the proteins the levels were significantly heritable (Bonferroni adjusted p-value < 0.05), with heritability estimates ranging from 0.19 to 0.74. The GWASs yielded genome-wide significant hits for 10 of the proteins unique to this study; 9 of these represent previously unreported associations, whilst we replicate the IL1R1 locus previously reported for ST2. The top genetic associations for the proteins unique to this study are listed in Table 7. Including the genetic hits described in Example 1 and shown in Table 3, 24 of 125 proteins with detectable levels had genome-wide significant hits. A total of 2970 hits were identified for the 24 proteins and annotated, including overlaps with previously reported associations according to the NHGRI's GWAS catalogue.
To search for biomarker candidates we focused on five non-communicable diseases present in our study cohort (Table 8), and for each disease used the unaffected individuals as controls to identify covariates with a significant effect on the abundance level for each of the 125 proteins (see Methods for details). We then constructed a per-protein linear model based upon these covariates, enabling us to calculate personally normalized plasma protein profiles (PNPPP) both for the affected and unaffected individuals. Case-control analyses were subsequently performed using unadjusted biomarker measurements and PNPPP. For all 145 proteins we also recorded the number of measurements that were below the detection limit in the cases and controls, and examined statistically proteins that had a significantly larger proportion of measurements above detection limit in the cases as compared to controls.
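One way to compare the proportion of measurements above the detection limit between cases and controls is a one-sided Fisher's exact test on a 2 x 2 table. The text does not name the test that was used, so the choice of test, the function name and the counts below are all assumptions for illustration.

```python
# Sketch: one-sided Fisher's exact test for "cases have a larger proportion
# of measurements above the detection limit than controls".
# The test choice and the counts are assumptions, not taken from the study.
from math import comb

def fisher_exact_greater(cases_above, cases_below, controls_above, controls_below):
    """P(at least `cases_above` above-limit cases) under the hypergeometric null."""
    n_cases = cases_above + cases_below
    n_above = cases_above + controls_above
    n_total = n_cases + controls_above + controls_below
    upper = min(n_cases, n_above)
    numerator = sum(comb(n_cases, k) * comb(n_total - n_cases, n_above - k)
                    for k in range(cases_above, upper + 1))
    return numerator / comb(n_total, n_above)

# Hypothetical counts: 30 of 85 cases versus 60 of 800 controls above the limit.
p_value = fisher_exact_greater(30, 55, 60, 740)
print(p_value)
```

A small p-value here would indicate that detectable levels are over-represented among the cases, which is the pattern reported for proteins such as BNP.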
Biomarkers for individual diseases
Cataract
Cataract is classified as either congenital or age-related, with the latter being the most common and accounting for over 75% of cases. Apart from age, prolonged steroid treatment or exposure to sunlight contributes to disease progression. Using unadjusted levels, 37 proteins showed significant differences between patients and controls, while PNPPP identified 8 proteins with increased levels compared to controls, and one protein with lower levels (Table 9). Six of the 9 proteins were adjusted for one or more significant covariates. MMP-12 (matrix metallopeptidase 12) was associated with higher levels in the cases and affected by 5 factors: age, systolic blood pressure (SBP), waist circumference, genetic factors (Table 7) and length. EGFR (Epidermal growth factor receptor) was the only protein that had significantly lower levels in cases compared to controls. Previous studies have reported that siRNA-mediated knockdown of EGFR in human lens epithelial cells reduces the risk of posterior capsular opacity after cataract surgery. In addition, two proteins, BNP (p-value 3.8 x 10^-14) and REG-4 (p-value 1.8 x 10^-7), were found to have a significantly higher number of observations above the detection limit in cases as compared to the controls (Table 9). BNP has previously been found in the aqueous humour of the eye and in higher amounts in patients with cataract than with glaucoma.
Diabetes
Diabetes type II is associated with age and lifestyle factors, such as weight (obesity), diet and exercise. This is consistent with the diabetes cases in our cohort, which are older and have a higher BMI than the controls (Table 8). Using unadjusted levels, 22 proteins showed significant differences between patients and controls, while use of PNPPP identified 4 proteins with higher abundance levels in cases than controls (Table 9). MMP-10 (matrix metallopeptidase 10) only showed a significant difference between cases and controls using the PNPPP methodology (with sex and smoking as significantly influencing factors in the controls), but not in the untransformed data. Higher levels of MMP-10 have previously been associated with microvascular complications such as nephropathy in type I diabetes. GDF-15 also showed higher levels in cases compared to controls. GDF-15 levels were significantly associated with length, weight, waist circumference, SBP, age, sample round, pregnancy and bile acid preparations (A05AA). GDF-15 has been proposed as a biomarker for cardiovascular disorders and has been shown to be correlated with a multitude of metabolic and anthropometric parameters including age, waist circumference, mean arterial pressure, fasting glucose, and fasting insulin in an obese cohort. The same study also found significantly increased levels of plasma GDF-15 in obese individuals compared to a control group and specifically in obese individuals with type II diabetes compared to obese individuals with normal glucose tolerance.
High blood pressure
High blood pressure, or primary hypertension, increases the risk of stroke, heart attack, heart failure, chronic kidney disease (CKD) and lower-limb circulation problems. Blood pressure can be lowered through medication or by changes in lifestyle: exercise, reduction of tobacco use and alcohol consumption, losing weight, changing food habits and reducing stress. Other factors such as sex, age and genetic variants also impact blood pressure. Using unadjusted abundance levels, 63 proteins showed significant differences between patients and controls, while use of PNPPP identified 32 proteins with higher abundance levels in cases than controls and 2 with lower levels (Table 9). The strongest association with hypertension was found for SPON1 (Spondin 1) (p-value < 8.7 x 10^-18), with no additional significant covariates. SPON1 has been proposed as a candidate gene for hypertension in hypertensive rats based on gene expression studies. Renin (REN) also showed significantly higher levels in cases after correcting for sex and medication (proton pump inhibitors (ATC:A02BC) and Digitalis glycosides (ATC:C01AA)). For six proteins there was a significantly higher number of observations above the detection limit in the cases as compared to controls (Table 9). Among these proteins were BNP (p-value 2.8 x 10^-20) and NT-pro-BNP (p-value 4.9 x 10^-11), both previously shown to be associated with risk of hypertension. Kidney failure may also result in elevated BNP and NT-pro-BNP levels and a strong indicator of this condition is blood creatine levels. However, in our cohort there was no difference (p > 0.05, Wilcoxon rank sum test) in the frequency of elevated levels between individuals with BNP or NT-pro-BNP above, or those below detection level.
Myocardial infarction
Acute myocardial infarction (AMI) is caused by rupture of atherosclerotic plaques, which leads to formation of a plug that blocks the blood supply to the myocardium. Several proteins have been proposed as predictive or diagnostic biomarkers of AMI, with Troponin being the most established for both prognosis and diagnosis. Using unadjusted levels, 32 proteins showed significant differences between patients and controls, while use of PNPPP identified 12 proteins with higher abundance levels in cases than controls and 2 with lower levels (Table 9). The endpoint in our study is having had AMI at any previous time point, and we therefore do not necessarily expect to find an overlap with diagnostic or prognostic biomarkers. However, GDF-15 (Growth Differentiation Factor 15) levels were increased, as often seen in response to injury or stress. We also detect higher levels of Midkine in the cases. High levels of Midkine have been associated with improved long-term survival after myocardial infarction and its therapeutic properties in relation to AMI have recently been reviewed. Two proteins only showed significant differences between cases and controls using PNPPP: MMP-10 and FGF-23 (Fibroblast growth factor 23). Higher levels of FGF-23 have been associated with mortality and cardiovascular events in patients with chronic kidney disease. FGF-23 regulates serum phosphate levels, where high levels of serum phosphate trigger FGF-23 production; elevated serum phosphate levels are common in patients after AMI and are associated with poorer prognosis. Three proteins, BNP (p-value 1.9 x 10^-21), NT-pro-BNP (p-value 3.3 x 10^-7) and REG-4 (p-value 1.9 x 10^-7), were found to have significantly more observations above detection limit in the cases compared to the controls (Table 3). Both BNP and NT-pro-BNP are known to be strong indicators of AMI.
Stroke
Biomarkers for stroke are scarce and often focus on the diagnosis, prediction of severity and therapy selection of ischemic stroke, although some efforts have been made to search for biomarkers differentiating between ischemic and haemorrhagic stroke. Using unadjusted abundance levels, 20 proteins showed significant differences between patients and controls, while use of PNPPP identified 6 proteins, all with higher abundance levels in cases than controls (Table 9). Two proteins were found only using the PNPPP: PIGF (Placental growth factor) and CXCL13 (C-X-C motif chemokine 13). Out of the proteins detected here, IL-6 (Interleukin 6) levels in plasma have previously been related to infarct volume, severity and early neurological deterioration. mRNA levels of Amphiregulin (coded by the AREG gene) and CXCL13 (CXCL13) have shown associations with ischemic stroke. Increased AREG expression has also been observed in ischemic stroke with hemorrhagic transformation and increased transcription of CXCL13 has been seen in response to ischemia. For one protein, BNP (Table 9), there was a significantly (p-value 2.1 x 10^-8) higher fraction of observations above detection limit in the cases relative to the controls. BNP has previously been shown to have elevated levels after ischemic stroke.
The PNPPP method for biomarker identification
Using the unadjusted abundance values we found a large number of epidemiological associations in proteins with significant case-control difference (Figure 4, Raw). The PNPPP method radically reduced the number of covariates for proteins with case-control difference (Figure 4, PNPPP). The average number of covariate associations per significant protein in the raw analyses ranged from 1.5 to 2.6 when restricted to age, sex, SBP, waist, weight and smoking. Using all 159 covariates investigated here, the average number of associations for the raw analysis ranged from 2.7 to 4.2. The same numbers for the PNPPP were 0.65 to 1.8 for the restricted set of covariates and 1.6 to 2.8 for the whole set. Employing the PNPPP also resulted in a dramatic reduction of the number of protein candidate biomarkers showing a difference between cases and controls. Considering all five diseases, the number of significant differences in abundance between cases and controls was reduced from 174 when using unadjusted (raw) levels to 64 when using the PNPPP method. The effect of using PNPPP varied between diseases, with the most striking effect seen in Cataract and Stroke (Figure 5). In general, the associations remaining using PNPPP also showed the largest effect sizes.
We identified between 4 and 34 biomarkers for each non-communicable disease using PNPPP (Table 9). For three of the five diseases we found proteins with lower levels in the cases compared to the controls, but for the vast majority of comparisons the levels in the cases were higher than in the controls (Table 9, Figure 5). The effect of using PNPPP in determining suitable cut-offs for a disease is illustrated for four proteins in Figure 6. Tissue plasminogen activator (t-PA, Figure 6A) has previously been associated with increased risk of Myocardial Infarction. Recombinant t-PA is an approved drug (ATC: B01AD) used for early treatment of ischaemic stroke. None of the individuals in our cohort is using this drug. Our covariate analysis showed that both weight and SBP influence circulating levels of t-PA in individuals that have not suffered a myocardial infarction. The association between the abundance level of t-PA and weight in the unadjusted raw values (Figure 6A, dotted line) prohibits the use of a weight-independent cut-off for identifying cases. However, using PNPPP (Figure 6A, solid lines) a constant cut-off could be applied. A second example is the abundance level of TIM (hepatitis A virus cellular receptor 1), which differed significantly between Cataract cases and controls, but did not differ after normalization for weight, SBP, age, waist, length, usage of Insulins and analogues for injection, fast-acting (ATC: A10AB) and genetic factors (Table 7) (Figure 6B). TIM is currently used as a biomarker for proximal tubular injury in renal diseases but has not been linked to cataract, suggesting that the associations seen here using the raw values were due primarily to differences in age, SBP and weight between cataract cases and controls. Similar patterns were seen for GDF-15 (Figure 6C) in relation to Diabetes and SBP and for Growth Hormone in relation to Hypertension and weight (Figure 6D). Growth Hormone levels in plasma are known to be lower in obese individuals compared to individuals with normal weight. Here, we have adjusted the growth hormone signals for sex, length, weight and adherence to a traditional lifestyle. In both these cases, the PNPPP allows for linear cut-offs to be applied.
Biomarker overlap between diseases
The diseases examined are all relatively common and a number of the individuals carry diagnoses for several of the diseases. The fraction of cases only diagnosed with one disease varies from 16% for Stroke to 60% for Hypertension (Table 10). The substantial fraction of individuals with multiple diagnoses implies that some biomarkers could be shared between disease groups. Indeed, among the proteins identified by PNPPP analysis there are examples of such shared biomarkers. The small number of individuals remaining when requiring no overlaps between end-points reduces the statistical power to detect proteins with case-control differences. Nevertheless, for Cataract and Hypertension, 4 and 23 proteins, out of the 9 and 33 proteins originally found with significant case-control differences respectively, retain their significance when restricting to single end-points (Table 9).
Discussion
We have developed a methodology for analysis of protein biomarkers for major non-communicable diseases that has advantages for identification of novel protein biomarkers, for prioritization of candidates for clinical validation and for the clinical application of biomarker analyses. Our approach is based on modelling the effect of genetic, anthropometric, clinical and lifestyle covariates on biomarker levels in non-affected individuals, and derives personally normalized plasma protein profiles (PNPPP). The PNPPP approach was able to replicate associations with previously known biomarkers for these diseases, as well as identify several novel biomarker candidates. This is despite the relatively small size of the cohort used and only moderate prevalence of the diseases studied, attesting to the usefulness of the approach.
In searching for novel biomarkers, the PNPPP procedure provides advantages by limiting the number of covariates included in the analysis and providing a set of protein candidate biomarkers for further validation whose variability is less affected by factors unrelated to disease. An inherent complication in the study of common diseases is that individuals may belong to several of the endpoint categories, reflecting the fact that elderly individuals in particular are diagnosed with multiple diseases. This is partly addressed by incorporating the use of medications in the models, where any effect of a medication for a partially overlapping disease would be accounted for in both the cases and the controls. We also tried to address this by including only individuals with a single diagnosis in the analysis, providing further focus on biomarkers with a disease-specific association. In our cohort, this is a harsh limiting factor and reduces statistical power. We still, however, retain the statistical significance for 70% of the detected proteins for Hypertension and 44% of the proteins for Cataract. For the other 3 diseases, the remaining number of individuals is below 25 and no proteins retained statistical significance.
Finally, the PNPPP procedure can aid in the clinical application of protein biomarkers. For many non-communicable diseases, anthropometrics and lifestyle related variables are strong risk factors. One such example is age or SBP for cataract. SBP has a p-value (7.7 x 10^-13) on a par with that of the best discriminating protein using raw values (TIM-1, 1.6 x 10^-15), and much lower than that of the best protein using the PNPPP values (EGFR, 5.4 x 10^-8). From Figure 6B (grey bars) it is clear that the cataract frequency increases with SBP up to a certain point, but also that none of the SBP-groups have more than 20% cataract incidence. This illustrates that proteins that are strongly correlated with anthropometrics or non-disease related lifestyle factors need to have their levels normalized in order to be clinically informative. Our results indicate that such normalization improves the utility of the biomarkers and facilitates their clinical interpretation. This is evident from the fact that 75% (94 of 125) of the proteins with measurable levels display significant epidemiological associations, and 19% (24 of 125) have genome-wide significant genetic associations. These results alone signal a paradigm shift in how biomarker studies could, and should, be designed. Information on which parameters affect a biomarker can be retrieved from a non-affected cohort and the same parameters are then used to normalize the values in the affected cohort. This differs conceptually from recruiting, for example, age- and gender-matched controls and allows for a more efficient use, and re-use, of control cohorts. The results will also impact on how the biomarkers are used clinically. Either the physician will set a cut-off depending on a predefined set of prerequisites, such as age, gender or ethnicity, or use a computer aid to recalculate the value based on models generated from non-affected individuals. The former system quickly becomes unfeasible when several factors need to be accounted for or when non-categorical variables such as age or weight are used.
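The "computer aid" option mentioned above can be sketched as a function that subtracts the level expected from an individual's covariates, using a model fitted in non-affected individuals, and then applies a single constant cut-off to the normalized value. The model coefficients, covariate names, measured levels and cut-off below are all hypothetical and chosen only to show the mechanics.

```python
# Sketch: recalculate a measured biomarker level as a personally normalized
# value and compare it with one constant cut-off.
# The model, the covariates and the cut-off are hypothetical.

def personally_normalized(raw_level, covariates, model):
    """Return raw level minus the level expected from the covariates."""
    expected = model["intercept"] + sum(
        coef * covariates[name] for name, coef in model["coef"].items())
    return raw_level - expected

# Hypothetical model fitted in non-affected individuals.
model = {"intercept": 2.0, "coef": {"weight_kg": 0.03, "sbp_mmhg": 0.01}}
cutoff = 1.0  # constant cut-off on the normalized scale (assumed)

# Two individuals with the same raw level but different covariates.
heavy = personally_normalized(6.0, {"weight_kg": 100, "sbp_mmhg": 140}, model)
light = personally_normalized(6.0, {"weight_kg": 60, "sbp_mmhg": 110}, model)

flags = {"heavy": heavy > cutoff, "light": light > cutoff}
print(heavy, light, flags)
```

In this made-up example the same raw value is unremarkable for the heavier individual with higher SBP but exceeds the cut-off for the lighter one, which is why a covariate-independent cut-off on raw values would not work.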
We have shown the effects of genetic, anthropometric, clinical and lifestyle covariates on circulating protein levels and introduce a methodology, PNPPP, for performing unbiased normalization of the protein levels. We have exemplified its application to five non-communicable diseases and single out both known and novel biomarkers with significant case-control differences. We propose that the PNPPP procedure provides a major advantage for prioritization among biomarker candidates, in that it identifies the most promising candidates for clinical validation on the basis of their being less affected by variation unrelated to the disease state.
Tables
Table 1. List of significant covariates. Direction of correlations was calculated using PEA-values without any additional covariate correction having been carried out. A10AD (Insulins and analogues for injection, intermediate-acting combined with fast-acting), C03DA (Aldosterone antagonists), C08CA (Dihydropyridine derivatives), N05CD (Benzodiazepine derivatives), N05CF (Benzodiazepine related drugs), R03BA (Glucocorticoids), R03AC (Selective beta-2-adrenoreceptor agonists), R03DC (Leukotriene receptor antagonists)
Table 2. GWAS results. 1 Heritability estimate. 2 Fraction of variance explained in the adjusted and transformed phenotype by the top-ranking SNP (SNP with lowest p-value in the combined analysis). 3 Estimation of the inflation factor for the resulting distribution of p-values. P-values were calculated from 1 df Wald statistics chi-square values.
Table 3. Location and annotation of top GWAS hits. 1 In hg19 coordinates. 2 Independent genome wide significant loci as per conditional analysis on top-snp, reported p-values after conditioning on the top-hit. 3 In perfect LD (R2 = 1) with top-snp in our cohort. 4 rs11465293 was not discovered in the unconditional discovery-replication analysis and was subsequently rerun in a conditional discovery-replication analysis resulting in the p-values 3.9 x 10^-9 and 6.7 x 10^-5 for the discovery and replication cohorts respectively. 5 Imputed. 6 When the top SNP(s) was imputed, the top ranking genotyped SNP was also included in the table. P-values were calculated from 1 df Wald statistics chi-square values.
Table 4. A list of the 92 proteins quantified by PEA in Example 1.
Protein information: ShortName, FullName, Uniprot, Uniprot name. Gene information: HGNC, Ensembl.
ShortName | FullName | Uniprot | Uniprot name | HGNC | Ensembl
ADM | Adrenomedullin | P35318 | ADML_HUMAN | ADM | ENSG00000148926
AR | Amphiregulin | P15514 | AREG_HUMAN | AREG | ENSG00000109321
BAFF | B-cell activating factor | Q9Y275 | TN13B_HUMAN | TNFSF13B | ENSG00000102524
BTC | Betacellulin | P35070 | BTC_HUMAN | BTC | ENSG00000174808
CA-125 | Ovarian cancer-related tumor marker 125 | Q8WXI7 | MUC16_HUMAN | MUC16 | ENSG00000181143
CA242 | CA242 tumor marker, Cancer Antigen 242 | | | |
CAIX | Carbonic Anhydrase IX | Q16790 | CAH9_HUMAN | CA9 | ENSG00000107159
CASP-3 | Caspase-3 | P42574 | CASP3_HUMAN | CASP3 | ENSG00000164305
CCL21 | C-C motif chemokine 21 | O00585 | CCL21_HUMAN | CCL21 | ENSG00000137077
CCL24 | C-C motif chemokine 24 | O00175 | CCL24_HUMAN | CCL24 | ENSG00000106178
CCL19 | C-C motif chemokine 19 | Q99731 | CCL19_HUMAN | CCL19 | ENSG00000172724
CD30-L | Tumor necrosis factor ligand superfamily member 8 | P32971 | TNFL8_HUMAN | TNFSF8 | ENSG00000106952
CD40-L | CD40 ligand | P29965 | CD40L_HUMAN | CD40LG | ENSG00000102245
CD62E | E-selectin | P16581 | LYAM2_HUMAN | SELE | ENSG00000007908
CD69 | Early activation antigen CD69 | Q07108 | CD69_HUMAN | CD69 | ENSG00000110848
CEA | Carcinoembryonic antigen | P06731 | CEAM5_HUMAN | CEACAM5 | ENSG00000105388
CPI-B | Cystatin B | P04080 | CYTB_HUMAN | CSTB | ENSG00000160213
CSF-1 | Macrophage colony-stimulating factor 1 | P09603 | CSF1_HUMAN | CSF1 | ENSG00000184371
CTSD | Cathepsin D | P07339 | CATD_HUMAN | CTSD | ENSG00000117984
CXCL10 | C-X-C motif chemokine 10 | P02778 | CXL10_HUMAN | CXCL10 | ENSG00000169245
CXCL11 | C-X-C motif chemokine 11 | O14625 | CXL11_HUMAN | CXCL11 | ENSG00000169248
CXCL13 | C-X-C motif chemokine 13 | O43927 | CXL13_HUMAN | CXCL13 | ENSG00000156234
CXCL5 | C-X-C motif chemokine 5 | P42830 | CXCL5_HUMAN | CXCL5 | ENSG00000163735
CXCL9 | C-X-C motif chemokine 9 | Q07325 | CXCL9_HUMAN | CXCL9 | ENSG00000138755
EGF | Epidermal growth factor | P01133 | EGF_HUMAN | EGF | ENSG00000138798
EGFR | Epidermal growth factor receptor | P00533 | EGFR_HUMAN | EGFR | ENSG00000146648
EMMPRIN | Extracellular matrix metalloproteinase inducer | P35613 | BASI_HUMAN | BSG | ENSG00000172270
Ep-CAM | Epithelial cell adhesion molecule | P16422 | EPCAM_HUMAN | EPCAM | ENSG00000119888
EPO | Erythropoietin | P01588 | EPO_HUMAN | EPO | ENSG00000130427
EPR | Epiregulin | O14944 | EREG_HUMAN | EREG | ENSG00000124882
ER | Estrogen receptor | P03372 | ESR1_HUMAN | ESR1 | ENSG00000091831
ErbB2 | Receptor tyrosine-protein kinase ErbB-2 | P04626 | ERBB2_HUMAN | ERBB2 | ENSG00000141736
ErbB3 | Receptor tyrosine-protein kinase ErbB-3 | P21860 | ERBB3_HUMAN | ERBB3 | ENSG00000065361
ErbB4 | Receptor tyrosine-protein kinase ErbB-4 | Q15303 | ERBB4_HUMAN | ERRB4 | ENSG00000178568
FABP4 | Fatty acid binding protein 4 adipocyte | P15090 | FABP4_HUMAN | FABP4 | ENSG00000170323
FAS | Tumor necrosis factor receptor superfamily member 6 | P25445 | TNR6_HUMAN | FAS | ENSG00000026103
FasL | Fas antigen ligand | P48023 | TNFL6_HUMAN | FASLG | ENSG00000117560
Flt3L | Fms-related tyrosine kinase 3 ligand | P49771 | FLT3L_HUMAN | FLT3LG | ENSG00000090554
FR-alpha | Folate receptor alpha | P15328 | FOLR1_HUMAN | FOLR1 | ENSG00000110195
FS | Follistatin | P19883 | FST_HUMAN | FST | ENSG00000134363
Gal-3 | Galectin-3 | P17931 | LEG3_HUMAN | LGALS3 | ENSG00000131981
GDF-15 | Growth/differentiation factor 15 | Q99988 | GDF15_HUMAN | GDF15 | ENSG00000130513
GH | Growth Hormone | P01241 | SOMA_HUMAN | GH1 | ENSG00000259384
GM-CSF | Granulocyte-macrophage colony-stimulating factor | P04141 | CSF2_HUMAN | CSF2 | ENSG00000164400
HB-EGF | Heparin-binding EGF-like growth factor | Q99075 | HBEGF_HUMAN | HBEGF | ENSG00000113070
HE4 | Epididymal secretory protein E4 | Q14508 | WFDC2_HUMAN | WFDC2 | ENSG00000101443
HGF | Hepatocyte growth factor | P14210 | HGF_HUMAN | HGF | ENSG00000019991
HGF receptor | Hepatocyte growth factor receptor | P08581 | MET_HUMAN | MET | ENSG00000105976
hK11 | Kallikrein-11 | Q9UBX7 | KLK11_HUMAN | KLK11 | ENSG00000167757
IFN-gamma | Interferon gamma | P01579 | IFNG_HUMAN | IFNG | ENSG00000111537
IL-12 | Interleukin 12 | P29460 | IL12B_HUMAN | IL-12A, IL-12B |
IL-1ra | Interleukin 1 receptor antagonist protein | P18510 | IL1RA_HUMAN | IL1RN | ENSG00000136689
IL-2 | Interleukin 2 | P60568 | IL2_HUMAN | IL2 | ENSG00000109471
IL-4 | Interleukin 4 | P05112 | IL4_HUMAN | IL4 | ENSG00000113520
IL-6 | Interleukin 6 | P05231 | IL6_HUMAN | IL6 | ENSG00000136244
IL-7 | Interleukin 7 | P13232 | IL7_HUMAN | IL7 | ENSG00000104432
IL-8 | Interleukin 8 | P10145 | IL8_HUMAN | IL8 | ENSG00000169429
IL17RB | Interleukin 17 receptor B | Q9NRM6 | I17RB_HUMAN | IL17RB | ENSG00000056736
IL2RA | Interleukin 2 receptor subunit alpha | P01589 | IL2RA_HUMAN | IL2RA | ENSG00000134460
IL6RA | Interleukin 6 receptor subunit alpha | P08887 | IL6RA_HUMAN | IL6R | ENSG00000160712
KLK6 | Kallikrein-6 | Q92876 | KLK6_HUMAN | KLK6 | ENSG00000167755
LIGHT | Tumor necrosis factor ligand superfamily member 14 | O43557 | TNF14_HUMAN | TNFSF14 | ENSG00000125735
MCP-1 | Monocyte chemotactic protein-1 | P13500 | CCL2_HUMAN | CCL2 | ENSG00000108691
MIA | Melanoma-derived growth regulatory protein | Q16674 | MIA_HUMAN | MIA | ENSG00000261857
MIC-A | MHC class I polypeptide-related sequence A | Q29983 | MICA_HUMAN | MICA | ENSG00000204520
MK | Midkine | P21741 | MK_HUMAN | MKD | ENSG00000110492
MMP-3 | Matrix metalloproteinase-3 | P08254 | MMP3_HUMAN | MMP3 | ENSG00000149968
MPO | Myeloperoxidase | P05164 | PERM_HUMAN | MPO | ENSG00000005381
MYD88 | Myeloid differentiation primary response protein MyD88 | Q99836 | MYD88_HUMAN | MYD88 | ENSG00000172936
OPG | Osteoprotegerin | O00300 | TR11B_HUMAN | TNFRSF11B | ENSG00000164761
PDGF subunit B | Platelet-derived growth factor subunit B | P01127 | PDGFB_HUMAN | PDGFB | ENSG00000100311
PECAM-1 | Platelet endothelial cell adhesion molecule | P16284 | PECA1_HUMAN | PECAM1 | ENSG00000261371
PIGF | Placenta Growth Factor | P49763 | PLGF_HUMAN | PGF | ENSG00000119630
PRL | Prolactin | P01236 | PRL_HUMAN | PRL | ENSG00000172179
PRSS8 | Prostasin | Q16651 | PRSS8_HUMAN | PRSS8 | ENSG00000052344
PSA | Prostate-specific antigen | P07288 | KLK3_HUMAN | KLK3 | ENSG00000142515
REG-4 | Regenerating islet-derived protein 4 | Q9BYZ8 | REG4_HUMAN | REG4 | ENSG00000134193
SCF | Stem cell factor | P21583 | SCF_HUMAN | KITLG | ENSG00000049130
TF | Tissue Factor | P13726 | TF_HUMAN | F3 | ENSG00000117525
TGB-beta-1 | Latency-associated peptide TGF beta 1 | P01137 | TGFB1_HUMAN | TGFB1 | ENSG00000105329
TGF-alpha | Transforming growth factor alpha | P01135 | TGFA_HUMAN | TGFA | ENSG00000163235
THPO | Thrombopoietin | P40225 | TPO_HUMAN | THPO | ENSG00000090534
TIE2 | Angiopoietin-1 receptor | Q02763 | TIE2_HUMAN | TEK | ENSG00000120156
TNF | Tumor necrosis factor alpha | P01375 | TNFA_HUMAN | TNF | ENSG00000232810
TNF-R1 | Tumor necrosis factor receptor 1 | P19438 | TNR1A_HUMAN | TNFRSF1A | ENSG00000067182
TNF-R2 | Tumor necrosis factor receptor 2 | P20333 | TNR1B_HUMAN | TNFRSF1B | ENSG00000028137
TNFRSF4 | Tumor necrosis factor receptor superfamily member 4 | P43489 | TNR4_HUMAN | TNFRSF4 | ENSG00000186827
TR-AP | Tartrate-resistant acid phosphatase type 5 | P13686 | PPA5_HUMAN | ACP5 | ENSG00000102575
U-PAR | Urokinase plasminogen activator surface receptor | Q03405 | UPAR_HUMAN | PLAUR | ENSG00000011422
VEGF-A | Vascular endothelial growth factor A | P15692 | VEGFA_HUMAN | VEGFA | ENSG00000112715
VEGF-D | Vascular endothelial growth factor D | O43915 | VEGFD_HUMAN | FIGF | ENSG00000165197
VEGFR-2 | Vascular endothelial growth factor receptor 2 | P35968 | VGFR2_HUMAN | KDR | ENSG00000128052
Table 5. List of genetic variants identified in the present study.
CCL24 chr7 75519260 rs62479398 C T intergenic
CCL24 chr7 75534963 rs56152363 G A intergenic
CCL24 chr7 75544247 rs72553971 A C upstream
CCL24 chr7 75564606 rs62475266 T C intronic
CCL24 chr7 75611260 rs41299505 T C intronic
CCL24 chr7 75611678 rs41299517 G A intronic
CCL24 chr7 75617223 rs147613710 A G intronic
CCL24 chr7 75620110 rs139606068 T C intronic
CCL24 chr7 75622854 rs146941240 C G intronic
CCL24 chr7 75631315 rs142620045 A G intronic
CCL24 chr7 75642483 rs138257623 C A intronic
CCL24 chr7 75680965 rs145803349 A C intronic
CCL24 chr7 75683402 rs116839932 A G intronic
CCL24 chr7 75695081 rs3779419 A G intronic
CCL24 chr7 75698389 rs7805762 C G intergenic
CCL24 chr7 75717164 chr7:75717164:D G GA intergenic
CCL24 chr7 75731482 rs141128591 G C intergenic
CCL24 chr7 75761311 rs142919684 T C intergenic
CCL24 chr7 75795218 rs115035619 A G intergenic
CCL24 chr7 75803132 rs4728587 T C intergenic
CCL24 chr7 75841589 rs4728614 T C intronic
CCL24 chr7 75846453 rs10952869 G A intronic
CCL24 chr7 76429959 rs10215877 C T intergenic
CCL24 chr7 76958730 rs10274222 A G intronic
CCL24 chr7 77026460 rs2286156 A G intronic
CD40-L chrX 135510059 rs76416377 T C intergenic
CD40-L chrX 135598869 rs35896879 T C intergenic
Table 6. A list of the 92 proteins quantified by PEA in Example 2.
Adrenomedullin (AM)
Agouti-related protein (AGRP)
Angiopoietin-1 receptor (TIE2)
Beta-nerve growth factor (Beta-NGF)
Cathepsin D (CTSD)
Caspase-8 (CASP-8)
Cathepsin L1 (CTSL1)
C-C motif chemokine 20 (CCL20)
C-C motif chemokine 3 (CCL3)
C-C motif chemokine 4 (CCL4)
CD40 ligand (CD40L)
Chitinase-3-like protein 1 (CHI3L1)
C-X-C motif chemokine 1 (CXCL1)
C-X-C motif chemokine 6 (CXCL6)
C-X-C motif chemokine 16 (CXCL16)
Cystatin-B (CSTB)
Dickkopf-related protein 1 (Dkk-1)
Endothelial cell-specific molecule 1 (ESM-1)
Eosinophil cationic protein (ECP)
Epidermal growth factor (EGF)
E-selectin (SELE)
Fatty acid-binding protein, adipocyte
Fibroblast growth factor 23 (FGF-23)
Follistatin (FS)
Fractalkine (CX3CL1)
Galanin peptides (GAL)
Galectin-3 (Gal-3)
Growth hormone (GH)
Growth/differentiation factor 15 (GDF-15)
Heat shock 27 kDa protein (HSP 27)
Heparin-binding EGF-like growth factor (HB-EGF)
Hepatocyte growth factor (HGF)
Interleukin-1 receptor antagonist protein (IL-1ra)
Interleukin-18 (IL-18)
Interleukin-27 subunit alpha (IL27-A)
Interleukin-4 (IL-4)
Interleukin-6 receptor subunit alpha (IL-6RA)
Interleukin-6 (IL-6)
Interleukin-8 (IL-8)
Kallikrein-11 (hK11)
Kallikrein-6 (KLK6)
Lectin-like oxidized LDL receptor 1 (LOX-1)
Leptin (LEP)
Macrophage colony-stimulating factor 1 (CSF-1)
Matrix metalloproteinase-1 (MMP-1)
Matrix metalloproteinase-10 (MMP-10)
Matrix metalloproteinase-12 (MMP-12)
Matrix metalloproteinase-3 (MMP-3)
Matrix metalloproteinase-7 (MMP-7)
Melusin (ITGB1BP2)
Membrane-bound aminopeptidase P (mAmP)
Monocyte chemotactic protein 1 (MCP-1)
Myeloperoxidase (MPO)
Myoglobin (MB)
Natriuretic peptides B (BNP)
NF-kappa-B essential modulator (NEMO)
N-terminal pro-B-type natriuretic peptide (NT-pro-BNP)
Osteoprotegerin (OPG)
Ovarian cancer-related tumor marker CA 125 (CA-125)
Pappalysin-1 (PAPPA)
Pentraxin-related protein PTX3 (PTX3)
Placenta growth factor (PIGF)
Platelet endothelial cell adhesion molecule (PECAM-1)
Platelet-derived growth factor subunit B (PDGF subunit B)
Interleukin-16 (IL16)
Prolactin (PRL)
Protein S100-A12 (EN-RAGE)
Proteinase-activated receptor 1 (PAR-1)
P-selectin glycoprotein ligand 1 (PSGL-1)
Proto-oncogene tyrosine-protein kinase Src (SRC)
Receptor for advanced glycosylation end products (RAGE)
Renin (REN)
Resistin (RETN)
SIR2-like protein (SIRT2)
Spondin-1 (SPON1)
ST2 protein (ST2)
Stem cell factor (SCF)
Thrombomodulin (TM)
TIM-1 (TIM)
Tissue factor (TF)
Tissue-type plasminogen activator (t-PA)
TNF-related activation-induced cytokine (TRANCE)
TNF-related apoptosis-inducing ligand (TRAIL)
TNF-related apoptosis-inducing ligand receptor 2 (TRAIL-R2)
Tumor necrosis factor receptor 1 (TNF-R1)
Tumor necrosis factor ligand superfamily member 14 (TNFSF14)
Tumor necrosis factor receptor 2 (TNF-R2)
Tumor necrosis factor receptor superfamily member 5 (CD40)
Tumor necrosis factor receptor superfamily member 6 (FAS)
Urokinase plasminogen activator surface receptor (U-PAR)
Vascular endothelial growth factor A (VEGF-A)
Vascular endothelial growth factor D (VEGF-D)
Table 7. Genetic associations (top hits) with abundance of the proteins unique to the study in Example 4.
Table 8. Baseline data for the five non-communicable diseases in the study cohort of Example 4.
Table 9. Significant associations of protein abundance levels with disease status identified using the personally normalized plasma protein profiles (PNPPP) methodology in Example 4. Bold faced proteins indicate association only seen using PNPPP.
Protein remains significant when individuals with multiple diseases are removed.
Table 10. Number of individuals overlapping between endpoints. Numbers within parentheses indicate individuals not overlapping with any of the other 4 diseases.
              Diabetes   Cataract   Hypertension   Myocardial   Stroke
Diabetes      78 (23)    17         45             13           7
Cataract      -          85 (39)    36             13           5
Hypertension  -          -          240 (136)      37           12
Myocardial    -          -          -              67 (18)      12
Stroke        -          -          -              -            37 (6)

Claims

1. A method for determining an individualised normal level of a biomarker for a test subject for use in analysis of said biomarker in the diagnosis or monitoring of a disease or its treatment in said subject, said method comprising:
(a) determining the level of a biomarker in samples of a body fluid or tissue in a control population free from said disease, to obtain a set of control abundance levels for said biomarker in a said sample;
(b) analysing the control biomarker abundance levels of step (a) with respect to one or more non-disease related phenotypic factors to determine which phenotypic factors have a statistically significant effect on the biomarker abundance levels in said control population thereby to identify phenotypic covariates for said biomarker, and performing a statistical analysis to determine the effect of any such phenotypic covariate(s) identified on the variance of the control abundance levels;
(c) optionally transforming the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, to obtain a normal distribution;
(d) using the normalised residual control abundance level values from step (c) or the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, in a step of statistical analysis of genetic data comprising genetic variants identified in said control population to determine whether any one or more non-disease related genetic covariate(s) have an effect on the abundance levels of said biomarker in said control population;
(e) generating a model which is capable of adjusting an abundance level of a said biomarker in a said sample for the effect of the phenotypic and/or genetic covariates identified in steps (b) and (d);
(f) assessing the phenotype and/or genotype of the test subject with respect to the phenotypic and/or genetic covariates for said biomarker identified in steps (b) and (d) to determine the individual phenotypic and/or genetic covariates for said test subject; and
(g) using the model of step (e) to determine a value for a normal level for the abundance of the biomarker in a said sample from the said test subject having said individual phenotype and genotype as determined in step (f), thereby to determine an individualised normal level of the biomarker for said test subject.
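By way of illustration only, steps (b), (e) and (g) of claim 1 might be sketched as follows. This is a minimal sketch, assuming that the model of step (e) is an ordinary least-squares linear model of the control abundance levels on the covariates; the function names, the choice of age and BMI as covariates, and all numerical values are hypothetical and are not taken from the application. Under the same assumption, a genetic covariate identified in step (d) could enter as a further column, for example an allele count of 0, 1 or 2.

```python
def fit_linear_model(covariate_rows, levels):
    """Ordinary least squares via the normal equations (intercept added here)."""
    design = [[1.0] + list(row) for row in covariate_rows]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, levels)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for row in range(col + 1, p):
            factor = xtx[row][col] / xtx[col][col]
            for k in range(col, p):
                xtx[row][k] -= factor * xtx[col][k]
            xty[row] -= factor * xty[col]
    # Back substitution
    beta = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(xtx[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (xty[i] - tail) / xtx[i][i]
    return beta


def individualised_normal_level(beta, covariates):
    """Model prediction for one subject: intercept plus covariate effects."""
    return beta[0] + sum(b * c for b, c in zip(beta[1:], covariates))


# Hypothetical disease-free controls: (age in years, BMI) and biomarker level.
control_covariates = [(30, 22.0), (40, 25.0), (50, 21.0), (60, 30.0), (45, 27.0)]
control_levels = [2.0 + 0.05 * age + 0.1 * bmi for age, bmi in control_covariates]

beta = fit_linear_model(control_covariates, control_levels)
# Residuals are the control levels adjusted for the phenotypic covariates.
residuals = [y - individualised_normal_level(beta, row)
             for row, y in zip(control_covariates, control_levels)]
# Individualised normal level for a hypothetical test subject aged 55 with BMI 24.
normal_level = individualised_normal_level(beta, (55, 24.0))
```

Because the synthetic control levels here are noise-free, the residuals are essentially zero; with real measurements the residuals carry the variation that is not explained by the covariates and would be passed on to steps (c) and (d).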
2. A method of generating a model which is capable of adjusting an abundance level of a biomarker in a sample of a body tissue or fluid for the effect of phenotypic and/or genetic covariates which affect the level of said biomarker in said sample, said method comprising:
(a) determining the level of a biomarker in samples of a body fluid or tissue in a control population free from said disease, to obtain a set of control abundance levels for said biomarker in a said sample;
(b) analysing the control biomarker abundance levels of step (a) with respect to one or more non-disease related phenotypic factors to determine which phenotypic factors have a statistically significant effect on the biomarker abundance levels in said control population thereby to identify phenotypic covariates for said biomarker, and performing a statistical analysis step to determine the effect of any such phenotypic covariate(s) identified on the variance of the control abundance levels;
(c) optionally rank-normally transforming the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, to obtain a normal distribution;
(d) using the normalised residual control abundance level values from step (c) or the residual abundance level values from step (b), which have been adjusted for the effect of the phenotypic covariates, in a step of statistical analysis of genetic data comprising genetic variants identified in said control population to determine whether any one or more non-disease related genetic covariate(s) have an effect on the abundance levels of said biomarker in said control population; and
(e) generating a model which is capable of adjusting an abundance level of a said biomarker in a said sample for the effect of the phenotypic and/or genetic covariates identified in steps (b) and (d).
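By way of illustration only, steps (c) and (d) of claim 2 might be sketched as follows. This is a minimal sketch, assuming a rank-based inverse normal transform with the offset (rank - 0.5) / n and tied values sharing their average rank, followed by a least-squares slope of the transformed residuals on allele dosage for a single variant; the offset, the function names and the data are hypothetical choices and are not specified in the application. A genome-wide analysis would repeat the per-variant step for every variant and assess significance with a correction for multiple testing, which is omitted here.

```python
from statistics import NormalDist, mean


def rank_normal_transform(values):
    """Map values to normal quantiles by rank; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based rank
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    standard_normal = NormalDist()
    # (rank - 0.5) / n keeps every quantile strictly inside (0, 1).
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


def allele_effect(dosages, phenotype):
    """Least-squares slope of the phenotype on allele dosage (0, 1 or 2 copies)."""
    d_mean, y_mean = mean(dosages), mean(phenotype)
    covariance = sum((d - d_mean) * (y - y_mean) for d, y in zip(dosages, phenotype))
    variance = sum((d - d_mean) ** 2 for d in dosages)
    return covariance / variance


# Hypothetical covariate-adjusted residuals for six controls and their genotypes.
residuals = [0.9, -1.2, 0.1, 2.4, -0.3, 0.4]
dosages = [1, 0, 1, 2, 0, 1]

transformed = rank_normal_transform(residuals)
effect = allele_effect(dosages, transformed)
```

The transform preserves the ordering of the residuals while forcing their distribution towards normality, which is why it is applied before the genetic association step rather than after it.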
3. A method for determining an individualised normal level of a biomarker for a test subject for use in analysis of said biomarker in the diagnosis or monitoring of a disease or its treatment in said subject, said method comprising:
(a) assessing the phenotype and/or genotype of the test subject with respect to phenotypic factors and/or genetic variants which have been identified as being phenotypic and/or genetic covariates for the abundance level of the biomarker in a said sample (more particularly, as defined in steps (b) and (d) of claim 2), to determine the individual phenotypic and/or genetic covariates for said test subject;
(b) using the model obtained in claim 2 to determine a value for a normal level for the abundance of the biomarker in a said sample from the said test subject having said individual phenotype and genotype as determined in step (f), thereby to determine an individualised normal level of the biomarker for said test subject.
4. A method of detecting a biomarker in a test subject, said method comprising:
(a) determining the level of the biomarker in a body fluid or tissue sample of said test subject;
(b) using the model of claim 2 to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and/or genetic covariates on the abundance level of the biomarker in a said sample; and
(c) comparing said adjusted value to an individualised normal level for the marker in a said sample obtained according to the method of claim 1 or claim 3.
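By way of illustration only, the comparison of step (c) of claim 4 might be sketched as follows. The claim does not prescribe how the comparison is made; this minimal sketch assumes, as one possible reading, that the deviation of the measured level from the individualised normal level is scaled by the standard deviation of the covariate-adjusted residuals in the control population, and that a cutoff of 1.96 is applied. The function name, the cutoff and all numerical values are hypothetical.

```python
from statistics import stdev


def deviation_score(measured, individual_normal, control_residuals):
    """Deviation from the individualised normal level in control-residual SD units."""
    return (measured - individual_normal) / stdev(control_residuals)


# Hypothetical covariate-adjusted residuals from the disease-free control population.
control_residuals = [-0.4, 0.1, 0.3, -0.2, 0.2]

# Hypothetical test subject: measured level and individualised normal level.
score = deviation_score(measured=8.05, individual_normal=7.15,
                        control_residuals=control_residuals)

# Assumed cutoff: flag levels more than 1.96 SD away from the individual's normal level.
flagged = abs(score) > 1.96
```

Scaling by the residual spread, rather than by the raw spread of the control levels, means the cutoff reflects only the variation that remains after the phenotypic and genetic covariates have been accounted for.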
5. A method of diagnosing or monitoring a disease, or the treatment thereof, in a subject, said method comprising detecting the presence of a biomarker in said subject using the method of claim 4.
6. The method of any one of claims 1 to 5, wherein said marker is or comprises a protein.
7. The method of any one of claims 1 to 6, wherein the sample is blood or a blood-derived product, urine, faeces, saliva, joint fluid, cerebrospinal fluid, semen, vaginal secretion, sputum, gastric contents, a tissue biopsy sample, a tissue homogenate, an aspirate, swab or rinsate.
8. The method of claim 7, wherein the sample is blood serum or plasma or a blood fraction.
9. The method of any one of claims 1 to 8, wherein the phenotypic factors or covariates are selected from anthropomorphic characteristics, clinical parameters, medication, lifestyle factors and/or sample round.
10. The method of claim 9, wherein
(i) the anthropomorphic characteristics are selected from age, gender, and size-related characteristics, including height, weight, hip size, waist size and body-mass index (BMI), and/or any combination or ratio thereof; and/or
(ii) the clinical parameters are selected from blood pressure, blood group, one or more organ function tests, bone density, levels of test analytes, metabolite levels, allergens, and pregnancy or any combination thereof; and/or
(iii) the lifestyle factors are selected from smoking, use of recreational drugs, alcohol consumption, occupation, presence of household pets, diet and exercise or any combination thereof.
11. The method of any one of claims 1, 2 or 6 to 10, which method comprises a step of performing one or more genetic tests to identify any genetic variants present in the control population, wherein said genetic test(s) are performed prior to step (d) of carrying out the statistical analysis of genetic data.
12. The method of any one of claims 1 to 11, wherein said genetic variants comprise SNPs, deletions, insertions, CNVs, and/or structural variations.
13. The method of any one of claims 1 to 12, wherein the model is capable of adjusting the biomarker abundance level for the effect of both phenotypic and genetic covariates.
14. The method of any one of claims 1 to 13, wherein the phenotype and genotype of the test subject are assessed with respect to the phenotypic and genetic covariates for said biomarker.
15. The method of claim 14, wherein the step of genotyping the test subject comprises sequencing one or more genomic sequences and/or determining the presence of one or more predetermined genetic variants.
16. The method of any one of claims 1 to 15, wherein the method is carried out for or uses a combination of two or more biomarkers.
17. The method of any one of claims 1 to 16, wherein in step (a) the levels of biomarker are determined using an immunoassay.
18. The method of any one or more of claims 1 to 17, wherein step (b) of claims 1 or 2 comprises performing a multiple linear regression analysis.
19. The method of claim 18, wherein in step (a) an ANOVA test is performed.
20. The method of any one of claims 1 to 19, wherein the step of analysing the genetic data (step (d) of claims 1 and 2) includes a genome-wide association study.
21. The method of any one of claims 1 to 20, wherein the step of analysing the genetic data (step (d) of claims 1 and 2) includes an imputation step.
22. The method of any one of claims 1 to 21, wherein the genetic data (in step (d) of claims 1 and 2) further comprises data on the heritability of the biomarker.
23. The method of claim 22, wherein step (d) further comprises a step of determining the heritability of the biomarker in the control population.
24. A method of identifying a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject, said method comprising:
(a) determining the level of a candidate biomarker in a body fluid or tissue sample of a subject with said disease;
(b) using the model of claim 2 to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and/or genetic covariates on the abundance level of the biomarker in a said sample;
(c) comparing the adjusted value from (b) to an adjusted value for the abundance level of said biomarker in a sample from a subject free from said disease; and
(d) determining whether there is a difference between said adjusted values (i.e. the adjusted value for the subject with said disease and the adjusted value for the subject free from said disease);
wherein the presence of a difference between said adjusted values identifies the candidate biomarker as a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject.
25. The method of claim 24, wherein the adjusted value for the abundance level of said biomarker in a sample from a subject free from said disease is obtained by:
(a) determining the level of the candidate biomarker in a body fluid or tissue sample of a subject free from said disease; and
(b) using the model of claim 2 to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and/or genetic covariates on the abundance level of the biomarker in a said sample.
26. The method of claim 24 or 25, being a method of identifying a biomarker for use in the diagnosis or monitoring of a disease or its treatment in a subject, said method comprising:
(a') determining the level of a candidate biomarker in samples of a body fluid or tissue in a control population free from said disease to obtain a set of control abundance levels for said biomarker in said samples;
(b') using the model of claim 2 to calculate a set of adjusted abundance values for said biomarker, wherein said values are adjusted for the effects of phenotypic and/or genetic covariates on the abundance level of a biomarker in said samples;
(c') determining the level of a candidate biomarker in a body fluid or tissue sample of a subject with said disease;
(d') using the model of claim 2 to calculate an adjusted value for the abundance level of said biomarker in a said sample, wherein said value is adjusted for the effect of phenotypic and genotypic covariates identified in step (b');
(e') comparing said adjusted value of step (d') to the set of adjusted values of step (b') for the biomarker in a sample from a subject free from said disease; and
(f) determining whether there is a difference between said adjusted value from the subject with said disease, and the adjusted values from the control population;
wherein the presence of a difference between said adjusted values identifies the candidate biomarker as a biomarker for use in the diagnosis or monitoring of said disease or its treatment in a subject.
27. The method of any one of claims 24 to 26, wherein step (a) and/or step (c') is performed on a population of subjects and adjusted values are obtained for each of the subjects in said population to produce or obtain a set of adjusted values.
28. The method of any one of claims 24 to 27, wherein step (c) and/or step (e') comprises comparing a set of adjusted values from a population of subjects with said disease to a set of adjusted values from a population of subjects free from said disease.
29. The method of any one of claims 24 to 28, wherein step (d) and/or step (f) comprises a step of statistical analysis, preferably wherein the difference is a statistically significant difference.
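By way of illustration only, the statistical analysis of claim 29 might be sketched as follows. The claim does not name a particular test; this minimal sketch assumes an exact two-sided permutation test on the difference in mean adjusted values between a group of subjects with the disease and a group free from it. The function name and the adjusted values are hypothetical, and the exhaustive enumeration is practical only for small groups.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(cases, controls):
    """Exact two-sided permutation p-value for a difference in mean adjusted values."""
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    observed = abs(mean(cases) - mean(controls))
    hits = 0
    total = 0
    # Enumerate every way of relabelling n_cases of the pooled subjects as "cases".
    for chosen in combinations(range(len(pooled)), n_cases):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen_set]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties with the observed value.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical covariate-adjusted abundance values.
adjusted_cases = [2.1, 2.5, 2.8, 3.0]
adjusted_controls = [0.1, -0.2, 0.3, 0.0]

p_value = permutation_p_value(adjusted_cases, adjusted_controls)
```

With four subjects per group there are 70 possible relabellings, so the smallest attainable two-sided p-value is 2/70; larger groups would normally call for random sampling of relabellings or a parametric test instead.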
30. The method of any one of claims 24 to 29, wherein the difference between said adjusted values or sets of adjusted values is an increase in the abundance level of said biomarker in the subject(s) with said disease.
31. The method of any one of claims 24 to 30, wherein the difference between said adjusted values or sets of adjusted values is a decrease in the abundance level of said biomarker in the subject(s) with said disease.
32. The method of any one of claims 24 to 31, wherein the method is carried out for two or more candidate biomarkers.
33. The method of any one of claims 24 to 32, wherein the method, marker, sample, phenotypic factors or covariates, genetic variants, model and/or phenotype and genotype are as defined in any one of claims 6 to 23.
PCT/EP2015/063698 2014-06-19 2015-06-18 Determination and analysis of biomarkers in clinical samples WO2015193427A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB1410956.5 2014-06-19
GBGB1410956.5A GB201410956D0 (en) 2014-06-19 2014-06-19 Determination and analysis of biomarkers in clinical samples
GB201414913A GB201414913D0 (en) 2014-08-21 2014-08-21 Determination and analysis of biomarkers in clinical samples
GB1414913.2 2014-08-21

Publications (1)

Publication Number Publication Date
WO2015193427A1 true WO2015193427A1 (en) 2015-12-23

Family

ID=53483807

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2015/063698 WO2015193427A1 (en) 2014-06-19 2015-06-18 Determination and analysis of biomarkers in clinical samples

Country Status (1)

Country Link
WO (1) WO2015193427A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011143574A2 (en) * 2010-05-14 2011-11-17 The Trustees Of The University Of Pennsylvania Plasma biomarkers for diagnosis of alzheimer's disease

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NARIMATSU ET AL: "Lewis and secretor gene dosages affect CA19-9 and DU-PAN-2 serum levels in normal individuals and colorectal cancer patients.", CANCER RESEARCH, vol. 58, no. 3, 1 February 1998 (1998-02-01), pages 512 - 518, XP055082070, ISSN: 0008-5472 *
O'CONNOR M F ET AL: "To assess, to control, to exclude: Effects of biobehavioral factors on circulating inflammatory markers", BRAIN, BEHAVIOR AND IMMUNITY, ACADEMIC PRESS, SAN DIEGO, CA, US, vol. 23, no. 7, 1 October 2009 (2009-10-01), pages 887 - 897, XP026626582, ISSN: 0889-1591, [retrieved on 20090421], DOI: 10.1016/J.BBI.2009.04.005 *
PARISI FABIO ET AL: "Benefits of biomarker selection and clinico-pathological covariate inclusion in breast cancer prognostic models", BREAST CANCER RESEARCH, CURRENT SCIENCE, LONDON, GB, vol. 12, no. 5, 1 September 2010 (2010-09-01), pages R66, XP021085378, ISSN: 1465-5411, DOI: 10.1186/BCR2633 *
STEFAN ENROTH ET AL: "Strong effects of genetic and lifestyle factors on biomarker variation and use of personalized cutoffs", NATURE COMMUNICATIONS, vol. 5, 22 August 2014 (2014-08-22), pages 4684, XP055209636, DOI: 10.1038/ncomms5684 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9976494B2 (en) 2015-03-27 2018-05-22 Dr. Ing. H.C.F. Porsche Aktiengesellschaft Internal combustion engine
CN110024036A (en) * 2016-11-28 2019-07-16 皇家飞利浦有限公司 The analysis of antibiotics sensitivity is predicted
CN110024036B (en) * 2016-11-28 2023-06-30 皇家飞利浦有限公司 Analytical prediction of antibiotic susceptibility
CN110325106A (en) * 2017-02-20 2019-10-11 加利福尼亚大学董事会 The determination of serology of asymptomatic cerebral ischemia
US11085089B2 (en) 2019-03-01 2021-08-10 Mercy Bioanalytics, Inc. Systems, compositions, and methods for target entity detection
AU2022202798A1 (en) * 2021-05-26 2022-12-15 Genieus Genomics Pty Ltd Processing sequencing data relating to amyotrophic lateral sclerosis
CN113345525A (en) * 2021-06-03 2021-09-03 谱天(天津)生物科技有限公司 Analysis method for reducing influence of covariates on detection result in high-throughput detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15731023

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15731023

Country of ref document: EP

Kind code of ref document: A1