WO2023084486A1 - Generation of epigenetic age information - Google Patents

Generation of epigenetic age information

Info

Publication number
WO2023084486A1
Authority
WO
WIPO (PCT)
Prior art keywords
methylation
loci
cpg sites
sites
dna
Application number
PCT/IB2022/060944
Other languages
English (en)
Inventor
Sandra Ann R. STEYAERT
Geert Trooskens
Wim Maria R. VAN CRIEKINGE
Adriaan VERHELLE
Johan Irma H. VANDERSMISSEN
Original Assignee
H42, Inc.
Priority claimed from US17/525,552 (US20230154560A1)
Priority claimed from US17/831,427 (US11781175B1)
Application filed by H42, Inc.
Publication of WO2023084486A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/686: Polymerase chain reaction [PCR]
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/154: Methylation markers

Definitions

  • the technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
  • Genomics in the broad sense, also referred to as functional genomics, aims to characterize the function of every genomic element of an organism by using genome-scale assays such as genome sequencing, transcriptome profiling and proteomics. Genomics arose as a data driven science, and operates by discovering novel properties from explorations of genome-scale data rather than by testing preconceived models and hypotheses. Applications of genomics include finding associations between genotype and phenotype, discovering biomarkers for patient stratification, predicting the function of genes, and charting biochemically active genomic regions such as transcriptional enhancers.
  • Genomics data are too large and too complex to be mined solely by visual investigation of pairwise correlations. Instead, analytical tools are required to support the discovery of unanticipated relationships, to derive novel hypotheses and models and to make predictions.
  • machine learning algorithms are designed to automatically detect patterns in data.
  • machine learning algorithms are suited to data-driven sciences and, in particular, to genomics.
  • the performance of machine learning algorithms can strongly depend on how the data are represented, that is, on how each variable (also called a feature) is computed. For instance, to classify a tumor as malignant or benign from a fluorescent microscopy image, a preprocessing algorithm could detect cells, identify the cell type, and generate a list of cell counts for each cell type.
  • Human health and age can be measured in a variety of different ways.
  • a human's chronological age, that is, the length of time that an individual has been alive, is one measure of a human's health and age.
  • Another form of measuring a human's age is a subjective biological age that is used to account for a shortfall between a population average life expectancy and the perceived life expectancy of an individual of the same age.
  • a human's environment or behavior can cause their body to biologically age at an accelerated rate. It has been difficult in the past to estimate an individual's biological age with accuracy. Additionally, when an individual's biological age can be estimated, the process to do so is typically accompanied by a high financial cost.
  • the method can be extended using a set of CpG site methylation observations to generate the epigenetic age prediction.
  • Methylation observations can be generated using an array chip or using polymerase chain reaction (PCR)-based methods. Selection of a limited number of CpG sites facilitates use of faster, less expensive PCR-based methods.
  • the PCR-based methods can begin at home, using a sample extraction test kit, and delivering the sample extracted to a processing facility.
  • the method further includes extracting DNA from the sample and processing the DNA to receive processed DNA.
  • the method further includes amplifying a plurality of loci in the processed DNA to receive amplified DNA and processing the amplified DNA to receive a plurality of methylation values for one or more CpG sites in the plurality of loci.
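  • As a minimal sketch of how a set of methylation values might be turned into an age prediction, the Python below assumes a simple linear form: an intercept plus a weighted sum of per-site methylation values. The linear form, the function name `predict_epigenetic_age`, and every coefficient and methylation value are illustrative assumptions; only the CpG identifiers are taken from this document.

```python
def predict_epigenetic_age(methylation, coefficients, intercept):
    """Return intercept + sum(coefficient * methylation) over the model's CpG sites.

    `methylation` maps CpG identifiers to methylation values in [0, 1].
    `coefficients` maps the same identifiers to per-site weights.
    """
    missing = set(coefficients) - set(methylation)
    if missing:
        raise ValueError(f"missing methylation values for: {sorted(missing)}")
    return intercept + sum(coefficients[site] * methylation[site] for site in coefficients)


# Hypothetical coefficients (positive and negative), NOT values from the patent figures.
coefficients = {"cg27330757": 30.0, "cg00530720": -20.0, "cg06419846": 10.0}

# Hypothetical methylation values for one sample.
sample = {"cg27330757": 0.5, "cg00530720": 0.25, "cg06419846": 0.75}

age = predict_epigenetic_age(sample, coefficients, intercept=40.0)
print(age)
```

  • Sites with positive weights raise the predicted age as methylation increases, and sites with negative weights lower it.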
  • FIG. 1 illustrates a diagram showing an example portion of DNA.
  • FIG. 2 illustrates a molecular structure diagram showing an example methylated cytosine molecule.
  • FIG. 3 illustrates a diagram showing an example portion of methylated DNA.
  • FIGS. 4A-4N illustrate genomic diagrams of chromosomes.
  • FIG. 5 illustrates a diagram showing an example array chip used to detect methylation.
  • FIG. 6 illustrates a diagram showing example beads that respond to methylated and unmethylated CpG sites.
  • FIG. 7 illustrates a flow diagram showing an example operation of training a model.
  • FIGS. 8-9 illustrate flow diagrams showing example operations of predicting an epigenetic age.
  • FIG. 10 illustrates a block diagram showing an example computing system.
  • FIG. 11 illustrates a flow diagram showing an example operation of processing a sample to provide an epigenetic age to a user.
  • FIGS. 12A-12H illustrate diagrams showing example amplifications of portions of DNA.
  • FIG. 13 illustrates a flow diagram showing an example operation of acquiring a sample from a user.
  • FIG. 14 illustrates a flow diagram showing an example operation of training a model.
  • FIGS. 15-16 report coefficients associated with the 186 CpG sites listed in FIG. 9.
  • FIG. 15 lists the CpG sites that have positive coefficients, ordered by magnitude.
  • FIG. 16 lists the CpG sites that have negative coefficients, also ordered by magnitude.
  • FIG. 17 illustrates a distribution of methylation observations for cg01748572 from individuals of varying ages. With aging, methylation of this site decreases.
  • FIG. 18 is a scatter plot produced using an epigenetic clock having the 42 CpG site observations listed in FIG. 8, showing subjects’ calendar ages versus their calculated epigenetic ages.
  • FIG. 19 reports calculated standard deviations (sigma values) associated with the 186 CpG sites listed in FIG. 9.
  • Epigenetics is the study of changes in gene expression that are not the result of changes in the DNA sequence itself.
  • Some examples of processes that are in the field of epigenetics include acetylation, phosphorylation, ubiquitylation, and methylation.
  • Methylation specifically, as we have found, can correlate with a number of different diseases and conditions, the health of an individual, and the biological age of the individual. Methylation can occur at several million different places in the human genome. Methylation can also differ from tissue to tissue and even cell to cell in the human body.
  • a machine learning model can take the estimated cell counts, which are examples of handcrafted features, as input features to classify a tumor.
  • a central issue is that classification performance depends heavily on the quality and the relevance of these features. For example, relevant visual features such as cell morphology, distances between cells or localization within an organ are not captured in cell counts, and this incomplete representation of the data may reduce classification accuracy.
  • Deep learning, a subdiscipline of machine learning, addresses this issue by embedding the computation of features into the machine learning model itself to yield end-to-end models.
  • This outcome has been realized through the development of deep neural networks, machine learning models that comprise successive elementary operations, which compute increasingly more complex features by taking the results of preceding operations as input.
  • Deep neural networks are able to improve prediction accuracy by discovering relevant features of high complexity, such as the cell morphology and spatial organization of cells in the above example.
  • the construction and training of deep neural networks have been enabled by the explosion of data, algorithmic advances, and substantial increases in computational capacity, particularly through the use of graphical processing units (GPUs).
  • the goal of supervised learning is to obtain a model that takes features as input and returns a prediction for a so-called target variable.
  • An example of a supervised learning problem is one that predicts whether an intron is spliced out or not (the target) given features on the RNA such as the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint or intron length.
  • Training a machine learning model refers to learning its parameters, which commonly involves minimizing a loss function on training data with the aim of making accurate predictions on unseen data.
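  • To make the idea of learning parameters by minimizing a loss concrete, the sketch below fits a one-feature linear model by batch gradient descent on mean squared error. The function name `fit_linear`, the learning rate, the epoch count, and the toy data are assumptions chosen for illustration, not part of the disclosed method.

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3))
```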
  • the input data can be represented as a table with multiple columns, or features, each of which contains numerical or categorical data that are potentially useful for making predictions.
  • Some input data are naturally represented as features in a table (such as temperature or time), whereas other input data need to be first transformed (such as deoxyribonucleic acid (DNA) sequence into k-mer counts) using a process called feature extraction to fit a tabular representation.
  • the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint and the intron length can be preprocessed features collected in a tabular format.
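  • The feature extraction step mentioned above (DNA sequence into k-mer counts) can be sketched as follows. The function name `kmer_counts` and the fixed column ordering are illustrative choices; the sketch counts overlapping k-mers over the alphabet A, C, G, T and ignores any other characters.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers and return them as one fixed-order table row."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Emit every possible k-mer in a fixed order so that all sequences
    # map onto the same table columns, including k-mers with a count of 0.
    columns = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts.get(kmer, 0) for kmer in columns}


row = kmer_counts("ACGCGT", k=2)
print(row["CG"], len(row))
```

  • Each sequence becomes one row with 4^k columns, which fits the tabular representation expected by the models discussed here.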
  • Tabular data are standard for a wide range of supervised machine learning models, ranging from simple linear models, such as logistic regression, to more flexible nonlinear models, such as neural networks and many others.
  • Logistic regression is a binary classifier, that is, a supervised learning model that predicts a binary target variable. Specifically, logistic regression predicts the probability of the positive class by computing a weighted sum of the input features mapped to the [0, 1] interval using the sigmoid function, a type of activation function. The parameters of logistic regression, or other linear classifiers that use different activation functions, are the weights in the weighted sum. Linear classifiers fail when the classes, for instance, that of an intron spliced out or not, cannot be well discriminated with a weighted sum of input features. To improve predictive performance, new input features can be manually added by transforming or combining existing features in new ways, for example, by taking powers or pairwise products.
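  • The prediction step of logistic regression described above can be sketched in a few lines. The feature names, weights, and bias are hypothetical, unfitted values used only to show the weighted sum and the sigmoid mapping.

```python
import math


def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias=0.0):
    """Probability of the positive class: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical intron features:
# [has_canonical_splice_site, branchpoint_offset_kb, intron_length_kb]
example_features = [1.0, 0.03, 2.5]
example_weights = [2.0, -1.0, -0.2]  # illustrative, not fitted to data

p_spliced = predict_proba(example_features, example_weights, bias=-0.5)
print(round(p_spliced, 3))
```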
  • Neural networks use hidden layers to learn these nonlinear feature transformations automatically.
  • Each hidden layer can be thought of as multiple linear models with their output transformed by a nonlinear activation function, such as the sigmoid function or the more popular rectified-linear unit (ReLU).
  • Deep neural networks use many hidden layers, and a layer is said to be fully-connected when each neuron receives inputs from all neurons of the preceding layer.
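  • A fully-connected hidden layer as described above can be sketched as a set of linear models followed by an activation function. The helper names `relu` and `dense` and all weights are assumptions for illustration; a practical network would learn the weights and use a numerical library.

```python
def relu(x):
    """Rectified-linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation=relu):
    """One fully-connected layer: every neuron receives every input.

    `weights` holds one row of input weights per neuron.
    """
    return [
        activation(b + sum(w * x for w, x in zip(row, inputs)))
        for row, b in zip(weights, biases)
    ]


# Two inputs -> hidden layer with two ReLU neurons -> one linear output neuron.
hidden = dense([1.0, -2.0], weights=[[1.0, 1.0], [0.5, -0.5]], biases=[0.0, 0.0])
output = dense(hidden, weights=[[1.0, 2.0]], biases=[0.1], activation=lambda z: z)
print(hidden, output)
```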
  • Neural networks are commonly trained using stochastic gradient descent, an algorithm suited to training models on very large data sets. Implementation of neural networks using modern deep learning frameworks enables rapid prototyping with different architectures and data sets.
  • Fully-connected neural networks can be used for a number of genomics applications, which include predicting epigenetic age from methylation of CpG sites; predicting the percentage of exons spliced in for a given sequence from sequence features such as the presence of binding motifs of splice factors or sequence conservation; prioritizing potential disease-causing genetic variants; and predicting cis-regulatory elements in a given genomic region using features such as chromatin marks, gene expression and evolutionary conservation.
  • a convolutional layer is a special form of fully-connected layer in which the same fully-connected layer is applied locally, for example, in a 6 bp window, to all sequence positions. This approach can also be viewed as scanning the sequence using multiple position weight matrices (PWMs), for example, for transcription factors GATA1 and TAL1. By using the same model parameters across positions, the total number of parameters is drastically reduced, and the network is able to detect a motif at positions not seen during training.
  • Each convolutional layer scans the sequence with several filters by producing a scalar value at every position, which quantifies the match between the filter and the sequence.
  • a pooling operation is applied, which aggregates the activations in contiguous bins across the positional axis, commonly taking the maximal or average activation for each channel. Pooling reduces the effective sequence length and coarsens the signal.
  • the subsequent convolutional layer composes the output of the previous layer and is able to detect whether a GATA1 motif and a TAL1 motif were present at some distance range.
  • the output of the convolutional layers can be used as input to a fully-connected neural network to perform the final prediction task.
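  • The scan, activation, and pooling steps described above can be sketched for a single filter. The filter below is a hand-written stand-in that rewards the four-base string "GATA" (+1 per matching base, -1 per mismatch); it is an assumption for illustration, not a learned filter or a real transcription factor PWM.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, filt):
    """Slide one filter (width x 4 weights) along the sequence, one score per position."""
    width = len(filt)
    return [
        sum(filt[j][c] * encoded[i + j][c] for j in range(width) for c in range(4))
        for i in range(len(encoded) - width + 1)
    ]


def relu(values):
    """Apply the rectified-linear activation element-wise."""
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Take the maximum activation in contiguous, non-overlapping bins."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


# Hypothetical filter: +1 where the base matches "GATA", -1 otherwise.
motif = "GATA"
gata_filter = [[1.0 if b == m else -1.0 for b in BASES] for m in motif]

scores = conv1d(one_hot("CCGATACC"), gata_filter)
pooled = max_pool(relu(scores), size=2)
print(scores, pooled)
```

  • The highest score appears where the motif starts, and pooling shortens the output while keeping the strongest activation in each bin.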
  • Convolutional neural networks can predict various molecular phenotypes on the basis of DNA sequence alone. Applications include weighting methylation of CpG sites to calculate an epigenetic age, classifying transcription factor binding sites and predicting molecular phenotypes such as chromatin features, DNA contact maps, DNA methylation, gene expression, translation efficiency, RBP binding, and microRNA (miRNA) targets. In addition to predicting molecular phenotypes from the sequence, convolutional neural networks can be applied to more technical tasks traditionally addressed by handcrafted bioinformatics pipelines.
  • convolutional neural networks can predict the specificity of guide RNA, denoise ChIP-seq, enhance Hi-C data resolution, predict the laboratory of origin from DNA sequences and call genetic variants.
  • Convolutional neural networks have also been employed to model long-range dependencies in the genome. Although interacting regulatory elements may be distantly located on the unfolded linear DNA sequence, these elements are often proximal in the actual 3D chromatin conformation. Hence, modelling molecular phenotypes from the linear DNA sequence, albeit a crude approximation of the chromatin, can be improved by allowing for long-range dependencies and allowing the model to implicitly learn aspects of the 3D organization, such as promoter-enhancer looping.
  • Dilated convolutions can have a receptive field of up to 32 kb.
  • Dilated convolutions also allow splice sites to be predicted from sequence using a receptive field of 10 kb, thereby enabling the integration of genetic sequence across distances as long as typical human introns (See Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019)).
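  • To show why dilation enlarges the receptive field, the sketch below computes the receptive field of a stack of stride-1 convolutions from the kernel size and per-layer dilation rates. The layer configurations are hypothetical and do not reproduce the architectures cited above.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in positions) of stacked stride-1 dilated convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Four layers with kernel size 3: undilated versus exponentially dilated.
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(plain, dilated)
```

  • With the same number of layers and parameters, doubling the dilation at each layer makes the receptive field grow exponentially with depth instead of linearly.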
  • Recurrent neural networks are an alternative to convolutional neural networks for processing sequential data, such as DNA sequences or time series, that implement a different parameter-sharing scheme.
  • Recurrent neural networks apply the same operation to each sequence element. The operation takes as input the memory of the previous sequence element and the new input. It updates the memory and optionally emits an output, which is either passed on to subsequent layers or is directly used as model predictions.
  • recurrent neural networks are invariant to the position index in the processed sequence. For example, a recurrent neural network can detect an open reading frame in a DNA sequence regardless of the position in the sequence. This task requires the recognition of a certain series of inputs, such as the start codon followed by an in-frame stop codon.
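  • The recurrent pattern described above (one operation applied to each element, taking the previous memory and the new input) can be sketched with a hand-written update rule that detects an open reading frame. The rule in `step` is an assumption standing in for the learned operation of a real recurrent neural network; it reads codons and remembers whether a start codon has been seen.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def step(memory, codon):
    """One recurrent update: take (memory, input), return (new_memory, output)."""
    if memory and codon in STOP_CODONS:
        return False, True   # in-frame stop after a start codon: ORF found
    if codon == "ATG":
        return True, False   # start codon seen: remember it
    return memory, False     # otherwise carry the memory forward unchanged


def has_open_reading_frame(seq):
    """Apply the same update along each of the three reading frames."""
    seq = seq.upper()
    for frame in range(3):
        memory = False
        for i in range(frame, len(seq) - 2, 3):
            memory, found = step(memory, seq[i:i + 3])
            if found:
                return True
    return False


print(has_open_reading_frame("ATGAAATAG"))
print(has_open_reading_frame("CCATGAAATAGCC"))
print(has_open_reading_frame("ATGATAG"))
```

  • Because the same update is applied at every position, the start and stop pattern is detected wherever it occurs, while a stop codon that is out of frame is ignored.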
  • an advantage of recurrent neural networks over convolutional neural networks is that they are, in theory, able to carry over information through infinitely long sequences via memory. Furthermore, recurrent neural networks can naturally process sequences of widely varying length, such as mRNA sequences. However, convolutional neural networks combined with various tricks (such as dilated convolutions) can reach comparable or even better performances than recurrent neural networks on sequence-modelling tasks, such as audio synthesis and machine translation. Recurrent neural networks can aggregate the outputs of convolutional neural networks for predicting single-cell DNA methylation states, RBP binding, transcription factor binding, and DNA accessibility. Moreover, because recurrent neural networks apply a sequential operation, they cannot be easily parallelized and are hence much slower to compute than convolutional neural networks.
  • Each human has a unique genetic code, though a large portion of the human genetic code is common for all humans.
  • a human genetic code may include an outlier, called a genetic variant, that may be common among individuals of a relatively small group of the human population.
  • a particular human protein may comprise a specific sequence of amino acids, whereas a variant of that protein may differ by one amino acid in the otherwise same specific sequence.
  • each human has a unique state of methylation at CpG sites and in tissues throughout the body.
  • Genetic variants may be pathogenic, leading to diseases. Though most of such genetic variants have been depleted from genomes by natural selection, an ability to identify which genetic variants are likely to be pathogenic can help researchers focus on these genetic variants to gain an understanding of the corresponding diseases and their diagnostics, treatments, or cures. The clinical interpretation of millions of human genetic variants remains unclear. Some of the most frequent pathogenic variants are single nucleotide missense mutations that change the amino acid of a protein. However, not all missense mutations are pathogenic. Similarly, methylation at CpG sites can reduce, increase or not affect calculated epigenetic age and aging in general.
  • Models that can predict molecular phenotypes directly from biological sequences can be used as in silico perturbation tools to probe the associations between genetic variation and phenotypic variation and have emerged as new methods for quantitative trait loci identification and variant prioritization.
  • These approaches are of major importance given that the majority of variants identified by genome-wide association studies of complex phenotypes are non-coding, which makes it challenging to estimate their effects and contribution to phenotypes.
  • linkage disequilibrium results in blocks of variants being co-inherited, which creates difficulties in pinpointing individual causal variants.
  • sequence-based deep learning models that can be used as interrogation tools for assessing the impact of such variants offer a promising approach to find potential drivers of complex phenotypes.
  • One example includes predicting the effect of noncoding single-nucleotide variants and short insertions or deletions (indels) indirectly from the difference between two variants in terms of transcription factor binding, chromatin accessibility or gene expression predictions.
  • Another example includes predicting novel splice site creation from sequence or quantitative effects of genetic variants on splicing.
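  • The in silico perturbation idea described above, scoring a variant by the difference between model predictions for two sequences, can be sketched as follows. The `toy_predict` function is a hypothetical stand-in for a trained sequence model (it merely counts occurrences of the string "GATA"); `variant_effect` and all sequences are likewise illustrative.

```python
def variant_effect(predict, reference, position, alt_base):
    """Score a single-nucleotide variant as predict(alternate) - predict(reference)."""
    alternate = reference[:position] + alt_base + reference[position + 1:]
    return predict(alternate) - predict(reference)


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(seq.upper().count("GATA"))


reference = "CCGATACC"

# A substitution inside the motif removes it, so the predicted score drops.
disruptive = variant_effect(toy_predict, reference, position=3, alt_base="C")

# A substitution outside the motif leaves the prediction unchanged.
neutral = variant_effect(toy_predict, reference, position=0, alt_base="A")
print(disruptive, neutral)
```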
  • FIG. 1 is a diagram showing an example portion 100 of DNA.
  • Portion 100 includes strand 102 and strand 104.
  • Nucleotide base pairs 106 extend along the length of strands 102 and 104.
  • An area where a cytosine nucleotide 106-1 is followed by a guanine nucleotide 106-2 is called a CpG site 108.
  • CpG sites 108 are also defined in that only one phosphate group 110 separates the cytosine and guanine nucleotide base pairs 106.
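  • As a small illustration of the definition above, the sketch below locates CpG sites in a DNA string by finding each cytosine immediately followed by a guanine. The function name `find_cpg_sites`, the 0-based indexing, and the example sequence are assumptions for illustration.

```python
def find_cpg_sites(sequence):
    """Return the 0-based index of each cytosine that is directly followed by a guanine."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


sites = find_cpg_sites("ACGTTCGCGA")
print(sites)
```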
  • FIG. 2 is a molecular structure diagram showing an example 5-Methylcytosine molecule 200.
  • 5-Methylcytosine is an example of a methylated form of cytosine.
  • Molecule 200 includes a cytosine portion 202 and a methyl group portion 204. As shown, the methyl group portion 204 is attached to the 5th atom in the 6-atom ring, counting counterclockwise from the NH-bonded nitrogen at the six o'clock position of the cytosine portion 202.
  • FIG. 3 is a diagram showing an example portion 300 of methylated DNA.
  • Portion 300 includes two CpG sites 308-1 and 308-2 (collectively referred to as CpG sites 308).
  • the first CpG site 308-1 has two methyl groups 310 attached to the two cytosine molecules. This configuration of CpG site 308-1 is called being fully methylated.
  • the second CpG site 308-2 has a single methyl group 310 attached to one of the two cytosine molecules. This configuration of CpG site 308-2 is called being hemi-methylated. When there are no methyl groups attached to either cytosine, then the CpG site is considered to be unmethylated.
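  • The three configurations described above can be expressed as a simple lookup on how many of the two cytosines at a CpG site carry a methyl group. The function name and boolean inputs are assumptions for illustration.

```python
def cpg_methylation_state(first_cytosine_methylated, second_cytosine_methylated):
    """Classify a CpG site by the number of methylated cytosines across both strands."""
    count = int(bool(first_cytosine_methylated)) + int(bool(second_cytosine_methylated))
    return {2: "fully methylated", 1: "hemi-methylated", 0: "unmethylated"}[count]


print(cpg_methylation_state(True, True))
print(cpg_methylation_state(True, False))
print(cpg_methylation_state(False, False))
```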
  • FIGS. 4A-4N illustrate genomic diagrams of chromosomes.
  • Epigenetics, by definition, can alter the way in which a gene behaves based on non-sequence-changing factors like methylation. If key aging gene behaviors are modified by methylation, then the methylation status of CpG sites in these genes may provide insight into the biological age of an individual.
  • FIG. 4A illustrates a genomic diagram 402 of chromosome 11.
  • FGF19 Fibroblast growth Factor 19
  • FGF family members possess broad mitogenic and cell survival activities and are involved in a variety of biological processes, including embryonic development, cell growth, morphogenesis, tissue repair, tumor growth and invasion.
  • the FGF19 protein is produced in the gut where it functions as a hormone, regulating bile acid synthesis, with effects on glucose, cholesterol, and lipid metabolism. Reduced synthesis, and reduced blood levels, have been linked to chronic bile acid diarrhea as well as certain metabolic diseases.
  • FGF19 may be used in treatment of metabolic disease.
  • FGF19 plays a role in maintaining health and metabolic homeostasis.
  • FGF19 can have a hypertrophic effect on skeletal muscle. FGF19, therefore, appears to be a therapeutic target to limit aging-associated muscle loss and other diseases characterized by muscle atrophy (obesity, cancer, kidney failure).
  • the CpG site, cg27330757, is a probe mapping to the protein-coding Fibroblast Growth Factor 19 (FGF19) gene.
  • ADP-ribosylation is a (reversible) post-translational protein modification controlling major cellular and biological processes, including DNA damage repair, cell proliferation and differentiation, metabolism, stress, and immune responses.
  • MACROD1 appears to primarily be a mitochondrial protein and is highly expressed in skeletal muscle (a tissue with high mitochondrial content).
  • DGKZ Diacylglycerol Kinase Zeta
  • the latter product activates mammalian target of rapamycin complex 1 or mechanistic target of rapamycin complex 1 (mTORC1).
  • the overall effect of mTORC1 activation is upregulation of anabolic pathways. Downregulation of mTORC1 has been shown to drastically increase lifespan.
  • the CpG site, cg00530720, is a probe mapping to the promoter of the DGKZ gene.
  • CD248 Molecule (CD248) gene.
  • the CpG site, cg06419846, is a probe mapping to the CD248 gene.
  • FIG. 4B illustrates a genomic diagram 406 of chromosome 15.
  • WHAMMP3 WAS Protein Homolog Associated with Actin, Golgi Membranes and Microtubules Pseudogene 3
  • This pseudogene has been associated with Prader-Willi syndrome, a severe developmental disorder. This syndrome is caused by an epigenetic defect on chromosome 15, more specifically the absence of paternally expressed imprinted genes at 15q11.2-q13, through paternal deletions of this region, maternal uniparental disomy of chromosome 15, or an imprinting defect. Multiple imprinted genes in this region contribute to the complete phenotype of Prader-Willi.
  • the CpG site, cg04777312, is a probe mapping to the pseudogene WHAMMP3.
  • At location 424 is the PML gene.
  • the phosphoprotein coded by this gene localizes to nuclear bodies where it functions as a transcription factor and tumor suppressor. Expression is cell-cycle related and it regulates the p53 response to oncogenic signals.
  • the gene is often involved in the PML-RARA translocation between chromosomes 15 and 17, a key event in acute promyelocytic leukemia (APL).
  • PML-nuclear body (PML-NBs) interaction is still under further investigation.
  • Current consensus is that PML-NBs are structures which are involved in processing cell damage and repairing DNA double-strand breaks. Interestingly, these PML-NBs have been shown to decrease with age, and their stress response also declines with age.
  • the latter can be in a p53 dependent or independent way.
  • PML has also been implicated in cellular senescence, particularly its induction, and acts as a modulator of Werner syndrome, a type of progeria.
  • the CpG site, cg05697231, is a probe mapping to the south shore of a CpG island in the PML gene.
  • ADAM Metallopeptidase with Thrombospondin Type 1 Motif 17 (ADAMTS17) gene.
  • the CpG site, cg07394446, is a probe mapping to ADAMTS17.
  • SMAD6 SMAD Family Member 6
  • FIG. 4C illustrates a genomic diagram 410 of chromosome 2.
  • BIN1 Bridging Integrator 1
  • the BIN1 gene provides instructions for making a protein that is found in tissues throughout the body, where it interacts with a variety of other proteins.
  • the BIN1 protein is involved in endocytosis as well as apoptosis, inflammation, and calcium homeostasis.
  • the BIN1 protein may act as a tumor suppressor protein, preventing cells from growing and dividing too rapidly or in an uncontrolled way.
  • BIN1 is a risk factor for late-onset Alzheimer's disease. While the exact pathogenic mechanism of BIN1 is still unknown, both high-risk variants and DNA methylation have been suggested as mechanisms affecting BIN1 transcription and Alzheimer's disease risk.
  • the CpG site, cg27405400, is a probe mapping to the BIN1 gene.
  • PTPRN Protein Tyrosine Phosphatase Receptor Type N
  • This gene codes for a protein receptor involved in a multitude of processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation. More specifically, PTPRN plays a significant role in the signal transduction of multiple hormone pathways (neurotransmitters, insulin and pituitary hormones). PTPRN expression levels are also used as a prognostic tool for hepatocellular carcinoma (negative outcome correlation).
  • the CpG site, cg03545227, is a probe mapping in the protein tyrosine phosphatase receptor type N (PTPRN) gene.
  • FIG. 4D illustrates a genomic diagram 414 of chromosome 19.
  • KLF2 Kruppel-Like Factor 2
  • This gene encodes a transcription factor (zinc finger protein) found in many different cell types. Expression starts early in mammalian development and plays a role in processes including adipogenesis, embryonic erythropoiesis, epithelial integrity, inflammation, and T-cell viability. Its role in inflammation is of specific interest, as chronic systemic inflammation correlates highly with aging and age-associated disease.
  • the downstream effect of KLF2 expression is the downregulation of inflammation and reduction of pro-inflammatory activity of nuclear factor kappa B (NF-kB).
  • the CpG site, cg26842024, is a probe mapping to a CpG island in the Kruppel-Like Factor 2 (KLF2) gene.
  • MIDN Midnolin
  • FIG. 4E illustrates a genomic diagram 430 of chromosome 1.
  • MEGF6 Multiple EGF Like Domains 6
  • This gene plays a role in cell adhesion, motility and proliferation and is also involved in apoptotic cell phagocytosis. Mutations in this gene are associated with a predisposition to osteoporosis.
  • the CpG site, cg23686029, is a probe mapping in the MEGF6 gene.
  • FIG. 4F illustrates a genomic diagram 434 of chromosome 7.
  • KLF14 Kruppel-Like Factor 14
  • This gene encodes a member of the Kruppel-like family of transcription factors and shows maternal monoallelic expression in a wide variety of tissues.
  • the encoded protein functions as a transcriptional co-repressor and is induced by transforming growth factor-beta (TGF-beta) to repress TGF-beta receptor II gene expression.
  • TGF-beta transforming growth factor-beta
  • This gene exhibits imprinted expression from the maternal allele in embryonic and extra-embryonic tissues. Variations near this transcription factor are highly associated with coronary artery disease.
  • the CpG site, cg08097417, is a probe mapping to the KLF14 gene.
  • the CpG site, cg18691434, is a probe mapping to the STAG3/GPC2 gene.
  • the Huntingtin Interacting Protein 1 (HIP1) gene.
  • the CpG site, cg13702357, is a probe mapping to the HIP1 gene.
  • DPY19L2P4 DPY19L2 Pseudogene 4
  • the CpG site, cg22370005, is a probe mapping to the DPY19L2P4 gene.
  • FIG. 4G illustrates a genomic diagram 438 of chromosome 6.
  • PPP1R18 Protein Phosphatase 1 Regulatory Subunit 18
  • the protein that is encoded by this gene, phosphatase-1 (PP1), plays a role in glucose metabolism in the liver by controlling the activity of phosphorylase a, which breaks down glycogen to release glucose into the blood stream.
  • PP1 is also involved in diverse, essential cellular processes such as cell cycle progression, protein synthesis, muscle contraction, carbohydrate metabolism, transcription and neuronal signaling. In Alzheimer's disease, expression of PP1 is significantly reduced in both white and grey matter.
  • the CpG site, cg23197007, is a probe that maps to the S-shore of the PPP1R18 gene.
  • At location 452 is the TMEM181 gene.
  • ZBTB12 Zinc Finger and BTB Domain Containing 12
  • FIG. 4H illustrates a genomic diagram 442 of chromosome 17.
  • DCXR dicarbonyl and L-xylulose reductase
  • One of its functions is to perform a chemical reaction that converts a sugar called L-xylulose to a molecule called xylitol. This reaction is one step in a process by which the body can use sugars for energy.
  • L-xylulose reductase There are two versions of L-xylulose reductase in the body, known as the major isoform and the minor isoform.
  • the DCXR gene provides instructions for making the major isoform, which converts L-xylulose more efficiently than the minor isoform.
  • the DCXR protein is also one of several proteins that get attached to the surface of sperm cells as they mature. DCXR is involved in the interaction of a sperm cell with an egg cell during fertilization.
  • the CpG site, cg07073120, is a probe that maps to the promoter region of the DCXR gene.
  • FIG. 4I illustrates a genomic diagram 446 of chromosome 22.
  • At location 448 is the BCR Activator of RhoGEF and GTPase (BCR) gene.
  • the CpG site, cg04028010, is a probe mapping to the BCR gene.
  • FIG. 4J illustrates a genomic diagram 462 of chromosome 3.
  • MYLK Myosin Light Chain Kinase
  • This gene codes for a myosin light chain kinase (MLCK), which is a calcium/calmodulin dependent enzyme active in smooth muscle tissue (involuntary muscle). It phosphorylates myosin light chains to facilitate their interaction with actin to produce muscle contraction.
  • MLCK myosin light chain kinase
  • a second function of the MLCK protein is regulation of the epithelial tight junction. These are the gaps between the epithelial cells and their size is of major biological importance as this determines the selective permeability of the epithelial barrier.
  • telokin This small protein is identical to the C-terminus of the Myosin Light Chain Kinase (MYLK) protein and helps stabilize unphosphorylated myosin filaments. Abnormal expression of MYLK has been observed in many inflammatory diseases, such as pancreatitis, asthma, and inflammatory bowel disease.
  • TGM4 Transglutaminase 4
  • SFMBT1 Scm Like With Four Mbt Domains 1
  • NUDT16P Nudix Hydrolase 16
  • FIG. 4K illustrates a genomic diagram 482 of chromosome 12.
  • the CpG site, cg13663218, is a probe mapping to the LOC283392/TRHDE gene.
  • FIG. 4L illustrates a genomic diagram 496 of chromosome 8.
  • NEFM Neurofilament Medium Chain
  • the CpG site, cg07502389, is a probe mapping to the NEFM gene.
  • FIG. 4M illustrates a genomic diagram 518 of chromosome 5.
  • RNF180 Ring Finger Protein 180
  • the RNF180 gene codes for RING-Type E3 Ubiquitin Transferase.
  • RING-type E3s are implicated as tumor suppressors, oncogenes, and mediators of endocytosis, and play critical roles in complex multi-step processes such as DNA repair and activation of NF-kB, a master regulator of inflammation.
  • RING-type E3s and their substrates are implicated in a wide variety of human diseases ranging from viral infections to neurodegenerative disorders to cancer.
  • the CpG site, cg23008153, is a probe mapping to the N-shore of the RNF180 gene.
  • FIG. 4N illustrates a genomic diagram 474 of chromosome 18.
  • SMAD2 SMAD Family Member 2
  • the CpG site, cg17243289, is a probe mapping to the SMAD2 gene.
  • An array chip can evaluate 100,000 or 450,000 or even 850,000 CpG sites in parallel. The number of sites evaluated by one chip will increase over time. Evaluating so many sites requires on the order of 500-1000 ng DNA/sample. For blood, this typically involves a blood draw into a test tube.
  • the array chip process costs hundreds of dollars per sample and carries a significant equipment cost. Completion of a chip array processing run can take more than 100 hours of elapsed time.
  • the PCR-based methods, in contrast, evaluate a handful of sites, tens or hundreds of sites, at a cost approaching a couple of dollars per sample, typically from 5-10 ng DNA/sample.
  • FIGS. 5-9 relate to an array chip and FIGS. 11-15 relate to PCR-based methods.
  • FIG. 5 is a diagram 582 showing an example array chip 584 used to detect methylation.
  • Array chip 584 includes a silicon wafer 586 which is coated with a photo-resistant material 588.
  • Array chip 584 includes a plurality of microwells 590 disposed along its surface. Microwells 590 are not covered by the photo-resistant material 588 and extend depth-wise into the silicon wafer 586.
  • Each microwell 590 houses one or more beads 608.
  • Beads 608 are coated with multiple copies of an oligonucleotide probe targeting a specific location in the genome. As sample DNA fragments pass over beads 608, each probe binds to a complementary sequence in the sample DNA, stopping proximate the location of interest.
  • FIG. 6 is a diagram showing example beads 608 that respond to methylated and unmethylated CpG sites.
  • sodium bisulfite e.g., 610-1, 610-2, 610-3, and 610-4.
  • Sodium bisulfite converts cytosine into uracil, but leaves methylated cytosine (e.g., 5-methylcytosine) unaffected.
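  • The conversion rule above can be sketched in silico. This is a minimal illustration only: the function name `bisulfite_convert`, the 0-based position convention, and the example fragment are assumptions made for this sketch, not part of the disclosure.

```python
def bisulfite_convert(sequence: str, methylated_positions: set) -> str:
    """Return the sequence after simulated bisulfite treatment.

    Unmethylated cytosines become uracil (U); cytosines whose 0-based index
    is listed in methylated_positions are left unchanged.
    """
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("U")  # unmethylated C is deaminated to U
        else:
            converted.append(base)  # methylated C and all other bases are kept
    return "".join(converted)


# Hypothetical fragment with two CpG sites; only the C at index 1 is methylated.
print(bisulfite_convert("ACGTCGA", {1}))
```

  • After treatment, the two CpG sites differ at a single base, which is the difference the site-specific probes are designed to detect.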
  • the array chip 584 interrogates these chemically differentiated locations using two site-specific probes: one bead type (U) (beads 608-1, 608-3) presents probes that are designed to match an unmethylated site; the second bead type (M) (beads 608-2, 608-4) matches a methylated state.
  • Single-base extension of the probes incorporates a labeled plurality of dideoxynucleotides (ddNTP), which is subsequently stained with a fluorescent reagent.
  • the level of methylation for the interrogated location can be determined by calculating the ratio of the fluorescent signals from the methylated vs. unmethylated sites.
  • the locus of interest is unmethylated. It matches perfectly with unmethylated bead probe 608-1, enabling single-base extension and detection.
  • the unmethylated locus has a single-base mismatch to the methylated bead probe 608-2, inhibiting extension that results in a low signal on the array.
  • If the CpG locus of interest is methylated, the reverse occurs: the methylated bead 608-4 type will display a signal, and the unmethylated bead 608-3 type will show a low signal on the array. If the locus has an intermediate methylation state, both probes will match the target site and will be extended.
  • Methylation status of the CpG site is determined by a β-value calculation, which is the ratio of the fluorescent signals from the methylated beads to the total locus intensity.
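  • The ratio described above can be written as a one-line computation. In this sketch the total locus intensity is assumed to be the sum of the methylated and unmethylated bead signals; the `offset` parameter is an assumption added here (some pipelines add a small constant to stabilize low-intensity loci) and is not taken from the source.

```python
def methylation_ratio(methylated_signal: float, unmethylated_signal: float,
                      offset: float = 0.0) -> float:
    """Ratio of the methylated-bead signal to the total locus intensity.

    Returns a value between 0 (fully unmethylated) and 1 (fully methylated).
    offset is a hypothetical stabilizing constant; 0.0 disables it.
    """
    total = methylated_signal + unmethylated_signal + offset
    if total <= 0:
        raise ValueError("total locus intensity must be positive")
    return methylated_signal / total


# Hypothetical fluorescence intensities for one locus.
print(methylation_ratio(300.0, 100.0))  # 0.75
```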
  • the array chip containing the beads 608 can be read by an array scanning device, such as the iScan® System provided by Illumina, Inc. or the NextSeq® 550 System provided by Illumina.
  • Various tech companies provide alternative array chips that expose a sample to numerous probes.
  • FIGS. 5-6 disclose one system to detect methylation.
  • the present disclosure also explicitly contemplates using other methods to detect the methylation status of CpG sites.
  • FIG. 7 illustrates a flow diagram showing an example operation 700 of training an epigenetic age predicting model. Operation 700 begins at block 710 where a plurality of methylation profiles from a plurality of individuals are received.
  • a methylation profile contains methylation values for a number of CpG sites of the individual.
  • Methylation values can be in a number of different formats.
  • An example format is a decimal between zero and one (β-value), where zero is fully unmethylated and one is fully methylated.
  • the plurality of individuals’ ages are known. This known age can be used to generate the model.
  • the plurality of methylation values in each profile can be used as an input vector and the known age is the scalar output value.
  • the number of CpG sites in each of the plurality of methylation profiles is m.
  • the quantity m can be a variety of different numbers, as indicated by block 718.
  • m can correspond to a resolution of the methylation analysis method used on the plurality of individuals.
  • For instance, Illumina provides methylation microarrays that detect ~850,000 CpG sites (Infinium methylation EPIC array) and methylation sequencing for 3.3 million or 36 million CpG sites.
  • Operation 700 proceeds at block 720 where the plurality of methylation profiles received in block 710 are normalized or otherwise pre-processed.
  • the methylation profiles can be normalized based on age. For instance, a specific age range may be overrepresented in the plurality of profiles; the profiles from this age range may therefore need to be weighted less, or their chance of being sampled reduced. Alternatively, a specific age range may be underrepresented in the plurality of profiles; the profiles from this age range may therefore need to be weighted higher, or their chance of being sampled increased.
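  • One simple way to realize this kind of age balancing is inverse-frequency weighting by age bin. The sketch below is an assumption-laden illustration: the 10-year bin width, the function name `age_bin_weights`, and the example ages are all hypothetical, and the source does not prescribe a specific weighting scheme.

```python
from collections import Counter


def age_bin_weights(ages, bin_width=10):
    """Weight each profile inversely to how crowded its age bin is."""
    bins = [int(age // bin_width) for age in ages]
    counts = Counter(bins)
    n_profiles, n_bins = len(ages), len(counts)
    # Every occupied bin receives the same total weight: n_profiles / n_bins.
    return [n_profiles / (n_bins * counts[b]) for b in bins]


# Hypothetical cohort: three profiles in their twenties, one in their sixties.
weights = age_bin_weights([25, 27, 29, 61])
print(weights)
```

  • The resulting weights can be used either as per-sample weights when fitting a model or, after normalization, as sampling probabilities.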
  • the plurality of methylation profiles may be curated based on a quality metric. For example, methylation profiles that have a quality metric lower than a threshold are discarded or weighted less than methylation profiles of a higher quality metric.
  • the quality metric is indicative of the accuracy of the methylation process.
  • the quality metric is indicative of the quality of the sample used to generate the particular methylation profile.
  • the quality metric is indicative of the quality of the test used to generate the particular methylation profile.
  • the quality metric may be indicative of other factors relating to the particular methylation profile.
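  • The curation step can be sketched as a threshold filter that either discards or down-weights low-quality profiles. The dictionary schema (a `"quality"` key), the function name, and the numeric values are hypothetical choices for illustration.

```python
def curate_profiles(profiles, threshold, low_weight=None):
    """Discard, or down-weight, profiles whose quality metric is below threshold.

    profiles: list of dicts with a "quality" key (hypothetical schema).
    low_weight: if None, low-quality profiles are discarded; otherwise they are
    kept with this weight. Returns a list of (profile, weight) tuples.
    """
    curated = []
    for profile in profiles:
        if profile["quality"] >= threshold:
            curated.append((profile, 1.0))
        elif low_weight is not None:
            curated.append((profile, low_weight))
    return curated


# Hypothetical quality metrics for two profiles.
profiles = [{"id": "a", "quality": 0.99}, {"id": "b", "quality": 0.80}]
kept = curate_profiles(profiles, threshold=0.95)
reweighted = curate_profiles(profiles, threshold=0.95, low_weight=0.25)
```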
  • the plurality of methylation profiles can be normalized or pre-processed in other ways as well.
  • Operation 700 proceeds at block 730 where a feature selection operation is applied on the plurality of methylation profiles.
  • the feature selection is applied on the plurality of methylation profiles after they are pre-processed.
  • the feature selection operation is applied on the plurality of methylation profiles received in block 710.
  • the feature selection operation applied on the plurality of methylation profiles includes elastic net regression. Elastic net regression combines L1 penalties from lasso regression and L2 penalties from ridge regression to reduce the number of CpG sites.
  • As indicated by block 734, the feature selection applied reduces the number of applicable CpG sites from m to n sites. This reduction can balance dimensionality when there is a small number of methylation profiles, but each profile has a large number of CpG site methylation values. In some examples, m is greater than 100,000. In some examples, m is greater than 400,000. In some examples, m is greater than 800,000. In some examples, n is less than 200. In some examples, n is less than 100. In some examples, n is less than 50. As indicated by block 738, these numbers can vary from use case to use case. A more extensive discussion of ranges of selected n sites appears below in the context of FIG. 19.
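  • To illustrate how an elastic net penalty reduces m sites to n sites, the following is a minimal pure-Python coordinate-descent sketch, not the disclosed implementation. It assumes the columns of X (one per CpG site) and the ages y are already centered, so no intercept is fitted; the parameter names `alpha` and `l1_ratio` and the toy data are assumptions. Sites whose coefficient is driven exactly to zero by the L1 term are dropped.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.5, l1_ratio=0.5, n_iter=100):
    """Coordinate descent for the objective
    (1/2N)*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||_2^2.
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    residual = list(y)  # equals y - Xw because w starts at zero
    for _ in range(n_iter):
        for j in range(n_features):
            column = [row[j] for row in X]
            z = sum(v * v for v in column) / n_samples
            if z == 0.0:
                continue  # constant (all-zero after centering) site carries no signal
            # Correlation of site j with the residual that excludes site j.
            rho = sum(v * (r + v * w[j]) for v, r in zip(column, residual)) / n_samples
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            residual = [r - v * delta for v, r in zip(column, residual)]
            w[j] = new_w
    return w


def selected_sites(weights):
    """Indices of sites that keep a non-zero coefficient."""
    return [j for j, weight in enumerate(weights) if weight != 0.0]


# Toy centered data: 4 individuals, 3 sites; only site 0 tracks the target strongly.
X = [[-1, -1, 1], [-1, 1, -1], [1, -1, -1], [1, 1, 1]]
y = [-3.1, -2.9, 2.9, 3.1]
w = elastic_net(X, y)
print(selected_sites(w))
```

  • In this toy case m = 3 and the penalty leaves a single site; on real arrays the same mechanism is what takes m from hundreds of thousands down to a few hundred or fewer.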
  • Operation 700 proceeds at block 740 where a model is fit on the plurality of methylation profiles.
  • the model is fit on the plurality of methylation profiles, but only considers the CpG sites in the subset of n CpG sites.
  • the model is fit on all m CpG sites.
  • the model can include a linear regression model.
  • the model can include a random forest model.
  • the model can include a different type of model as well.
  • Operation 700 proceeds at block 750 where the model is generated.
  • the model is generated as one or more files that can be imported by other systems which can further train the model or use the model for predictions.
  • FIGS. 8-9 are flow diagrams showing example operations of predicting an epigenetic age.
  • Operation 800 begins at block 810 where the input values are received.
  • the shown CpG sites are in order of their feature importance to the model (e.g., in the case of a random forest model) or in order of the absolute value of their coefficient (e.g., in a linear model).
  • the input values are derived from a methylation analysis on a sample of human blood.
  • the inputs are input into the epigenetic age prediction model.
  • this model could include a linear regression model (e.g., where each input forms part of a linear expression), random forest model (e.g., where each input forms part of one or more decision trees), or some other type of model.
  • the model outputs an epigenetic age prediction. In some implementations, the model also outputs a confidence score.
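  • For the linear-regression case, the prediction is an intercept plus a weighted sum of the input methylation values. The sketch below uses two CpG identifiers that appear in this document, but the intercept, the coefficients, and the methylation values are invented purely for illustration and carry no biological meaning.

```python
def predict_epigenetic_age(methylation, coefficients, intercept):
    """Linear model: intercept + sum(coefficient * methylation value) per site."""
    missing = [site for site in coefficients if site not in methylation]
    if missing:
        raise KeyError(f"missing methylation values for: {missing}")
    return intercept + sum(coefficients[site] * methylation[site]
                           for site in coefficients)


# Hypothetical intercept and coefficients; NOT values from the disclosed model.
coefficients = {"cg03545227": 20.0, "cg26842024": -10.0}
# Hypothetical methylation values (0 = unmethylated, 1 = fully methylated).
profile = {"cg03545227": 0.5, "cg26842024": 0.25}
age = predict_epigenetic_age(profile, coefficients, intercept=30.0)
print(age)  # 37.5
```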
  • the 186 CpG sites identified in FIG. 9 represent a current inventory of CpG methylation sites that contribute significantly to epigenetic age, either increasing or decreasing it. It is a current inventory in the sense that it evaluates data available from the commonly used Illumina methylation array, which returns values for methylation sites corresponding to probes. Over time, Illumina has increased its methylation array from 100,000 to 850,000 probes. In addition, its equipment supports sequencing across millions of methylation sites. In the future, it is likely that more probes will be added to array chips. As the available sampled population broadens, the current inventory of significant sites may grow. The current inventory of 186 significant CpG sites may expand to a future count of 200, 225, 250, 275 or even 300 significant sites.
  • Evaluation of 300 or fewer CpG sites enables practical use of a PCR kit (i.e., design of a primer set corresponding to 300 or fewer CpG sites), as an alternative to using an array of methylation probes.
  • An increase from 10 to 15 PCR steps would enable evaluation of 100 methylation sites.
  • 20 PCR steps would enable evaluation of 150 CpG sites;
  • 25 PCR steps would allow evaluation of 200 CpG sites;
  • 30 PCR steps would enable evaluation of 300 CpG sites.
  • the technology disclosed can be applied to various ranges of CpG sites selected as significant.
  • the range can be 42 to 100, 42 to 200, 42 to 250, 42 to 275, or 42 to 300 CpG sites selected as significant.
  • Selection of significant sites leveraging rank-ordered lists in FIGS. 9-10 or based on coefficients, standard deviations or variances from FIGS. 15-16 and 19 is discussed below. Evaluation of additional sites reported from a probe array remains equivalent to these ranges, especially when the additional sites have relatively small coefficients or variance.
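  • Selecting a range of significant sites from a coefficient-ranked list can be sketched as a sort on absolute coefficient value followed by truncation. The site names and coefficient values below are placeholders, not data from FIGS. 9-10 or 15-16.

```python
def top_sites_by_coefficient(coefficients, n):
    """Return the n sites with the largest absolute coefficient, largest first."""
    ranked = sorted(coefficients, key=lambda site: abs(coefficients[site]),
                    reverse=True)
    return ranked[:n]


# Placeholder coefficients for three hypothetical sites.
example = {"site_a": 0.1, "site_b": -5.0, "site_c": 2.0}
print(top_sites_by_coefficient(example, 2))  # ['site_b', 'site_c']
```

  • Sites left out by the truncation are those with relatively small coefficients, which is why evaluating them in addition changes the prediction only slightly.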
  • Operation 900 begins at block 910 where the input values are received. As shown, there are 186 input values corresponding to the methylation values at 186 different CpG sites. These CpG sites are identified by their CpG cluster identifier number. In some implementations, only a subset of the shown CpG sites is used as inputs. In some implementations, the shown CpG sites are in order of their feature importance to the model (e.g., in the case of a random forest model) or in order of the absolute value of their coefficient (e.g., in a linear model). In one implementation, the input values are derived from a methylation analysis on a sample of human blood.
  • the inputs are input into the epigenetic age prediction model.
  • this model could include a linear regression model (e.g., where each input forms part of a linear expression), random forest model (e.g., where each input forms part of one or more decision trees), or some other type of model.
  • the model outputs an epigenetic age prediction. In some implementations, the model also outputs a confidence score.
  • PCR-based methods have advantages over chip array methods for evaluation of methylation at a limited number of CpG sites, such as 300 or fewer, as mentioned above.
  • PCR-based method protocols include sequencing-based methods, cloning-based methods, methylation/qPCR, HRM PCR, multiplex PCR (the method most extensively discussed herein), and so on.
  • the maximum number of CpG sites detectable using a PCR-based method is limited by the genetic composition of the nucleic acid sequence located upstream and downstream of each respective CpG site. More specifically, the maximum number of CpG sites detectable using a PCR-based method is limited by the thermodynamic and kinetic binding characteristics of the nucleic acid sequence located upstream and downstream of each respective CpG site.
  • a pair of primers must be designed, each comprising a respective nucleic acid sequence with complementary base pairing to exactly one strand within a double-stranded DNA sample, such that the pair of primers is said to “flank” the CpG site.
  • a first primer is designed to bind to a first strand within the double-stranded DNA sample at a location upstream of the CpG site and a second primer is designed to bind to a second strand within the double-stranded DNA sample at a location downstream of the CpG site.
  • the pair of designed primers guide the amplification of the DNA sample such that a limited segment of the DNA sample comprising the CpG site is amplified.
  • Complementary base pairing is a chemical reaction characterized by thermodynamic and kinetic characteristics; hence, the efficacy of base pairing between a primer and a strand of DNA (i.e., annealing of the primer) is strongly influenced by temperature and reaction incubation time.
  • the thermodynamic and kinetic characteristics are specific to the nucleic acid composition of the primer as well as the length of the primer. A particular pair of primers will yield the most efficacious base pairing at a particular annealing temperature and particular annealing time.
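  • As a rough illustration of how primer composition and length drive annealing behavior, the widely used Wallace rule of thumb estimates the melting temperature of a short oligonucleotide as 2 °C per A/T base plus 4 °C per G/C base. This rule is not part of the disclosure and is only a coarse approximation for short primers; real primer design relies on more detailed thermodynamic models.

```python
def wallace_tm(primer: str) -> int:
    """Estimate melting temperature (deg C) with the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if at + gc != len(primer):
        raise ValueError("primer must contain only A, C, G and T")
    return 2 * at + 4 * gc


# Hypothetical 18-nt primer sequences, for illustration only.
for primer in ("AGCTTGACCTGAAGTCGA", "GGCGCCTGACGGCTGCAG"):
    print(primer, wallace_tm(primer))
```

  • Two primers with the same length but different G/C content receive different estimates, which is one reason a single annealing temperature rarely suits every primer pair equally well.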
  • the most efficacious annealing conditions will proportionately result in the highest concentration PCR output or highest quality metric (e.g., absorbance ratios) when all other factors are controlled, as compared to the plurality of all tested annealing conditions for a particular DNA sample and a particular pair of primers.
  • Annealing conditions other than the most efficacious annealing conditions for a particular pair of primers will obtain some degree of decreased efficacy, as measured by the concentration and quality obtained by the PCR reaction.
  • a plurality of primer pairs must be designed wherein there is one primer pair designed per one CpG island. Inherent to the nonoverlapping sequences of the plurality of primer pairs, each particular primer pair within the plurality of primer pairs will produce a nonoverlapping distribution of efficacy measures across varying annealing conditions.
  • Three potential scenarios will be discussed to illustrate the process by which a maximum number of CpG sites detectable for a particular DNA sample using PCR-methods can be determined; however, these scenarios should be explicitly understood to be exemplary scenarios for the purpose of description and should not be considered limitations of the disclosed technology.
  • the first and second scenarios each describe a first primer pair used to detect a first CpG site and a second primer pair used to detect a second CpG site.
  • the first scenario comprises a first primer pair and a second primer pair which are determined to be compatible within a single multiplex PCR reaction by a pre-determined compatibility threshold.
  • the second scenario comprises a first primer pair and a second primer pair which are determined to be incompatible within a single multiplex PCR reaction by a pre-determined compatibility threshold.
  • the scenario further comprises a researcher determining a first plurality of efficacy metrics for the first primer pair, wherein a particular efficacy metric is specific to the experimentally determined efficacy metric of a PCR amplification reaction at a particular annealing temperature and a particular annealing time.
  • the researcher also determines a second plurality of efficacy metrics for the second primer pair, wherein a particular efficacy metric is specific to the experimentally determined efficacy metric of a PCR amplification reaction at a particular annealing temperature and a particular annealing time, such that the plurality of particular combinations of annealing temperature and annealing time tested for the first primer pair is identical to the plurality of particular combinations of annealing temperature and annealing time tested for the second primer pair, wherein all additional factors (e.g., number of cycles, quality and concentration of input DNA, volume of each reagent, et cetera) other than the particular primer pair are consistently controlled.
  • the researcher may select one of the two primer pairs as a reference for the compatibility threshold, such that the annealing conditions particular to the highest-efficacy PCR reaction using the reference primer pairs is selected for the multiplex PCR protocol to detect a plurality of CpG sites.
  • a pre-determined compatibility threshold may be defined such that the second primer pair is determined to be compatible relative to the first primer pair via a quantitative measure of dissimilarity between the first efficacy metric of the first primer pair at the selected annealing conditions and the second efficacy metric of the second primer pair at the selected annealing conditions.
  • This measure of dissimilarity may be a significance statistic (e.g., a Z-score or t-score with a pre-determined alpha value), an acceptable difference in the magnitude of measured efficacy (e.g., the difference between the final output concentration of the first primer pair and the second primer pair may not exceed 100 nanograms), an acceptable ratio or percentage of relative measured efficacy (e.g., the second efficacy metric value for the second primer pair must have a relative efficacy of at least 90% in reference to the first efficacy metric value for the first primer pair), and so on.
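  • Two of the threshold definitions mentioned above, a relative-efficacy ratio and an absolute difference in output, can be sketched as simple predicates. The function names and the example yields are hypothetical; the 90% and 100-nanogram defaults mirror the examples given in the text.

```python
def compatible_by_ratio(reference_efficacy, candidate_efficacy, min_relative=0.90):
    """True if the candidate reaches at least min_relative of the reference efficacy."""
    if reference_efficacy <= 0:
        raise ValueError("reference efficacy must be positive")
    return candidate_efficacy / reference_efficacy >= min_relative


def compatible_by_difference(first_output_ng, second_output_ng, max_difference_ng=100.0):
    """True if the two output concentrations differ by at most max_difference_ng."""
    return abs(first_output_ng - second_output_ng) <= max_difference_ng


# Hypothetical output yields (ng) at the selected annealing conditions.
print(compatible_by_ratio(200.0, 185.0))       # True  (92.5% of the reference)
print(compatible_by_ratio(200.0, 170.0))       # False (85% of the reference)
print(compatible_by_difference(450.0, 380.0))  # True  (70 ng apart)
```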
  • the researcher may employ a paired statistical method, such that the first efficacy metric and the second efficacy metric respective to the same annealing conditions are paired together to form one pair within a plurality of pairs for the two observed datasets.
  • Each respective pair for a particular set of annealing conditions can be quantified by variability (i.e., how different are the first efficacy metric and the second efficacy metric) as well as by overall magnitude (i.e., a summary metric summarizing the overall efficacy of the particular set of annealing conditions across the pair of efficacy metrics such as a mean value or median value, wherein the summary metric may be weighted or unweighted).
  • the most efficacious set of annealing conditions may be defined as the annealing conditions wherein the associated pair has the lowest variability or the highest magnitude, for example.
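  • The paired approach can be sketched as follows, assuming each primer pair's efficacy metrics are stored in a dictionary keyed by (annealing temperature in °C, annealing time in seconds). The data layout, function name, and all numeric values are assumptions; the unweighted mean and the absolute difference are just one possible choice of magnitude and variability measures.

```python
from statistics import mean


def best_annealing_conditions(first, second, criterion="magnitude"):
    """Pick the annealing conditions shared by both primer pairs that either
    maximize the mean efficacy ("magnitude") or minimize the absolute
    difference between the two efficacy metrics ("variability")."""
    shared = sorted(set(first) & set(second))
    if not shared:
        raise ValueError("no annealing conditions were tested for both primer pairs")
    if criterion == "magnitude":
        return max(shared, key=lambda c: mean([first[c], second[c]]))
    if criterion == "variability":
        return min(shared, key=lambda c: abs(first[c] - second[c]))
    raise ValueError("criterion must be 'magnitude' or 'variability'")


# Hypothetical efficacy metrics (e.g., output yield in ng) per condition.
first_pair = {(55, 30): 100.0, (58, 30): 180.0, (61, 30): 150.0}
second_pair = {(55, 30): 160.0, (58, 30): 120.0, (61, 30): 145.0}
print(best_annealing_conditions(first_pair, second_pair, "magnitude"))    # (58, 30)
print(best_annealing_conditions(first_pair, second_pair, "variability"))  # (61, 30)
```

  • The two criteria can disagree, as they do here, so the choice between them (or a weighted combination) is a design decision for the protocol.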
  • a user skilled in the art will recognize that there are a limitless number of methods for the computation of efficacy for a particular efficacy metric pair at a particular set of annealing conditions which can be implemented within the technology disclosed without departing from the spirit or scope of the disclosed technology.
  • An optimization protocol such as a multiple testing protocol may also be used to identify the most efficacious set of annealing conditions across both primer pairs, such as a grid search, random search, Monte Carlo testing, or permutation testing.
  • the multiple testing protocol may be augmented with a form of multiple testing correction, such as Bonferroni’s method or Holm’s method.
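  • Both corrections named above are short to state in code. The sketch below returns, for each p-value, whether its null hypothesis is rejected at family-wise level alpha; the example p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m, with m tests in total."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down method: compare the sorted p-values against
    alpha/m, alpha/(m-1), ... and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


# Hypothetical p-values from testing three sets of annealing conditions.
p_values = [0.01, 0.02, 0.04]
print(bonferroni(p_values))  # [True, False, False]
print(holm(p_values))        # [True, True, True]
```

  • Holm's method controls the same family-wise error rate as Bonferroni's method but is never less powerful, which the example illustrates.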
  • a pre-determined compatibility threshold may be defined such that a particular multiple-testing statistic above a particular threshold or within a particular threshold range corresponds to a set of annealing conditions compatible for the first primer pair and the second primer pair.
  • the compatibility threshold may be defined as the most significant multiple-testing statistic for both efficacy metric pair variability and magnitude or all efficacy metric pairs above a predetermined statistical significance when performing multiple testing for variability and magnitude.
  • the researcher selects a pre-determined compatibility test, as well as a pre-determined compatibility threshold definition for the compatibility test, such as a method described in the above implementations.
  • Given a pre-determined compatibility threshold for the selected compatibility test, the researcher selects a set of annealing conditions for their multiplex PCR reaction that is determined to be compatible with both the first primer pair and the second primer pair.
  • the selected set of annealing conditions will have a higher efficacy for one of the primer pairs over the other primer pair.
  • the researcher may further decide to compensate for this difference in efficacy by increasing the concentration of the lower-efficacy primer pair to be used within the multiplex PCR reaction.
  • the amplification of the CpG site guided by the lower-efficacy primer pair may overcome its efficacy limitation by increasing the probability of the primer pair coming into contact successfully with the DNA sample.
  • the researcher may choose to perform a series of multiplex PCR reactions at a plurality of primer pair concentrations to determine a particular primer pair concentration as informed by their observed data.
  • the researcher may choose to use one of the multiple guidelines or tools widely available in the literature to determine a particular primer pair concentration, of which a user skilled in the art will be aware or capable of easily obtaining within the literature.
  • the researcher again selects a pre-determined compatibility test, as well as a pre-determined compatibility threshold definition for the compatibility test, such as a method described in the above implementations.
  • Given a pre-determined compatibility threshold for the selected compatibility test, the researcher concludes that their observed data does not provide supporting evidence for a set of annealing conditions that is compatible for both the first primer pair and the second primer pair.
  • the general methodology (as well as many similar methodologies, which will be readily apparent to a user skilled in the art) described within the first scenario and the second scenario may be applied to any number of CpG site-specific primer pairs as a quantifiable and scalable strategy to determine the feasibility of aggregating the detection of a number of CpG sites into a single multiplex PCR reaction.
  • For n CpG sites, there exist C possible combinations of CpG sites of size k, wherein k can be any number between 1 and n.
  • Using the n choose k formula, referred to in shorthand as C(n,k), the number of combinations of size k can be computed as $C(n,k) = \frac{n!}{k!\,(n-k)!}$.
  • the total value of C for all possible values of k is equal to the sum of the number of combinations for each individual value of k (i.e., the final C value is equal to $\sum_{k=1}^{i} C(n,k)$, where i is equal to the total number of possible values for k).
  • a total of 10 CpG sites may be tested via a single possible 10-plex PCR reaction, 10 possible 9-plex PCR reactions, 45 possible 8-plex PCR reactions, 120 possible 7-plex PCR reactions, 210 possible 6-plex PCR reactions, 252 possible 5-plex PCR reactions, 210 possible 4-plex PCR reactions, 120 possible 3-plex PCR reactions, 45 possible 2-plex PCR reactions, and 10 possible 1-plex PCR reactions.
  • the total number of possible PCR reaction combinations in terms of number of sites to be amplified for 10 CpG sites is 1,023. With sufficient mathematical resources, it is possible to determine which of these reactions are feasible in terms of primer compatibility and primer efficiency. It is likely that a plurality of these reactions are feasible; thus, the technology disclosed may be implemented within a large number of PCR-based protocols to efficiently detect a plurality of CpG sites.
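  • The counting argument can be checked with the standard library's binomial coefficient. The function name `multiplex_combinations` is a hypothetical helper; the sketch tabulates C(n, k) for every plex size k from 1 to n and sums the counts, which for n sites equals 2^n - 1.

```python
from math import comb


def multiplex_combinations(n_sites):
    """Map each plex size k (1..n_sites) to the number of k-site combinations."""
    return {k: comb(n_sites, k) for k in range(1, n_sites + 1)}


counts = multiplex_combinations(10)
total = sum(counts.values())
print(counts[2], counts[5], counts[10])  # 45 252 1
print(total)                             # 1023
```

  • Because the total grows as 2^n - 1, exhaustively testing every combination quickly becomes impractical as n increases, which motivates the pairwise compatibility screening described earlier.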
  • the above-described implementations for computing compatibility of distinct primer pairs that flank distinct CpG sites are examples of implementations that may be used to determine the maximum number of CpG sites detectable within a particular DNA sample.
  • the researcher has created an N-plex PCR protocol (i.e., a multiplex PCR comprising N pairs of primers, wherein a particular pair of primers is leveraged to detect a specific CpG site) after determining that the N pairs of primers are compatible at the set of annealing conditions to be used within the researcher’s N-plex PCR protocol.
  • Despite the data the researcher has obtained indicating compatibility of each included primer pair at the implemented set of annealing conditions, the researcher observes that their protocol repeatedly generates a low DNA output yield that is not sufficient for their studies.
  • primer dimers are an interacting set of primers that bind to one another instead of binding to the DNA sample. Primers within a primer dimer are not able to guide amplification of the DNA sample, thereby reducing the overall DNA output produced by the PCR amplification process. As N increases, the risk of primer dimer formation also increases: an N-plex PCR primer set comprising 2N primers (i.e., each pair of primers comprises two individual primers, so 2N is the total number of individual primers) has C(2N, 2) possible primer dimer interactions.
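  • The growth in potential primer dimer interactions can be made concrete by enumerating every unordered pair of distinct primers in an N-plex primer set. The primer naming scheme below is hypothetical, and self-dimers (a primer binding another copy of itself) are deliberately not counted, matching the C(2N, 2) expression.

```python
from itertools import combinations
from math import comb


def primer_dimer_pairs(n_plex):
    """List every unordered pair of distinct primers in an N-plex primer set."""
    primers = []
    for site in range(1, n_plex + 1):
        # Hypothetical names: one forward and one reverse primer per CpG site.
        primers.extend([f"site{site}_fwd", f"site{site}_rev"])
    return list(combinations(primers, 2))


for n_plex in (2, 5, 10):
    pairs = primer_dimer_pairs(n_plex)
    # The enumeration agrees with the closed form C(2N, 2).
    assert len(pairs) == comb(2 * n_plex, 2)
    print(n_plex, len(pairs))
```

  • The count grows quadratically with N, so each additional CpG site adds more candidate interactions to screen than the one before it.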
  • many implementations of the technology disclosed comprise respective methodologies for the determination of a range of detectable CpG sites within a multiplex PCR reaction wherein a particular respective methodology further comprises a quantifiable, concrete number of steps to process one or more features related to the DNA sample, CpG sites within the DNA sample, and primer composition used in a particular multiplex reaction.
  • many implementations of the technology disclosed may be adapted to process any number of features specific to a particular DNA sample, one or more CpG sites, or target experimental conditions to compute a range of possible CpG sites detectable within a single multiplex PCR reaction without departing from the spirit or the scope of the technology disclosed.
  • FIG. 11 illustrates a flow diagram showing an example operation 1100 of processing a sample to provide the epigenetic age to a user.
  • Operation 1100 begins at block 1120 where a sample from the user is received.
  • the sample may be received by a delivery or shipment in which the user generates a sample and prepares it for transit.
  • the sample may be received in a similar manner to the method discussed below with respect to FIG. 13.
  • the user may directly deliver the generated sample to the site of DNA extraction, or to a facility that prepares it for preservation during long-term transit.
  • the sample may be a blood sample. However, it is expressly contemplated that other samples may be received and processed as well.
  • the operation proceeds at block 1130, where DNA is extracted from the sample.
  • the process of DNA extraction includes breaking, or lysing, the cells contained in the sample to release the DNA.
  • Cells may be lysed, for example, by using a lysis buffer or other solution suitable for breaking down cellular membranes while retaining DNA quality.
  • the DNA is solubilized and separated from cellular debris. Solubilizing may include, for example, resuspending the lysed DNA in a buffer and/or soluble solution.
  • the cellular debris removed may include proteins, cellular membranes, lipids, or other cellular components that may hinder DNA purity. Separation of the DNA from cellular debris may be done by a variety of methods, including filtration, protein degradation, or any combination thereof.
  • DNA may be extracted manually. However, in other embodiments, a commercially available DNA extraction kit may be utilized as well.
  • Operation 1100 proceeds at block 1140, where quality control is performed on the extracted DNA to receive quality-controlled DNA (QCDNA).
  • Quality control may, for example, include obtaining absorbance measurements of the extracted DNA by ultraviolet-visible (UV-Vis) spectrophotometry 1142. The absorbance measurements may be indicative of extracted DNA quality. Additionally, the measurements provided by UV-Vis spectrophotometry 1142 may be indicative of the concentration of DNA acquired in the extraction process of the sample.
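  • as a hedged illustration of how absorbance readings can be turned into quality and concentration figures, the sketch below uses two common laboratory conventions that are assumptions here and are not stated in the disclosure: an absorbance of 1.0 at 260 nm corresponds to roughly 50 ng/µL of double-stranded DNA at a 1 cm path length, and the A260/A280 ratio is used as a purity indicator (values near 1.8 are typically read as clean DNA). The function names are hypothetical.

```python
DSDNA_NG_PER_UL_PER_A260 = 50.0  # assumed convention for dsDNA, 1 cm path length


def dsdna_concentration_ng_per_ul(a260: float, dilution_factor: float = 1.0) -> float:
    """Estimate dsDNA concentration from absorbance at 260 nm."""
    return a260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def purity_ratio(a260: float, a280: float) -> float:
    """A260/A280 ratio, commonly used as a protein-contamination indicator."""
    if a280 <= 0:
        raise ValueError("a280 must be positive")
    return a260 / a280


if __name__ == "__main__":
    print(dsdna_concentration_ng_per_ul(0.5))
    print(round(purity_ratio(0.9, 0.5), 2))
```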
  • quality control may include performing gel electrophoresis on the extracted DNA 1144 to provide an indication of DNA quality. For example, DNA may be observed as it travels down the gel to indicate that the DNA is of an expected size and/or quality.
  • quality control may include combining subsequent extracted DNA aliquots in order to achieve a desired concentration level. Additionally, quality control may include combining the extracted DNA of subsequent samples acquired by the user. In another embodiment, quality control may include diluting the extracted DNA in order to achieve a desired DNA concentration.
  • Operation 1100 proceeds at block 1150 with treating the extracted DNA with sodium bisulfite.
  • Sodium bisulfite converts cytosine into uracil in the extracted DNA sample, but leaves methylated cytosine (e.g., 5-methylcytosine) unaffected. This, in turn, allows for methylated cytosine to be differentiated from unmethylated cytosine in DNA analysis. In this way, the remaining methylated cytosine within the DNA segment of interest will remain detectable and observable, while unmethylated cytosine will not be in a chemical configuration to distort DNA analysis.
  • while sodium bisulfite is used in the present example, it is expressly contemplated that other means of treating the extracted DNA may be utilized as well (e.g., enzyme-based conversion or other methods).
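  • the effect of the conversion on a sequence can be sketched in silico. This is a minimal illustration for a single strand, assuming complete conversion; the function name `bisulfite_convert` and the example sequence are hypothetical. Unmethylated cytosine becomes uracil, which is read as thymine after amplification, while methylated cytosine is still read as cytosine.

```python
from typing import Iterable


def bisulfite_convert(seq: str, methylated_positions: Iterable[int] = ()) -> str:
    """Return the strand as read after conversion and amplification.

    Unmethylated C -> U (read as T); methylated C is left as C.
    Positions are 0-based indices into seq.
    """
    protected = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in protected:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


if __name__ == "__main__":
    print(bisulfite_convert("ACGTCCGA", methylated_positions=[1]))
    print(bisulfite_convert("ACGTCCGA"))
```

  • a cytosine that survives conversion in the read is therefore evidence of methylation at that position.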
  • Operation 1100 proceeds at block 1160, where a plurality of loci in the sodium bisulfite treated DNA is amplified.
  • a locus is defined as a specific position on the DNA where a particular gene or DNA sequence is located.
  • the plurality of loci may be positions including one or more CpG sites of interest.
  • DNA amplification involves generating multiple copies of the loci in one or more DNA segments of interest.
  • amplification of the plurality of loci in the sodium bisulfite-treated DNA may occur via polymerase chain reaction (PCR).
  • PCR is a technique involving a set of primers, which are short segments of DNA that define the specific DNA sequence of interest and prepare it for amplification, as described above in the context of primer design and primer efficiency.
  • a replication enzyme DNA polymerase
  • the one or more sets of primers serve as a marker for the start and end of a target sequence to be amplified.
  • the one or more sets of primers also serve as an adapter allowing for the attachment of DNA polymerase to the DNA strand, emphasizing the role of primer design processes within the PCR protocol.
  • primer design processes enable the accurate amplification of numerous loci using multiplex PCR. Multiplex PCR is employed in the technology disclosed to allow for the evaluation of a plurality of CpG sites, which will now be further described.
  • amplification of the plurality of loci in the sodium bisulfite treated DNA may also occur by multiplex PCR (MPCR). Whereas standard PCR typically involves the use of one pair of primers, MPCR involves the use of two or more primer pairs in the reaction. In this way, a plurality of loci in the DNA segment of interest may be amplified in a single reaction mixture.
  • the plurality of loci may include, for example, a plurality of CpG sites.
  • the plurality of CpG sites within the DNA segment may be amplified in the same reaction mixture using a plurality of primers corresponding to the CpG sites of interest.
  • multiple MPCR reactions may be utilized in a single reaction mixture, wherein multiple sites of interest may be amplified in multiple segments of DNA within the mixture.
  • different segments of DNA including a plurality of different CpG sites of interest may be amplified via MPCR in a single reaction mixture to generate amplified DNA corresponding to the plurality of CpG sites.
  • a plurality of segments may be amplified via MPCR in multiple subsequent reactions such that the sites of interest are amplified while the possibility of cross hybridization and/or mis-priming is minimized.
  • multiple DNA segments of interest, such as those including CpG sites, may be amplified in one or a few reactions, thus increasing efficiency and avoiding the need to run a new reaction per site.
  • a plurality of variables influence the range of loci (e.g., CpG sites) capable of being evaluated within a single MPCR reaction. It is to be understood that within the description herein, any description of a range of the minimum or maximum number of CpG sites to be evaluated via the use of MPCR reaction, as well as any other PCR-based method, abides generally by the guidelines and concepts described, or a similar process based on biochemistry-informed primer design processes.
  • the operation proceeds at block 1170, where the amplified DNA is sequenced to generate sequence data.
  • Sequencing is the process of determining the specific order, or sequence, of nucleotide bases within the one or more segments of amplified DNA. Sequencing may occur, for example, using a DNA sequencing system, in which the amplified DNA is processed to emit one or more signals indicative of the chain of nucleotides present within the sample.
  • the extracted DNA is treated with sodium bisulfite prior to amplification.
  • the signals emitted by the sequencing system indicative of cytosine correlate to methylated cytosine (e.g., 5-methylcytosine) rather than unmethylated.
  • the sequence data received may be indicative of the degree of methylation relative to said CpG sites.
  • Quality control is performed on the received sequence data.
  • quality control enables confirmation that no errors have occurred during amplification, such as cross hybridization and/or mis-priming.
  • Quality control may include, for example, a quality check, as indicated by block 1182.
  • the quality check may include ensuring that the one or more segments of amplified DNA are of proper length. Additionally, the quality check may include comparing the one or more segments of amplified DNA with a reference to ensure that the proper DNA segment was amplified, and no random sequencing occurred beyond the desired segment(s) of interest.
  • quality control may further include a confidence check, as indicated by block 1184.
  • the confidence check may include, for example, providing a confidence metric indicative of the error rate between the received sequencing data relative to the one or more amplified sites and a reference. Additionally, it is expressly contemplated that other forms of quality control can occur as well, as indicated by block 1186.
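  • the quality check and confidence check described above can be sketched as two small functions. This is an assumption-laden illustration, not the disclosed implementation: the function names, the optional `ignore` argument (used to skip positions such as CpG cytosines that legitimately vary), and the use of a simple per-base mismatch fraction as the confidence metric are all hypothetical.

```python
from typing import Iterable


def length_ok(read: str, expected_length: int, tolerance: int = 0) -> bool:
    """Quality check: is the amplified segment of the expected length?"""
    return abs(len(read) - expected_length) <= tolerance


def mismatch_rate(read: str, reference: str, ignore: Iterable[int] = ()) -> float:
    """Confidence metric: fraction of compared positions that differ from the reference."""
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    skip = set(ignore)
    compared = [i for i in range(len(reference)) if i not in skip]
    if not compared:
        raise ValueError("no positions left to compare")
    mismatches = sum(read[i] != reference[i] for i in compared)
    return mismatches / len(compared)


if __name__ == "__main__":
    print(length_ok("ATGTTTGA", expected_length=8))
    print(mismatch_rate("ATGTTTGA", "ATGTCTGA"))
    print(mismatch_rate("ATGTTTGA", "ATGTCTGA", ignore=[4]))
```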
  • sequence data is converted to values indicative of methylation at the CpG sites in the plurality of loci.
  • Sequence data may be converted to methylation values in a number of different formats. For example, one possible format is the conversion of sequence data indicative of methylation at CpG sites of interest to a decimal between zero and one (β-value), as indicated at block 1192, where zero is fully unmethylated and one is fully methylated. Additionally, it is expressly contemplated that the sequence data can be converted to other formats as well, as indicated by block 1194.
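  • one way to obtain such a value between zero and one from sequencing reads is to take the fraction of reads covering a CpG site that still show cytosine after bisulfite conversion. The sketch below assumes that read counts per site are already available; the function name `beta_value` is hypothetical.

```python
def beta_value(methylated_reads: int, unmethylated_reads: int) -> float:
    """Fraction of reads at a CpG site that report methylated cytosine.

    0.0 means fully unmethylated, 1.0 means fully methylated.
    """
    if methylated_reads < 0 or unmethylated_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = methylated_reads + unmethylated_reads
    if total == 0:
        raise ValueError("site has no coverage")
    return methylated_reads / total


if __name__ == "__main__":
    print(beta_value(30, 10))
```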
  • the PCR protocol utilized to amplify, evaluate, or detect one or more particular CpG sites may be further augmented through the introduction of numerous pre-processing or post-processing steps (e.g., cloning reactions, further chemical modifications to the DNA material such as other probes or small molecules, various sequencing protocols, and evaluation techniques such as restriction enzyme digest, PCR clean up, or purification reactions) as well as augmentation of the PCR amplification reaction itself such as the use of quantitative PCR (qPCR) via the use of fluorescent deoxynucleoside triphosphates (dNTPs), alternative polymerases (such as Taq polymerase versus Pfu polymerase), high resolution melt analysis, reverse transcriptase PCR, and an infinite number of further reagent adjustments to the buffer composition or concentration, addition of stability additives (such as dimethylsulfoxide for GC-rich sequences, bovine serum albumin, or detergents), touchdown (TD) protocols for variable melting temperatures, and so on.
  • certain loci amplified and evaluated by the PCR reaction may not contain a CpG site and may instead contain a portion of a housekeeping gene such as GAPDH or actin, or another locus useful as a relative, comparative, or methylation-associated locus to be amplified.
  • a plurality of PCR reactions are performed wherein one or more PCR reactions within the plurality of PCR reactions is an M-plex PCR amplification, and one or more PCR reactions within the plurality of PCR reactions is an N-plex PCR amplification, wherein M and N are both positive integers and M and N are not equal to each other.
  • the operation proceeds at block 1200, where an epigenetic age of the user is predicted based on the methylation values.
  • the epigenetic age may be predicted by, for example, inputting the methylation values based on the sequence data into an epigenetic age prediction model, such as model 1512 discussed below with respect to FIG. 15.
  • the methylation values that are input into the model to predict the epigenetic age of the user are based on sequencing data corresponding to the 42 sites indicated in FIG. 15.
  • the methylation values input, corresponding to the amplified CpG sites, are in order of their feature importance to the model or in order of the absolute value of their coefficient. Further, in other examples, fewer methylation values corresponding to fewer than 42 amplified CpG sites may be input as well.
  • the operation proceeds at block 1210, where the predicted epigenetic age is delivered to the user.
  • the predicted age can be delivered, for example, using a digital graphical user interface (GUI) 1212. Additionally, it is expressly contemplated that the predicted age may be delivered to a user by other means as well, as indicated by block 1214.
  • FIGS. 12A-12H illustrate diagrams showing example amplifications of one or more portions of DNA.
  • FIGS. 12A-12H recite similar features and like components are numbered accordingly.
  • DNA portion 1250 illustratively includes DNA strand 1251.
  • DNA strand 1251 comprises a plurality of nucleotide base pairs (not shown), which extend along the length of strand 1251.
  • a cytosine nucleotide followed by a guanine nucleotide e.g., a CpG site
  • strand 1251 may be a DNA segment of interest and have a length of nucleotides relative to the particular DNA segment.
  • DNA portion 1250 further includes a plurality of primer pairs 1254-1 and 1254-2 generally disposed over varying lengths of strand 1251. As illustrated in FIGS. 12A-12H, each primer 1254-1 has a consecutive primer 1254-2, which act as the start and stop marker for amplification. In this way, primers 1254-1 and 1254-2 act as a primer pair. As illustrated, the one or more primer pairs 1254-1 and 1254-2 define amplification site 1252. Amplification site 1252 includes a section of strand 1251 of interest to be amplified. For example, site 1252 may include one or more CpG sites of interest to be amplified.
  • Amplification site 1252 can be of varying lengths, as defined by the placement of primer pairs 1254-1 and 1254-2.
  • the amplification site 1252 shown in FIG. 12G is of a larger length than site 1252 shown in FIG. 12H relative to the placement of primer pairs 1254-1 (1254-2).
  • each strand 1251 can include a plurality of primer pairs 1254-1 and 1254-2, and thus a plurality of amplification sites 1252.
  • FIG. 12A includes 7 primer pairs 1254-1 and 1254-2, and therefore 7 amplification sites 1252.
  • FIG. 12B includes 7 primer pairs 1254-1 and 1254-2, and therefore 7 amplification sites 1252.
  • FIG. 12C includes 5 primer pairs 1254-1 and 1254-2, and therefore 5 amplification sites 1252.
  • FIG. 12D includes 5 primer pairs 1254-1 and 1254-2, and therefore 5 amplification sites 1252.
  • FIG. 12E includes 5 primer pairs 1254-1 and 1254-2, and therefore 5 amplification sites 1252.
  • FIG. 12F includes 5 primer pairs 1254-1 and 1254-2, and therefore 5 amplification sites 1252.
  • FIG. 12G includes 4 primer pairs 1254-1 and 1254-2, and therefore 4 amplification sites 1252.
  • FIG. 12H includes 4 primer pairs 1254-1 and 1254-2, and therefore 4 amplification sites 1252. Additionally, it is expressly contemplated that any of FIGS. 12A-12H could include a different number of primer pairs and therefore a different number of amplification sites as well.
  • FIGS. 12A-12H may be representative of multiple MPCR reactions to amplify a plurality of sites corresponding to a plurality of CpG sites. For example, as shown, 8 separate MPCRs may be completed to amplify a total of 42 sites.
  • the 42 sites may correspond to, for example, the 42 CpG sites set forth below with respect to FIG. 15. Additionally, it is expressly contemplated that fewer MPCR reactions may be completed to amplify the same 42 CpG sites, or a different number of CpG sites. For instance, a different number of primer pairs 1254-1 and 1254-2 may be added to a single MPCR to increase the number of amplification sites while reducing the number of subsequent MPCRs needed. In another example, fewer than the 42 listed CpG sites may be desired for epigenetic age determination, in which case a lower number of primer pairs 1254-1 and 1254-2 may be added relative to the number of CpG sites to be amplified.
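  • the total of 42 sites follows from the per-figure primer pair counts recited for FIGS. 12A-12H. The short check below simply sums those counts; the dictionary name is hypothetical.

```python
# Primer pairs (and therefore amplification sites) recited for each figure.
primer_pairs_per_reaction = {
    "12A": 7, "12B": 7,
    "12C": 5, "12D": 5, "12E": 5, "12F": 5,
    "12G": 4, "12H": 4,
}

number_of_reactions = len(primer_pairs_per_reaction)
total_sites = sum(primer_pairs_per_reaction.values())

if __name__ == "__main__":
    print(f"{number_of_reactions} MPCR reactions amplify {total_sites} sites")
```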
  • FIG. 13 illustrates a flow diagram showing an example operation 1300 of acquiring a sample from a user.
  • Operation 1300 begins at block 1310 where one or more test kits are delivered to the user.
  • a test kit provides the necessary tools and/or protocols for a user to obtain a sample to be processed.
  • the sample may be obtained in a number of different formats.
  • the sample may be a blood sample obtained by blood draw 1312.
  • Blood draw 1312 may include any device suitable for safely obtaining blood and retaining sample purity.
  • blood draw 1312 may include a non-invasive blood extraction device included in the test kit, in which a user operates the device to obtain the sample.
  • blood draw 1312 may include a vial and syringe.
  • blood draw 1312 may include a fingerstick.
  • other samples may be obtained by the user as well.
  • the sample may be a saliva sample or any other biological sample corresponding to the user.
  • delivering the test kit may include delivering a single kit.
  • delivering the test kit may include delivering multiple test kits.
  • multiple test kits may be included such that the user may obtain multiple samples to ensure a sample of sufficient quality is acquired.
  • delivering the test kit may include delivering a schedule of kits. For example, a plurality of test kits may be scheduled for delivery in specific intervals over time such that a range of samples may be processed to observe the predicted epigenetic age of the user over time.
  • the schedule of test kits may be delivered to a user at one time with instructions on when to generate and prepare the next consecutive sample.
  • the test kit may be some other form of test kit, as indicated by block 1318.
  • Operation 1300 proceeds at block 1320 where a user generates a sample using the one or more test kits.
  • the sample generated by the user may be a blood sample.
  • the sample may be generated using a blood extraction device.
  • the blood sample may be disposed in any vessel suitable for collection.
  • the vessel may be a collection tube provided in the one or more test kits.
  • the vessel may include one or more indicators to indicate to the user when a sufficient volume of sample has been acquired.
  • the indicator may be a fill line on the lower portion of a collection tube that may indicate to a user that enough sample is collected.
  • the fill line may be at the middle or near the top of the collection tube.
  • the collection tube may contain multiple fill lines corresponding to different volumes of the sample to be collected. For example, a first line on collection tube may indicate that enough sample has been collected for DNA extraction, and an additional line above the first line may indicate that an additional sample aliquot has been collected for a consecutive DNA extraction. Additionally, while a blood sample is described as being generated above, it is expressly contemplated that other samples may be generated by the user as well, as indicated by block 1324.
  • Operation 1300 proceeds at block 1330 where the sample is preserved and prepared for transit.
  • Preservation of the sample may include sealing the sample in a vessel suitable for transportation.
  • a sample collected in a collection tube may be sealed with a collection tube cap.
  • preparation of the sample may include freezing the sample prior to transit. Additionally, it is expressly contemplated that other forms of sample preparation can occur as well.
  • Operation 1300 proceeds at block 1340 where the sample is sent for processing.
  • the sample may be sent by a delivery or shipment, as indicated by block 1342, in which the user generates a sample and stores it and any other necessary contents in an appropriate container for transit.
  • the container for shipment may be provided in the one or more test kits delivered to the user.
  • other forms of sample transportation can occur as well, as indicated by block 1344.
  • the user may directly deliver the generated sample to the site where DNA extraction occurs.
  • the user may deliver the sample to a facility that prepares it for preservation during long-term transit.
  • FIG. 14 illustrates a flow diagram showing an example operation 1400 of training an epigenetic age predicting model.
  • Operation 1400 begins at block 1410 where a plurality of methylation profiles from a plurality of individuals are received.
  • a methylation profile contains methylation values for a number of CpG sites of the individual.
  • Methylation values can be in a number of different formats. For example, as indicated above, an example format is a decimal between zero and one (β-value), where zero is fully unmethylated and one is fully methylated.
  • These methylation profiles for the individuals can be derived based on the method described above with respect to FIG. 11.
  • the methylation profiles can be obtained by the sequence data acquired via MPCR and sequencing of amplified DNA.
  • the plurality of individuals' ages are known. This known age can be used to generate the model.
  • the plurality of methylation values in each profile can be used as an input vector and the known age is the scalar output value.
  • the number of CpG sites determined by the sequence data and in each of the plurality of methylation profiles is m.
  • the quantity m can be a variety of different numbers, as indicated by block 1418.
  • m can correspond to a resolution of the methylation analysis method used on the plurality of individuals. For instance, Illumina provides methylation sequencing for 3.3 million or 36 million CpG sites.
  • Operation 1400 proceeds at block 1420 where the plurality of methylation profiles received in block 1410 are normalized or otherwise pre-processed.
  • the methylation profiles can be normalized based on age. For instance, a specific age range may be overrepresented in the plurality of profiles, in which case the profiles from this age range may need to be weighted less or made less likely to be sampled. In the alternative, a specific age range may be underrepresented in the plurality of profiles, in which case the profiles from this age range may need to be weighted higher or made more likely to be sampled.
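  • one possible weighting scheme, offered only as a sketch, assigns each profile a weight inversely proportional to the number of profiles in its age bin, so that every bin contributes equally in total. The 10-year bin width, the function name `age_bin_weights`, and the specific formula are assumptions; the disclosure does not prescribe a particular scheme.

```python
from collections import Counter
from typing import List, Sequence


def age_bin_weights(ages: Sequence[float], bin_width: int = 10) -> List[float]:
    """Weight each profile by total / (number_of_bins * profiles_in_its_bin)."""
    if not ages:
        return []
    bins = [int(age // bin_width) for age in ages]
    counts = Counter(bins)
    total, n_bins = len(ages), len(counts)
    return [total / (n_bins * counts[b]) for b in bins]


if __name__ == "__main__":
    # Three profiles in their twenties and one in their sixties.
    print(age_bin_weights([25, 27, 29, 65]))
```

  • the same weights can be normalized and used as sampling probabilities when resampling is preferred over weighting.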
  • the methylation profiles can be normalized based on sequence data quality.
  • for example, if the sequence data is of low quality, the methylation values received from this data may need to be weighted less. Additionally, if mispriming occurred resulting in incorrect amplification, the methylation values received from the sequence data relative to the DNA segment may be discarded.
  • the plurality of methylation profiles may be curated based on a quality metric. For example, methylation profiles that have a quality metric lower than a threshold are discarded or weighted less than methylation profiles of a higher quality metric.
  • the quality metric is indicative of the accuracy of amplification of the plurality of loci. In other examples, the quality metric is indicative of the quality of the sample used in amplification.
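  • curation against a quality threshold can be sketched as follows. The representation of a profile as an `(identifier, quality)` pair, the function name `curate_profiles`, and the fixed down-weight are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def curate_profiles(
    profiles: Iterable[Tuple[str, float]],
    threshold: float,
    down_weight: Optional[float] = None,
) -> Dict[str, float]:
    """Map profile id -> training weight.

    Profiles at or above the threshold get weight 1.0. Profiles below it are
    discarded when down_weight is None, otherwise kept with that lower weight.
    """
    weights: Dict[str, float] = {}
    for profile_id, quality in profiles:
        if quality >= threshold:
            weights[profile_id] = 1.0
        elif down_weight is not None:
            weights[profile_id] = down_weight
    return weights


if __name__ == "__main__":
    data = [("p1", 0.9), ("p2", 0.4)]
    print(curate_profiles(data, threshold=0.5))
    print(curate_profiles(data, threshold=0.5, down_weight=0.25))
```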
  • the plurality of methylation profiles can be normalized or pre-processed in other ways as well.
  • Operation 1400 proceeds at block 1430 where a feature selection operation is applied on the plurality of methylation profiles.
  • the feature selection is applied on the plurality of methylation profiles after they are pre-processed.
  • the feature selection operation is applied on the plurality of methylation profiles received in block 1410.
  • the feature selection operation applied on the plurality of methylation profiles includes elastic net regression. Elastic net regression combines L1 penalties from lasso regression and L2 penalties from ridge regression to reduce the number of CpG sites.
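  • for reference, one common way of writing the elastic net objective is given below. The notation is an assumption and does not come from the disclosure: K is the number of individuals, y_i is the known age of individual i, x_i is that individual's vector of m methylation values, β_0 is the intercept, β is the vector of m coefficients, λ ≥ 0 sets the overall penalty strength, and α ∈ [0, 1] mixes the L1 and L2 terms.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2K}\sum_{i=1}^{K}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
\;+\;
\lambda\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

  • the L1 term can drive individual coefficients to exactly zero, so the CpG sites whose coefficients remain nonzero after fitting form the reduced subset of sites.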
  • the feature selection applied reduces the number of applicable CpG sites from m to n sites.
  • in some examples, m is greater than 100,000. In some examples, m is greater than 400,000. In some examples, m is greater than 800,000. In some examples, n is less than 200. In some examples, n is less than 50. For instance, n may correspond to 42 sites. Additionally, as indicated by block 1438, n may correspond to a different number of sites as well.
  • Operation 1400 proceeds at block 1440 where a model is fit on the plurality of methylation profiles.
  • the model is fit on the plurality of methylation profiles, but only considers the CpG sites in the subset of n CpG sites.
  • the model can include a linear regression model.
  • the model can include a random forest model.
  • the model can include a different type of model as well.
  • Operation 1400 proceeds at block 1450 where the model is generated.
  • the model is generated as one or more files that can be imported by other systems which can further train the model or use the model for predictions.
  • the flow diagram in FIG. 8 applies to PCR-based methylation observations as well as chip array observations.
  • the input values may be, for example, the values derived from the PCR analysis indicative of methylation at the CpG sites in the plurality of loci, as noted above with respect to FIG. 11.
  • These CpG sites are identified by their CpG cluster identifier number.
  • only a subset of the shown CpG sites is used as inputs. For example, fewer than 42 of the listed CpG sites may be input relative to the number of CpG sites amplified.
  • the shown CpG sites are in order of their feature importance to the model (e.g., in the case of a random forest model) or in order of the absolute value of their coefficient (e.g., in a linear model).
  • 42 to 100 or 100 to 186 of the sites listed in FIG. 9 can be used.
  • with FIGS. 8-9 ordered by the absolute value of the coefficient and the actual coefficients provided in FIGS. 15-16, it will not require undue experimentation for a skilled person to select, from the 42 sites listed in FIG. 8 or the 186 sites listed in FIG. 9, sites with which to build an effective epigenetic clock.
  • sites not listed can be combined with the listed sites, either to expand the model or based on newly obtained data, without detracting from the inventors’ contributions to the art.
  • FIGS. 15-16 report coefficients associated with the 186 CpG sites listed in FIG. 9.
  • FIG. 15 lists the CpG sites that have positive coefficients, ordered by magnitude.
  • FIG. 16 lists the CpG sites that have negative coefficients, also ordered by magnitude.
  • among the positive coefficients, the values vary by about one-thousand fold, from 52.20 to 0.0555246.
  • among the negative coefficients, the range is about ten-thousand fold, from -61.66 to -0.0068451. Having these coefficients, one of skill in the art can select among the CpG sites at which to observe methylation and construct an effective epigenetic clock without undue experimentation.
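  • the stated fold ranges can be checked directly from the four endpoint values quoted above. The helper name `fold_range` is hypothetical; the ratios come out at roughly 940 and roughly 9,000, consistent with "about one-thousand fold" and "ten-thousand fold" as order-of-magnitude descriptions.

```python
def fold_range(largest: float, smallest: float) -> float:
    """Ratio of the largest to the smallest coefficient magnitude."""
    if smallest == 0:
        raise ValueError("smallest magnitude must be nonzero")
    return abs(largest) / abs(smallest)


positive_fold = fold_range(52.20, 0.0555246)
negative_fold = fold_range(-61.66, -0.0068451)

if __name__ == "__main__":
    print(f"positive coefficients: about {positive_fold:.0f}-fold")
    print(f"negative coefficients: about {negative_fold:.0f}-fold")
```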
  • FIG. 17 illustrates a distribution of methylation observations for cg01748572 from individuals of varying ages. Methylation, as reported either by an array chip or by PCR-based methods, is not binary; it is a proxy for the fraction of DNA methylated at a particular site.
  • the histogram on the right side of FIG. 17 shows the distribution, across all ages, of methylation values observed from population samples evaluated. The dashed line shows that, with aging, methylation at this site tends to decrease.
  • FIG. 18 is a scatter plot produced using an epigenetic clock having the 42 CpG site observations listed in FIG. 8, showing subjects’ calendar ages versus their calculated epigenetic ages. From this plot, residuals can be evaluated by seeing how far a predicted age varies from an actual age. This plot was generated by applying a linear regression model using the coefficients reported in FIGS. 15-16 for the 42 CpG site observations listed in FIG. 8 and an intercept of about -45. In this population, the residuals are distributed fairly randomly. The distribution corresponds to how young or old a person, say a 60-year-old, is for their age.
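  • a linear epigenetic clock of this kind computes the intercept plus the sum of each coefficient multiplied by the methylation value at its CpG site. The sketch below shows only the arithmetic: the site names and coefficient values are hypothetical placeholders and are not the values reported in FIGS. 15-16; only the intercept of -45 echoes the approximate figure given above.

```python
from typing import Dict


def predict_epigenetic_age(
    beta_values: Dict[str, float],
    coefficients: Dict[str, float],
    intercept: float,
) -> float:
    """Linear model: intercept + sum(coefficient * methylation value) over sites."""
    missing = set(coefficients) - set(beta_values)
    if missing:
        raise KeyError(f"missing methylation values for: {sorted(missing)}")
    return intercept + sum(coefficients[s] * beta_values[s] for s in coefficients)


def residual(predicted_age: float, calendar_age: float) -> float:
    """Positive when the predicted epigenetic age exceeds the calendar age."""
    return predicted_age - calendar_age


if __name__ == "__main__":
    coefficients = {"site_a": 120.0, "site_b": -30.0}  # hypothetical values
    beta_values = {"site_a": 0.8, "site_b": 0.5}       # hypothetical values
    age = predict_epigenetic_age(beta_values, coefficients, intercept=-45.0)
    print(age, residual(age, calendar_age=40.0))
```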
  • FIG. 19 reports calculated standard deviations (sigma values) associated with the 186 CpG sites listed in FIG. 9, rounded to four decimal places. From the standard deviations, σ, variance σ² is readily calculated. To obtain sufficient detail from the methylation values to generate an epigenetic age prediction, the minimum number of CpG sites needed can be determined by a variety of elimination procedures, such as a variance threshold procedure.
  • coefficient values, as previously described, corresponding to the methylation values for a respective CpG site may be ranked by standard deviation, and those with standard deviations below a particular pre-defined threshold can be eliminated from the predictive model, wherein the standard deviation is used as a proxy metric for the overall contribution of the methylation values at a particular CpG site to the epigenetic age prediction model.
  • the minimum number of CpG sites necessary may also be determined via a threshold for a minimum summed value of the top number of coefficients for methylation values corresponding to CpG sites. In either elimination or accumulation, variance can be used instead of standard deviation.
  • in FIG. 19, the 186 standard deviations for methylation values across a population analyzed for the CpG sites as in FIG. 9 have been rank-ordered from largest standard deviation to smallest standard deviation.
  • the sum of the standard deviations for the 36 highest σ-ranking CpG sites is 33.7067.
  • the sum of absolute values of coefficients for those 36 sites is 759.8.
  • the sum of the standard deviations for the 42 CpG sites listed in FIG. 8 is 33.5535, rounded to four digits after summation.
  • the sum of absolute values of coefficients for those 42 sites is 858.0.
  • a predetermined threshold can be applied to select sites with the largest coefficients, the largest standard deviations, or the largest variances. Selection of the 42 sites with the largest coefficient magnitudes has proven effective, as shown in FIGS. 17-18.
  • a threshold sum of coefficient magnitudes for selected sites can be predetermined, such as a sum of 858 or greater.
  • a threshold sum of standard deviations for selected sites can be predetermined, such as 33.
  • a threshold sum of variances for selected sites can be predetermined, such as 29. Using any of these threshold sums, selection among the 186 sites represented could be made without undue experimentation. Alternatively, a predetermined cutoff threshold could be set.
  • a coefficient smaller than -10.0 or -4.0 or -2.0 or greater than 10.0 or 8.0 or 4.0 could be predetermined to guide selection of sites.
  • a threshold cutoff of standard deviations for selected sites can be predetermined, such as 0.42.
  • a threshold cutoff of variances for selected sites can be predetermined, such as 0.18. Using any of these cutoff thresholds, selection among the 186 sites represented could be made without undue experimentation.
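  • both selection styles described above, a threshold on the accumulated sum and a per-site cutoff, can be sketched with the same ranking step. The function names and the toy scores are hypothetical; the scores may stand for coefficient magnitudes, standard deviations, or variances. The cutoff function uses a single symmetric magnitude, which is a simplification of the separate negative and positive cutoffs mentioned above.

```python
from typing import Dict, List


def select_by_threshold_sum(scores: Dict[str, float], threshold_sum: float) -> List[str]:
    """Add sites in order of decreasing magnitude until the running sum reaches the threshold."""
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    selected: List[str] = []
    running = 0.0
    for site, score in ranked:
        if running >= threshold_sum:
            break
        selected.append(site)
        running += abs(score)
    return selected


def select_by_cutoff(scores: Dict[str, float], cutoff: float) -> List[str]:
    """Keep every site whose magnitude is at least the cutoff."""
    return [site for site, score in scores.items() if abs(score) >= cutoff]


if __name__ == "__main__":
    toy_scores = {"a": 5.0, "b": -3.0, "c": 1.0, "d": 0.5}  # hypothetical values
    print(select_by_threshold_sum(toy_scores, threshold_sum=7.5))
    print(select_by_cutoff(toy_scores, cutoff=1.0))
```

  • if the threshold sum is never reached, the first function returns every site, which signals that the threshold is too high for the available set.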
  • using a threshold or cutoff, it is reasonably inferred, at the threshold set for 36, 42, 50, 100, 125, 150, 175, 200, 225, 250, 275 or 300 CpG sites, that the selected sites are sufficient to inform an epigenetic age prediction.
  • This threshold value may be a set minimum value, a range of values, or a particular percentile of the overall plurality of CpG sites represented in the dataset corresponding to selection of 36, 42, 50, 100, 125, 150, 175, 200, 225, 250, 275 or 300 CpG sites or a range between any two of these numbers, such as 42-200 CpG sites.
  • FIG. 10 is a computer system 1000 that can be used to implement the convolution-based base calling and the compact convolution-based base calling disclosed herein.
  • Computer system 1000 includes at least one central processing unit (CPU) 1072 that communicates with a number of peripheral devices via bus subsystem 1055.
  • peripheral devices can include a storage subsystem 1010 including, for example, memory devices and a file storage subsystem 1036, user interface input devices 1038, user interface output devices 1076, and a network interface subsystem 1074.
  • the input and output devices allow user interaction with computer system 1000.
  • Network interface subsystem 1074 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • the epigenetic age predictor 157 is communicably linked to the storage subsystem 1010 and the user interface input devices 1038.
  • Epigenetic age predictor 157 can include one or more models that receive a plurality of inputs and output an epigenetic age.
  • epigenetic age predictor 157 also outputs a confidence score.
  • input encoder 186 pre-processes and/or normalizes inputs before they are fed into a model of epigenetic age predictor.
  • other processing modules 188 can be implemented on the computer system 1000 to execute the technology disclosed.
  • User interface input devices 1038 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • use of the term "input device" is intended to include all possible types of devices and ways to input information into computer system 1000.
  • User interface output devices 1076 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem can also provide a non-visual display such as audio output devices.
  • use of the term "output device" is intended to include all possible types of devices and ways to output information from computer system 1000 to the user or to another machine or computer system.
  • Storage subsystem 1010 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by deep learning processors 1078.
  • Deep learning processors 1078 can include graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Deep learning processors 1078 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™.
  • Examples of deep learning processors 1078 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamIQ™, IBM TrueNorth™, and others.
  • Memory subsystem 1022 used in the storage subsystem 1010 can include a number of memories including a main random-access memory (RAM) 1032 for storage of instructions and data during program execution and a read-only memory (ROM) 1030 in which fixed instructions are stored.
  • a file storage subsystem 1036 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • the modules implementing the functionality of certain implementations can be stored by file storage subsystem 1036 in the storage subsystem 1010, or in other machines accessible by the processor.
  • Bus subsystem 1055 provides a mechanism for letting the various components and subsystems of computer system 1000 communicate with each other as intended. Although bus subsystem 1055 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • Computer system 1000 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1000 depicted in FIG. 10 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 1000 are possible having more or fewer components than the computer system depicted in FIG. 10.
  • a method of obtaining information useful to determine an age of an individual comprises obtaining genomic DNA from blood cells derived from the individual and observing cytosine methylation of cg27330757, cg04777312, cg13740515, cg07642291 and cg27405400 CG loci designations in the individual's genomic DNA.
  • the observing comprises performing a bisulfite conversion process on the genomic DNA so that cytosine residues in the genomic DNA are transformed to uracil.
  • the method proceeds with comparing and correlating the CG locus methylation observed in the five identified sites to the CG locus methylation observed in genomic DNA from blood cells derived from a group of individuals of known ages. Information useful to determine an epigenetic age of the individual is obtained.
  • This method implementation and other systems disclosed optionally include one or more of the following features.
  • Methods can also include features described in connection with systems disclosed. In the interest of conciseness, alternative combinations of method features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
  • This method can be extended by observing cytosine methylation of more sites.
  • the additional sites can be at least five or ten or fifteen or twenty CG loci in the genomic DNA selected from the group consisting of CG locus designation: cg16300556, cg08097417, cg02383785, cg23197007, cg07073120, cg04028010, cg02447229, cg07394446, cg07124372, cg06745229, cg14283887, cg11359984, cg18691434, cg12112234, cg17243289, cg03607117, cg13663218, cg10091775, cg06540876, cg04903884, cg07502389, cg13702357, cg22575379, cg18506678, c
  • the group from which the additional sites are selected can be expanded to the 75 or 100 or 125 or 150 highest ranking sites listed in FIGS. 9, 15-16 or 20.
  • the count of sites selected can be expanded as the group is expanded, so that 50, 75, 100, 125 or 150 sites from the group are selected.
  • Example sums, based on the values given in the figures are: the selection.
  • Example thresholds, based on the values given in the figures are:
  • This method can be extended by observing cytosine methylation of more sites.
  • the additional sites can be at least thirty-seven CG loci in the genomic DNA selected from the group consisting of CG locus designation which are the remaining 181 sites listed in FIG.
  • the comparing and correlating are applied to the at least 37 additional CG loci selected from the group.
  • the method can be extended by observing cytosine methylation of up to 95 additional CG loci in the genomic DNA selected from the group consisting of CG locus designation which are the remaining 181 sites listed in FIG. 9.
  • the method can apply a random forest or linear regression model to data including the CG loci designations in the genomic DNA to calculate the epigenetic age.
  • the linear regression model can use coefficients corresponding to the CG loci designations in the genomic DNA.
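A minimal sketch of how a linear regression model with per-site coefficients could turn methylation values into an age estimate is given below. It assumes the usual linear form (an intercept plus the sum of coefficient times methylation value); the site names, coefficients and intercept are hypothetical placeholders, not values from the figures.

```python
def predict_age(methylation, coefficients, intercept):
    """Linear prediction: intercept + sum(coefficient * methylation value).

    methylation:  dict mapping site id -> methylation value between 0 and 1
    coefficients: dict mapping site id -> regression coefficient
    """
    missing = [site for site in coefficients if site not in methylation]
    if missing:
        raise KeyError(f"no methylation value for sites: {missing}")
    return intercept + sum(
        coefficients[site] * methylation[site] for site in coefficients
    )


# Placeholder model and sample, for illustration only.
coefficients = {"site_a": 20.0, "site_b": -10.0}
intercept = 40.0
sample = {"site_a": 0.5, "site_b": 0.25}

predicted_age = predict_age(sample, coefficients, intercept)
print(predicted_age)  # 40 + 20*0.5 - 10*0.25 = 47.5
```

Sites with positive coefficients raise the estimate as methylation increases, and sites with negative coefficients lower it.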
  • the method also can be extended by observing cytosine methylation of one, two, three, four or more sites identified by markers cg15769472, cg05697231, cg03545227, cg16655791 and cg23686029, then further applying the comparing and correlating.
  • the DNA sample can be obtained from a finger prick blood sample.
  • a method of obtaining information useful to determine an age of an individual including observing cytosine methylation of cg00530720, cg02383785, cg02447229, cg03545227, cg03607117, cg04028010, cg04777312, cg04903884, cg05697231, cg06419846, cg06540876, cg06745229, cg07073120, cg07124372, cg07394446, cg07502389, cg07642291, cg07843568, cg08097417, cg10070101, cg10091775, cg11359984, cg12112234, cg13663218, cg13702357, cg13740515, cg14283887, cg15769
  • implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the method described above.
  • implementations may include a system configured to perform the actions of the methods described herein.
  • Another method disclosed generates an epigenetic age prediction or can be applied to generate information useful for epigenetic age prediction.
  • This method includes obtaining genomic DNA from blood cells derived from the individual and observing cytosine methylation at CG loci designations in the genomic DNA.
  • the observing includes at least 42 CG loci designations in the genomic DNA selected from the group of 186 CG loci designations with accompanying σ (standard deviation) values in FIG. 19, such that the observed CG loci designations have a summation of σ values greater than 33.5 or 33.6 or 40 or 45 or 50.
  • the observing comprises performing a bisulfite conversion process on the genomic DNA so that cytosine residues in the genomic DNA are transformed to uracil.
  • the method further includes comparing and correlating the observed CG locus methylation with the CG locus methylation observed in genomic DNA from blood cells derived from a group of individuals of known ages.
  • a result of the method is that information useful to determine an epigenetic age of the individual is obtained.
  • This method implementation and other systems disclosed optionally include one or more of the following features.
  • the method can further include applying a linear regression or random forest model to data including the observed loci designations in the genomic DNA to calculate the epigenetic age.
  • the observed loci designations can have a summation of σ² (variance) values greater than 29.4 or 30.1 or 35 or 38 or 41. This can be an alternative to the σ (standard deviation) criterion, as variance is the square of the standard deviation.
  • the observed loci designations can have a summation of coefficient magnitudes in FIGS. 15-16 greater than 858, 900, 950 or 1000. This can be an alternative to the σ (standard deviation) criterion or a supplemental criterion.
  • the method can include applying a linear regression or random forest model to data including the observed loci designations in the genomic DNA to calculate the epigenetic age.
  • the method can further include observing cytosine methylation at CG loci designations in the genomic DNA, comprising observing at least 60, 80 or 100 CG loci designations in the genomic DNA selected from the group.
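The summation criteria above can be checked mechanically for any candidate panel of sites. The sketch below uses the smallest thresholds quoted in the text (at least 42 sites, a standard deviation sum above 33.5, a variance sum above 29.4, a coefficient magnitude sum above 858). The panel is a hypothetical placeholder in which every site has the same coefficient and standard deviation, chosen only to show that the alternative criteria need not agree.

```python
# Thresholds quoted in the text; the panel below is a hypothetical placeholder.
MIN_SITES = 42
MIN_SUM_STD = 33.5
MIN_SUM_VARIANCE = 29.4
MIN_SUM_ABS_COEF = 858.0


def check_panel(panel):
    """panel: list of (coefficient, standard deviation) pairs, one per observed site."""
    sum_std = sum(std for _, std in panel)
    sum_var = sum(std ** 2 for _, std in panel)  # variance = standard deviation squared
    sum_abs_coef = sum(abs(coef) for coef, _ in panel)
    return {
        "enough_sites": len(panel) >= MIN_SITES,
        "std": sum_std > MIN_SUM_STD,
        "variance": sum_var > MIN_SUM_VARIANCE,
        "coefficient": sum_abs_coef > MIN_SUM_ABS_COEF,
    }


# 42 identical placeholder sites: coefficient 21.0, standard deviation 0.8.
panel = [(21.0, 0.8)] * 42
result = check_panel(panel)
print(result)
```

For this placeholder panel the standard deviation sum (about 33.6) and the coefficient sum (882) pass, while the variance sum (about 26.9) does not, which is why the text presents these sums as alternative or supplemental criteria rather than equivalent ones.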
  • Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the method described above.
  • Yet another implementation may include a system configured to perform the actions of the methods described herein.
  • a method of generating an epigenetic age prediction includes receiving a sequence of inputs corresponding to methylation values at a plurality of CpG sites.
  • receiving the sequence includes receiving at least 42 CpG sites.
  • the method includes applying a model to the sequence of inputs to generate the epigenetic age prediction.
  • the plurality of CpG sites include sites identified by markers cg27330757, cg04777312, cg13740515, cg07642291 and cg27405400.
  • This method implementation and other systems disclosed optionally include one or more of the following features.
  • the model in the method includes a random forest model.
  • the model in the method includes a linear regression model.
  • the model includes a plurality of coefficients corresponding to one or more of the CpG sites.
  • absolute values of coefficients corresponding to sites identified by markers cg27330757, cg04777312, cg13740515, cg07642291 and cg27405400 are in the top half of absolute values of the plurality of coefficients.
  • the absolute values of coefficients corresponding to sites identified by markers cg27330757, cg04777312, cg13740515, cg07642291 and cg27405400 are in the top quartile of absolute values of the plurality of coefficients.
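Whether a given set of markers falls in the top half or top quartile of coefficient magnitudes can be tested by ranking all coefficients by absolute value. The following is a sketch under stated assumptions: the site names and coefficients are placeholders, the helper name is hypothetical, and ties and rounding of the cut position are handled in the simplest possible way.

```python
def in_top_fraction(coefficients, markers, fraction):
    """Return True if every marker ranks within the top `fraction` of sites by |coefficient|."""
    ranked = sorted(coefficients, key=lambda site: abs(coefficients[site]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # number of sites in the top fraction
    top = set(ranked[:keep])
    return all(marker in top for marker in markers)


# Placeholder coefficients for eight hypothetical sites.
coefficients = {
    "site_a": -30.0, "site_b": 25.0, "site_c": 12.0, "site_d": -9.0,
    "site_e": 4.0, "site_f": -3.0, "site_g": 2.0, "site_h": 1.0,
}
markers = ["site_a", "site_c"]

in_top_half = in_top_fraction(coefficients, markers, 0.5)
in_top_quartile = in_top_fraction(coefficients, markers, 0.25)
print(in_top_half, in_top_quartile)
```

Here both markers are among the four largest magnitudes, but only one is among the two largest, so the top-half condition holds and the top-quartile condition does not.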
  • the plurality of CpG sites include one or two or three or four or more sites identified by markers cg15769472, cg05697231, cg03545227, cg16655791 and cg23686029.
  • the plurality of CpG sites include fewer than 50 sites.
  • the plurality of CpG sites include five or ten or fifteen or twenty or more of sites identified by markers cg16300556, cg08097417, cg02383785, cg23197007, cg07073120, cg04028010, cg02447229, cg07394446, cg07124372, cg06745229, cg14283887, cg11359984, cg18691434, cg12112234, cg17243289, cg03607117, cg13663218, cg10091775, cg06540876, cg04903884, cg07502389, cg13702357, cg22575379, cg18506678, cg00530720, cg07843568, cg06419846, cg10070
  • receiving the sequence of inputs in the method includes determining the methylation values at the plurality of CpG sites based on a blood sample.
  • a method of generating an epigenetic clock predictor includes receiving a plurality of methylation profiles from a plurality of individuals, the plurality of methylation profiles comprising methylation values for m CpG sites.
  • the method includes training a model based on the plurality of methylation profiles, the model being configured to predict an epigenetic age based on methylation values for n CpG sites.
  • the n CpG sites contain fewer CpG sites than the m CpG sites.
  • the n CpG sites include one or more of the CpG sites identified by the following markers: cg27330757, cg04777312, cg13740515.
  • 42 ≤ n CpG sites ≤ 200.
  • the method includes selecting n CpG sites as a subset from m CpG sites.
  • selecting n CpG sites comprises applying elastic net regression on the plurality of methylation profiles.
  • training the model is based only on methylation values for n CpG sites in the plurality of methylation profiles.
  • training the model comprises training a linear regression model.
  • training the model comprises training a random forest model.
  • the n CpG sites include one or two or more of the CpG sites identified by the following markers: cg23197007, cg07073120 and cg11359984.
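To illustrate how elastic net regression can reduce m candidate CpG sites to a smaller set of n sites, the sketch below implements a small coordinate-descent elastic net in plain Python and keeps the sites whose coefficients remain non-zero. It is a toy under stated assumptions, not the training procedure of the disclosure: the objective is (1/2N)·||y − b − Xw||² + alpha·l1_ratio·||w||₁ + 0.5·alpha·(1 − l1_ratio)·||w||², the values of alpha and l1_ratio are arbitrary, and the four methylation profiles and ages are invented placeholders.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it falls inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate descent on centered data; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residual while all weights are zero
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [Xc[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            # covariance of site j with the residual that leaves site j out
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            resid = [r - c * (new_w - w[j]) for c, r in zip(col, resid)]
            w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, col_means))
    return w, intercept


# Placeholder data: m = 3 candidate sites, 4 individuals of known age.
site_names = ["site_a", "site_b", "site_c"]
profiles = [
    [0.7, 0.7, 0.7],
    [0.7, 0.3, 0.3],
    [0.3, 0.7, 0.3],
    [0.3, 0.3, 0.7],
]
ages = [70.2, 69.8, 30.2, 29.8]  # driven mostly by site_a, weakly by site_b

weights, intercept = elastic_net(profiles, ages)
selected = [name for name, wj in zip(site_names, weights) if wj != 0.0]
print(selected)  # the n sites retained out of the m candidates
```

In this toy data only the strongly age-associated site survives the penalty. Because the penalty also shrinks the surviving coefficient, a separate model (for example an ordinary linear regression or a random forest) can then be trained only on the n selected sites, as the surrounding text describes.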
  • in another implementation, we disclose an epigenetic age predictor.
  • the epigenetic age predictor includes an input component configured to receive a sequence of inputs corresponding to methylation values at CpG sites.
  • the CpG sites include the CpG site identified by marker cg27330757.
  • the epigenetic age predictor predicts an epigenetic age of an individual based on the sequence of inputs.
  • the epigenetic age predictor comprises a linear regression model or a random forest model.
  • the CpG sites further include the CpG sites identified by markers cg04777312, cg13740515 and cg07642291.
  • a method of generating an epigenetic age prediction optionally includes providing a sample extraction test kit.
  • the method includes receiving a sample extracted by the sample extraction test kit.
  • the method includes extracting DNA from the sample.
  • the method optionally includes processing the DNA to receive processed DNA.
  • the method optionally includes amplifying a plurality of loci in the processed DNA to receive amplified DNA.
  • the method includes processing DNA to receive a plurality of methylation values for at least one, or 42 or more, CpG sites in the plurality of loci. Information useful to determine an epigenetic age of the individual can be obtained by this method.
  • This method implementation and other systems disclosed optionally include one or more of the following features.
  • the method can further include processing the plurality of loci by applying multiplex polymerase chain reactions (MPCR) that expose the DNA to a plurality of primer pairs in a single reaction mixture to amplify a subset of the plurality of 42 or more loci, and concurrently performing a plurality of MPCRs for the sample.
  • amplifying the plurality of loci in the processed DNA in the method includes performing a polymerase chain reaction (PCR) to receive the amplified DNA.
  • amplifying the plurality of loci in the processed DNA in the method includes performing a multiplex polymerase chain reaction (MPCR) to receive the amplified DNA.
  • performing the MPCR to receive the amplified DNA in the method includes using a plurality of primer pairs with the processed DNA to amplify the plurality of loci.
  • the plurality of primer pairs in the method correspond to one or two or more sites in the plurality of loci identified by markers cg27330757, cg04777312, cg13740515, cg07642291 and cg27405400.
  • the plurality of primer pairs in the method correspond to one or two or more sites in the plurality of loci identified by markers cg15769472, cg05697231, cg03545227, cg16655791 and cg23686029.
  • the plurality of primer pairs in the method correspond to one or two or more of sites in the plurality of loci identified by markers cg16300556, cg08097417, cg02383785, cg23197007, cg07073120, cg04028010, cg02447229, cg07394446, cg07124372, cg06745229, cg14283887, cg11359984, cg18691434, cg12112234, cg17243289, cg03607117, cg13663218, cg10091775, cg06540876, cg04903884, cg07502389, cg13702357, cg22575379, cg18506678, cg00530720, cg07843568, cg06419846, cgg
  • processing the DNA in the method includes treating the quality-controlled DNA with sodium bisulfite to produce treated DNA.
  • the sample in the method is a blood sample.
  • the one or more CpG sites in the method include a plurality of
  • the plurality of CpG sites include 42 sites.
  • processing the amplified DNA in the method includes sequencing the amplified DNA to obtain sequence data indicative of the plurality of methylation values for the one or more CpG sites.
  • a method of generating an epigenetic clock predictor includes receiving a plurality of methylation profiles from a plurality of individuals based on sequence data of the plurality of individuals, the sequence data corresponding to a degree of methylation at a plurality of CpG sites.
  • the method includes training a model based on the plurality of methylation profiles, the model being configured to predict an epigenetic age based on methylation values obtained from the sequence data.
  • the sequence data in the method corresponds to the degree of methylation at the CpG sites identified by markers cg27330757, cg04777312, and cg13740515.
  • the plurality of CpG sites include 42 sites.
  • the sequence data in the method is generated by amplifying a plurality of loci, the plurality of loci including the plurality of CpG sites.
  • training the model includes training a linear regression model.
  • training the model includes training a random forest model.
  • the method includes receiving a sample.
  • the method includes extracting DNA from the sample.
  • the method includes treating the DNA with sodium bisulfite to receive treated DNA.
  • the method includes amplifying a plurality of loci in the treated DNA using a plurality of primer pairs to receive a plurality of amplified DNA strands, the plurality of amplified DNA strands including one or more CpG sites.
  • the method includes sequencing the plurality of amplified DNA strands to receive sequence data indicative of methylation at the one or more CpG sites.
  • the method includes processing the sequence data to receive a plurality of methylation values for the one or more CpG sites.
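One common way to turn bisulfite sequencing data into methylation values is to count, at each CpG cytosine, the reads that still show a C (methylated, protected from conversion) and the reads that show a T (unmethylated, converted to uracil and read as thymine after amplification). The sketch below assumes such per-site read counts are already available; the site names and counts are placeholders, and the function names are hypothetical.

```python
def methylation_value(c_reads, t_reads):
    """Fraction of reads reporting a methylated cytosine: C / (C + T).

    Returns None when the site has no coverage.
    """
    total = c_reads + t_reads
    if total == 0:
        return None
    return c_reads / total


def methylation_profile(read_counts):
    """Map each site id to its methylation value, given (C reads, T reads) per site."""
    return {site: methylation_value(c, t) for site, (c, t) in read_counts.items()}


# Placeholder read counts per site: (reads showing C, reads showing T).
read_counts = {"site_a": (30, 10), "site_b": (5, 15), "site_c": (0, 0)}
profile = methylation_profile(read_counts)
print(profile)
```

The resulting values lie between 0 and 1 and can serve as the sequence of inputs to a trained model. Sites without coverage are reported as None so that a downstream step can decide how to handle them.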

Abstract

The present invention relates to a method for generating information useful for epigenetic age prediction. A set of 186 significant CpG sites is identified. Various uses of subsets of these sites are described herein, along with guidance for selecting those subsets. The method can be extended by receiving a sample extracted using an at-home sample collection kit. Finger-prick blood or saliva can be collected using an at-home sample extraction test kit. When 42 to 100 or 200 or 300 CpG sites are selected, PCR-based methods can be applied to samples at lower cost and with less processing time than using an array chip to assess methylation. Systems, computer-readable media, and computer programs are also disclosed.
PCT/IB2022/060944 2021-11-12 2022-11-14 Generation of epigenetic age information WO2023084486A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US17/525,552 2021-11-12
US17/525,552 US20230154560A1 (en) 2021-11-12 2021-11-12 Epigenetic Age Predictor
US17/831,427 2022-06-02
US17/831,427 US11781175B1 (en) 2022-06-02 2022-06-02 PCR-based epigenetic age prediction

Publications (1)

Publication Number Publication Date
WO2023084486A1 (fr) 2023-05-19

Family

ID=84361943

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/060944 WO2023084486A1 (fr) 2021-11-12 2022-11-14 Generation of epigenetic age information

Country Status (1)

Country Link
WO (1) WO2023084486A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2711431A1 (fr) * 2012-09-24 2014-03-26 Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen Method for determining the age of a human individual
US20150259742A1 (en) * 2012-11-09 2015-09-17 The Regents Of The University Of California Methods for predicting age and identifying agents that induce or inhibit premature aging
US20190185938A1 (en) * 2016-08-05 2019-06-20 The Regents Of The University Of California Dna methylation based predictor of mortality
US20200190568A1 (en) * 2018-12-10 2020-06-18 OneSkin Technologies, Inc. Methods for detecting the age of biological samples using methylation markers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JAGANATHAN, K. ET AL.: "Predicting splicing from primary sequence with deep learning", CELL, vol. 176, 2019, pages 535 - 548
VIDAKI ATHINA ET AL: "DNA methylation-based forensic age prediction using artificial neural networks and next generation sequencing", FORENSIC SCIENCE INTERNATIONAL: GENETICS, ELSEVIER BV, NETHERLANDS, vol. 28, 28 February 2017 (2017-02-28), pages 225 - 236, XP029963808, ISSN: 1872-4973, DOI: 10.1016/J.FSIGEN.2017.02.009 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22812801

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)