CROSS-REFERENCE TO RELATED APPLICATIONS
-
This application claims the benefit of U.S. Provisional Patent Application No. 62/555,597, filed on Sep. 7, 2017; the content of this application is hereby incorporated by reference in its entirety. In addition, co-pending application entitled “Systems and Methods For Leveraging Relatedness In Genomic Data Analysis” filed on Sep. 7, 2018 is also incorporated by reference in its entirety.
FIELD
-
The disclosure relates generally to a prediction model of relatedness in a human population. More particularly, the disclosure relates to systems and methods for preparing a relatedness model in a human population and identifying a model for selecting a subset of individuals from a population for a genetic study.
BACKGROUND
-
Human disease conditions are caused and influenced not only by environmental factors, but also by genetic factors. An understanding of genetic variation in human populations is therefore important for an understanding of the etiology and progression of human diseases, as well as for the identification of novel drug targets for the treatment of these diseases.
-
Genetic studies of health care populations are particularly useful in this regard because of the availability of extensive health care data, which facilitates research into how genetic variants contribute to disease conditions in humans. In the past, such studies were usually based on genome-wide genetic linkage analyses to map disease loci, which, once identified, could then be further analyzed in detail on the molecular level. Over the last few years, the widespread availability of high-throughput DNA sequencing technologies has allowed the parallel sequencing of the genomes of hundreds of thousands of humans. In theory, the data obtained from high-throughput DNA sequencing technologies represent a powerful source of information that can be used to decipher the genetic underpinnings of human diseases. The number and scale of such large human sequencing projects, including DiscovEHR (Dewey et al. (2016) Science, 354, aaf6814), UK Biobank/the US government's All of Us (part of the Precision Medicine Initiative) (Collins and Varmus (2015) N. Engl. J. Med. 372, 793-795), TOPMed, ExAC/gnomAD (Lek et al. (2016) Nature 536, 285-291), and many others, are rapidly growing. Many of these studies are collecting samples from integrated health care populations that have accompanying phenotype-rich electronic health records (EHRs) with a goal of combining the EHRs and genomic sequence data to catalyze translational discoveries and precision medicine (Dewey et al. (2016) Science, 354, aaf6814).
-
Traditionally, the high expense of large-scale genetic studies and the limited resources of individual investigators have generated study populations exhibiting shallow ascertainment of individuals from a variety of geographical areas. To improve statistical power, researchers combine samples from many different collection centers into larger cohorts, and these cohorts are often merged into larger consortia consisting of tens to hundreds of thousands of individuals. Although the total number of individuals sampled is often high, these studies typically only sample a relatively small portion of individuals in any given geographic area. Because such traditional population-based studies have generally collected samples from multiple geographical areas, they most commonly exhibit the broadest “class” of relatedness: population structure. Population structure (often referred to as “substructure” or “stratification”) within a genetic study results when the allele frequencies of different ancestral groups, or “genetic demes,” are more similar within than between demes. Genetic demes arise as a result of more-recent genetic isolation, drift, and migration patterns. Ascertainment of individuals within genetic demes can generate distant cryptic relatedness (Henn et al. (2012) PLoS ONE 7, e34267; Han et al. (2017) Nat. Commun. 8, 14238), the second “class” of relatedness, defined here as third- to ninth-degree relatives. These distant relatives are unlikely to be identifiable from the EHR, but are important because usually one or more large segments of their genomes are identical by descent, depending on their degree of relatedness and the recombination and segregation of alleles (Huff et al. (2011) Genome Res. 21, 768-774). Distant cryptic relatedness is usually limited in study cohorts built from small samplings of large populations, but the level of cryptic relatedness increases substantially as the effective population size decreases and the sample size increases. Finally, unless designed to collect families, traditional population-based studies typically have very little family structure: the third “class” of relatedness, consisting of first- and second-degree relationships (Sudlow et al. (2015) PLoS Med. 12, e1001779; Han et al. (2017) Nat. Commun. 8, 14238; Fuchsberger et al. (2016) Nature 536, 41-47; Locke et al. (2015) Nature 518, 197-206; Surendran et al. (2016) Nat. Genet. 48, 1151-1161).
-
An increase in family structure in a cohort can have significant implications for the choice and execution of downstream analyses and must be considered thoughtfully. In order to select a statistical tool for analyzing any population, knowledge of the amount of relatedness in the population plays a critical role (Santorico et al. (2014) Genet. Epidemiol. 38 (Suppl 1), S92-S96; Hu et al. (2014) Nat. Biotechnol. 32, 663-669; Price et al. (2010) Nat. Rev. Genet. 11, 459-463; Kang et al. (2010) Nat. Genet. 42, 348-354; Sun and Dimitromanolakis (2012) Methods Mol. Biol. 850, 47-57; Devlin and Roeder (1999) Biometrics 55, 997-1004; and Voight and Pritchard (2005) PLoS Genet. 1, e32). For example, some tools (e.g., principal component [PC] analysis) assume all individuals are unrelated, some (e.g., linear mixed models) effectively handle estimates of pairwise relationships, and others (e.g., linkage and TDT analyses) can directly leverage pedigree structures.
-
Removal of family structure (i.e., selectively excluding samples to eliminate relationships) reduces the sample size and power while discarding potentially valuable relationship information. If a pedigree structure is needed for an analysis or visualization, pedigree structures can be reconstructed directly from the genetic data with tools such as PRIMUS (Staples et al. (2014) Am. J. Hum. Genet. 95, 553-564) and CLAPPER (Ko and Nielsen (2017) PLoS Genet. 13, e1006963). Relatedness and family structure in a dataset can also provide insights into the identification and characterization of variants in that dataset. Thus, a dataset with relatedness among its population is helpful for analyses that leverage relatedness, such as reconstructing pedigrees, phasing compound heterozygous mutations (CHMs), and detecting de novo mutations (DNMs).
-
However, the growing size of today's datasets requires constant innovation in bioinformatics tools and analysis pipelines to continue handling them efficiently. When selecting a dataset, it is often unclear how much relatedness researchers should expect to see and whether it would follow the levels of relatedness seen in previous population-based genomic studies. Given the impact of relatedness on downstream analyses, there is a need to determine whether the amount of relatedness observed in a dataset is expected, whether it is unique to that dataset, and how much it would grow as the sequenced cohort expands. The disclosure addresses this need.
SUMMARY
-
In one aspect, an exemplary embodiment of the disclosure provides a prediction model of relatedness in a human population. The prediction model may be prepared by a process that comprises establishing a first population dataset; performing a burn-in phase of 120 years to establish a second population dataset; and modifying the second population dataset by conducting one or more of the following steps: (a) move individuals in the second population dataset to an age pool in accordance with the age of the individuals; (b) choose pairs of a single man and a single woman being more distantly related than first cousins at random from single men and single women in the second population dataset and let them marry at specified marriage by age parameters, wherein pairs are chosen until a number of marriages is reached as specified by marriage rate parameters; (c) divorce married couples at a specified divorce rate, wherein married couples are chosen at random from the second population dataset and marked as single upon divorce; (d) choose pairs of a single man and a single woman or married couples at random from the second population dataset in a specified ratio and allow them to reproduce according to specified fertility rates until a target number of successful conceptions is reached, wherein parents are restricted to being more distantly related than first cousins, and wherein all individuals in the second population dataset are limited to having one child per year; (e) allow individuals in the second population dataset to pass away at a specified death rate and at specified mortality by age parameters; (f) allow individuals to migrate to and from the second population dataset, whereby the population's age and sex distributions and the proportion of married fertile aged individuals in the second population dataset are maintained; and (g) allow individuals to move within the second population dataset, whereby individuals from a sub-population are selected at random and assigned at random to another sub-population if present until a specified move rate between sub-populations is achieved; and repeating one or more of steps (a) to (g) iteratively at one-year intervals for a pre-determined number of years, wherein the steps are applied to the population dataset resulting from the previous iteration in order to generate a final population dataset representing the prediction model of relatedness in a human population at a predetermined time.
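-
By way of illustration only, the year-by-year cycle of steps (a) to (g) can be organized as a simple simulation loop. The following sketch is a minimal, hypothetical outline in Python; the class, method, and parameter names are assumptions for illustration and do not represent the actual implementation of the prediction model. Only the age-pool update of step (a) is written out; the remaining steps are indicated by comments and are elaborated elsewhere in this disclosure.

```python
# Minimal structural sketch of the year-by-year cycle in steps (a)-(g).
# All names and defaults here are illustrative assumptions, not the actual
# implementation; only the age-pool update of step (a) is filled in.
from dataclasses import dataclass, field

@dataclass(eq=False)
class Person:
    age: int
    sex: str                 # "M" or "F"
    spouse: object = None
    parents: tuple = ()      # (mother, father) when known; () for founders

@dataclass
class SubPopulation:
    min_fertile_age: int = 15
    max_fertile_age: int = 49
    juveniles: list = field(default_factory=list)
    fertile: list = field(default_factory=list)
    aged: list = field(default_factory=list)

    def advance_one_year(self, params: dict) -> None:
        # (a) Age: everyone gets one year older and moves to the next pool
        #     when crossing a fertility-age boundary.
        for pool in (self.juveniles, self.fertile, self.aged):
            for person in pool:
                person.age += 1
        self.fertile.extend(p for p in self.juveniles if p.age >= self.min_fertile_age)
        self.juveniles = [p for p in self.juveniles if p.age < self.min_fertile_age]
        self.aged.extend(p for p in self.fertile if p.age > self.max_fertile_age)
        self.fertile = [p for p in self.fertile if p.age <= self.max_fertile_age]
        # (b) Marry singles more distantly related than first cousins, up to the
        #     yearly marriage target (marriage rate x population size).
        # (c) Divorce randomly chosen couples at the divorce rate.
        # (d) Reproduce until the yearly conception target is met, subject to the
        #     first-cousin restriction and one child per individual per year.
        # (e) Remove individuals at the death rate, weighted by age-specific mortality.
        # (f) Apply in-/out-migration while preserving age and sex distributions.
        # (g) Move individuals between sub-populations at the specified move rate.

def simulate(subpop: SubPopulation, params: dict,
             burn_in_years: int = 120, sim_years: int = 101) -> SubPopulation:
    # Burn-in establishes family structure; the simulation proper then runs year by year.
    for _ in range(burn_in_years + sim_years):
        subpop.advance_one_year(params)
    return subpop
```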
-
In some exemplary embodiments, establishing the first population dataset further includes specifying a number of sub-populations and sizes.
-
In some exemplary embodiments, establishing the first population dataset further includes assigning ages to individuals in the first population dataset between zero and a maximum age of fertility.
-
In some exemplary embodiments, the maximum age of fertility is 49 years.
-
In some exemplary embodiments, performing the burn-in phase further includes keeping numbers of births and deaths of individuals in the second population dataset equal and the rate of net migration of individuals zero.
-
In some exemplary embodiments, performing the burn-in phase further includes moving individuals in the second population dataset from a juvenile pool to a mating pool as individuals age above a minimum age of fertility; moving individuals from the mating pool to an aged pool as individuals age above a maximum age of fertility; and removing individuals from all age pools if the individuals emigrate or pass away.
-
In some exemplary embodiments, the minimum age of fertility is 15 years and the maximum age of fertility is 49 years.
-
In another aspect, an exemplary embodiment of the disclosure provides a method of using the prediction model, wherein ascertaining individuals is performed at random.
-
In another aspect, the disclosure provides a method of using the prediction model, wherein ascertaining individuals is performed in a clustered fashion.
-
In some exemplary embodiments, ascertaining individuals further includes gathering relatedness data and relevant statistics about ascertained individuals including first- or second-degree relationships among ascertained individuals, or both.
-
In some exemplary embodiments, the prediction model may further comprise selecting the human population for genetic analysis based on the final population dataset. The genetic analysis may include pedigree reconstruction, phasing compound heterozygous mutations, detecting de novo mutations, or combinations thereof.
-
In some exemplary embodiments, the human population includes multiple human populations and generating the final population data set includes generating a final population dataset for each of the multiple human populations, and further comprising selecting one of the multiple human populations for genetic analysis based on the final population datasets.
BRIEF DESCRIPTION OF THE DRAWINGS
-
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
-
FIG. 1 is a flow chart of a method of making a prediction model of relatedness in a human population according to one exemplary embodiment.
-
FIG. 2 is an exemplary operating environment.
-
FIG. 3 illustrates a plurality of system components configured for performing the disclosed methods.
-
FIGS. 4A and 4B show a comparison between the ascertainment of first-degree relatives among 61K DiscovEHR participants and random ascertainment of simulated populations according to an exemplary embodiment of the disclosure. Panel A shows ascertainment of first-degree relative pairs and Panel B shows ascertainment of the number of individuals with one or more first-degree relatives.
-
FIGS. 5A and 5B show a comparison between the ascertainment of first-degree relatives among 92K expanded DiscovEHR participants and random ascertainment of simulated populations according to an exemplary embodiment of the disclosure. Panel A shows ascertainment of first-degree relative pairs and Panel B shows ascertainment of the number of individuals with one or more first-degree relatives.
-
FIGS. 6A, 6B, 6C, and 6D show a simulated population and fit of a clustered ascertainment approach to the accumulation of first-degree relatedness in the DiscovEHR cohort according to an exemplary embodiment of the disclosure. Panel A shows accumulation of pairs of first-degree relatives; Panel B shows the proportion of the ascertained participants that have one or more first-degree relatives; Panel C shows simulated ascertainment projections with upper and lower bounds of the number of first-degree relationships; and Panel D shows simulated projections with upper and lower bounds of the proportion of the ascertained participants that have 1 or more first-degree relatives.
-
FIGS. 7A, 7B, 7C, and 7D show a simulated population and fit of a clustered ascertainment approach to the accumulation of first-degree relatedness in the expanded DiscovEHR cohort according to an exemplary embodiment of the disclosure. Panel A shows accumulation of pairs of first-degree relatives; Panel B shows the proportion of the ascertained participants that have one or more first-degree relatives; Panel C shows simulated ascertainment projections with upper and lower bounds of the number of first-degree relationships; and Panel D shows simulated projections with upper and lower bounds of the proportion of the ascertained participants that have 1 or more first-degree relatives.
-
FIGS. 8A, 8B, 8C, and 8D show a simulated population and fit of a clustered ascertainment approach to the accumulation of first- and second-degree relatedness in the DiscovEHR cohort according to an exemplary embodiment of the disclosure. Panel A shows accumulation of pairs of first- and second-degree relatives; Panel B shows the proportion of the ascertained participants that have one or more first- and second-degree relatives; Panel C shows simulated ascertainment projections with upper and lower bounds of the number of first- and second-degree relationships; and Panel D shows simulated projections with upper and lower bounds of the proportion of the ascertained participants that have one or more first- or second-degree relatives.
-
FIGS. 9A, 9B, 9C, and 9D show a simulated population and fit of a clustered ascertainment approach to the accumulation of first- and second-degree relatedness in the expanded DiscovEHR cohort according to an exemplary embodiment of the disclosure. Panel A shows accumulation of pairs of first- and second-degree relatives; Panel B shows the proportion of the ascertained participants that have one or more first- and second-degree relatives; Panel C shows simulated ascertainment projections with upper and lower bounds of the number of first- and second-degree relationships; and Panel D shows simulated projections with upper and lower bounds of the proportion of the ascertained participants that have one or more first- or second-degree relatives.
-
FIG. 10 shows some of the factors that drive the amount of relatedness in an ascertained dataset modeled according to an exemplary embodiment of the disclosure.
DETAILED DESCRIPTION
-
The term “a” should be understood to mean “at least one”; and the terms “about” and “approximately” should be understood to permit standard variation as would be understood by those of ordinary skill in the art; and where ranges are provided, endpoints are included.
-
Previous large-scale human genomic studies typically collected human samples across a number of different geographic areas and/or health care systems and combined them to generate cohorts for analysis. While the total number of individuals sampled in these cohorts was often high, the extent of relatedness and family structure in these cohorts tended to be relatively low. Many statistical methods commonly used in the context of genome analysis, including association analysis and principal component analysis, require that all samples be unrelated. Otherwise, the statistical outputs of these tests will be biased, resulting in inflated p-values and false positive findings (Kang et al. (2010), Nat. Genet. 42, 348-354; Sun and Dimitromanolakis (2012), Methods Mol. Biol. 850, 47-57; Devlin and Roeder (1999), Biometrics 55, 997-1004; and Voight and Pritchard (2005), PLoS Genet. 1, e32).
-
Removal of family structure from a dataset is a viable option if the dataset has only a handful of closely related samples (Lek et al. (2016), Nature 536, 285-291; Fuchsberger et al. (2016), Nature 536, 41-47; Locke et al. (2015), Nature 518, 197-206; and Surendran et al. (2016) Nat. Genet. 48, 1151-1161). Removal of family structure is also a possible option if the unrelated subset of the data is adequate for the statistical analysis, such as computing principal components (PCs) and then projecting the remaining samples onto these PCs (Dewey et al. (2016), Science 354, aaf6814-aaf6814). A number of methods exist to help investigators retain the maximally sized unrelated set of individuals (Staples et al. (2013), Genet. Epidemiol. 37, 136-141; Chang et al. (2015), Gigascience 4, 7). Unfortunately, removal of related individuals not only reduces the sample size, but also discards valuable relationship information. In fact, such a loss of information is unacceptable for many analyses if the dataset has even a moderate level of family structure.
-
Genetic relatedness between individuals plays an important role in many fields of genetics. In genetic analyses, knowledge of relatedness is used to estimate genetic parameters such as heritability and genetic correlation (Falconer and Mackay (1996) Introduction to Quantitative Genetics. Longmans Green, Harlow, Essex, UK). In evolutionary biology, knowledge of relatedness between interacting individuals is required to predict evolutionary consequences of social interaction (Hamilton (1964) J. Theor. Biol. 7, 17-52). In conservation genetics, knowledge of relatedness is required to optimize conservation strategies. Information about the relatedness of a population in a cohort can have important applications in many research areas in quantitative genetics, conservation genetics, forensics, evolution and ecology. Genetic relatedness among individuals is a continuum that manifests itself within a cohort in a variety of ways depending on the population and how individuals are sampled from it. The increase in relatedness within healthcare population-based genomic (HPG) studies has significant implications when choosing and executing downstream analyses and must be considered thoughtfully (Santorico et al. (2014) Genet. Epidemiol. 38 Suppl 1, S92-S96; Hu et al. (2014) Nat. Biotechnol. 32, 663-669; Price et al. (2010) Nat. Rev. Genet. 11, 459-463; Kang et al. (2010) Nat. Genet. 42, 348-354; Sun and Dimitromanolakis (2012) Methods Mol. Biol. 850, 47-57; Devlin and Roeder (1999) Biometrics 55, 997-1004; Voight and Pritchard (2005) PLoS Genet. 1, e32). Genetic data that leverage relatedness can be used to reconstruct pedigrees, phase compound heterozygous mutations (CHMs), and detect de novo mutations (DNMs). Further, the data can also be used to predict population growth and provide markers to indicate disease patterns among a population.
-
In order to analyze such data, a dataset comprising individuals with relatedness is desirable. Further, different statistical tools can be applied to a dataset based on the degree of relatedness among the individuals in the dataset. Whether existing datasets are used or new datasets are designed, their growing size requires constant innovation in bioinformatics tools and analysis pipelines to handle them efficiently.
-
When selecting or designing a dataset for a genetic, evolutionary, or census study, there exists no method or model that can predict the degree of relatedness in the cohort, and researchers are often unclear as to how much relatedness they should expect to see and whether the levels of relatedness would be similar to those seen in previous population-based genomic studies.
-
The disclosure is based, at least in part, on a prediction model of relatedness in a human population.
-
A prediction model of relatedness in a human population according to an exemplary embodiment of the disclosure can be used to simulate populations of millions of people dispersed across one or more sub-populations based on specified population parameters. The model progresses year-to-year, simulating couplings, births, separations, migrations, deaths, and/or movement between sub-populations based on specified parameters, generating realistic pedigree structures and populations that represent a wide variety of population-based studies, including HPG studies. The parameters can be easily customized to model different populations.
-
Exemplary embodiments of the disclosure also are based, at least in part, on a process of preparing a prediction model of relatedness in a human population to estimate the amount of relatedness researchers should expect to find for a given set of populations and sampling parameters. An example of a process for creating such a model is described in FIG. 1.
-
According to an exemplary embodiment of the disclosure, the process of preparing a prediction model of relatedness in a human population can comprise establishing a first population dataset as step 100. This first population dataset can be defined by the user.
-
In some exemplary embodiments, a burn-in phase of predetermined time is performed to establish a second population dataset as step 120. The burn-in phase can vary based on the study and can be selected by the user. In a specific exemplary embodiment, the burn-in phase can range from 90-200 years, including any desired value in the range. In another specific exemplary embodiment, the burn-in phase is about 120 years.
-
In some exemplary embodiments, the initial ages of individuals in the second population dataset can range from 0 to 49 years. The individuals in this second population dataset can be assigned to different pools, for example, juvenile, fertile/mating, or aged. For example, under the age of about 15, individuals can be assigned to the juvenile pool. Between the ages of 15 and 49, individuals can be assigned to the fertile/mating pool. Further, individuals can also be moved from the juvenile pool to the fertile/mating pool when they are above 15 and from the fertile/mating pool to the aged pool when they are above 49. The individuals in this dataset can be further removed if they emigrate or die. A user can select the age ranges that these pools comprise based on the demographics or history of the geographical area or ancestral class or any other parameter that can influence the groups. In the second population dataset, a user can further set a birth rate, death rate, in-migration rate, out-migration rate, coupling rate, separation rate, fertility start age, fertility end age, full-sibling rate, fertility by age, male mortality by age, female mortality by age, male coupling by age, and/or female coupling by age according to the demographics or history of the geographical area or ancestral class or any other parameters (with their corresponding values and rates) that can influence such rates. For example, twin rates, still-birth rates, abortion rates, rates of same sex marriage, adoption rates, and rates of polyamorous relationships can be used to set the parameters. Additionally, parameters can also be modeled on the basis of the geographical location of people within a population (e.g., where they live and work relative to each other) and geographical/social barriers that could increase or decrease the chance of mating (e.g., rivers, valleys, mountains, ancestral backgrounds, and neighborhoods). In some exemplary embodiments, the second population can have a birth rate of about 0.0219 or death rate of about 0.0095 or coupling rate of about 0.01168 or separation rate of about 0.0028 or full-sibling rate of about 0.88 or fertility start age of about 15 years or fertility end age of about 49 years or in-migration rate of about 0.01 or out-migration rate of about 0.021 or fertility by age (weighting vector for women ages 0 to 50) in the range of 0 to 1 or male mortality by age (weighting vector for men ages 0 to 120) in the range of 0 to 1 or female mortality by age (weighting vector for women ages 0 to 120) in the range of 0 to 1 or male coupling by age (weighting vector for men ages 0 to 50) in the range of 0 to 1 or female coupling by age (weighting vector for women ages 0 to 50) in the range of 0 to 1 or combinations thereof.
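-
For illustration, these population-level parameters can be collected into a single configuration object. The sketch below is a hypothetical Python representation; the key names are assumptions, while the numeric defaults are the exemplary values recited in the preceding paragraph. The age-specific weighting vectors are shown as zero-filled placeholders that a user would replace with demographic weights.

```python
# Hypothetical configuration holding the exemplary default rates recited above.
# Key names are assumptions; values match the exemplary embodiment in the text.
DEFAULT_PARAMS = {
    "birth_rate": 0.0219,            # births per person per year
    "death_rate": 0.0095,            # deaths per person per year
    "coupling_rate": 0.01168,        # marriages per person per year
    "separation_rate": 0.0028,       # divorces per person per year
    "full_sibling_rate": 0.88,       # proportion of births to married couples
    "fertility_start_age": 15,
    "fertility_end_age": 49,
    "in_migration_rate": 0.01,
    "out_migration_rate": 0.021,
    # Age-specific weighting vectors, each with entries in [0, 1]; the zeros
    # below are placeholders a user would replace with demographic weights.
    "fertility_by_age": [0.0] * 51,          # women, ages 0 to 50
    "male_mortality_by_age": [0.0] * 121,    # men, ages 0 to 120
    "female_mortality_by_age": [0.0] * 121,  # women, ages 0 to 120
    "male_coupling_by_age": [0.0] * 51,      # men, ages 0 to 50
    "female_coupling_by_age": [0.0] * 51,    # women, ages 0 to 50
}
```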
-
In some exemplary embodiments, the established second population dataset can be modified by moving individuals to an age pool as step 130 in accordance with the age of the individuals—juvenile, fertile/mating, or aged.
-
In some exemplary embodiments, the established second population can be further modified by choosing pairs of a single man and a single woman, being more distantly related than first cousins, at random from single men and single women in the second population dataset and letting them marry at specified marriage by age parameters as step 140. The pairs chosen to marry can be allowed to marry until the number of marriages specified by the set marriage rate parameters is reached. The user can select the marriage by age parameters based on the demographics or history of the geographical area or ancestral class or any other parameters (with their corresponding values and rates) that can influence such rates. For example, twin rates, still-birth rates, abortion rates, rates of same sex marriage, adoption rates, and rates of polyamorous relationships can be used to set the parameters. Additionally, parameters can also be modeled on the basis of the geographical location of people within a population (e.g., where they live and work relative to each other) and geographical/social barriers that could increase or decrease the chance of mating (e.g., rivers, valleys, mountains, ancestral backgrounds, and neighborhoods).
-
To further modify the second population dataset, the user can select the divorce rates and/or reproduction rate based on demographics or history of the geographical area or ancestral class or any other parameter that can influence such rates. In some exemplary embodiments, the established second population can be modified to allow married couples to divorce at a specified divorce rate as step 150. Pairs of a single man and a single woman, or married couples, can be chosen at random from the second population dataset in a specified ratio and allowed to reproduce according to specified fertility rates as step 160 until a target number of successful conceptions is reached. The parents can be restricted to being more distantly related than first cousins. Further, all individuals in the mating/fertile age pool of the second population dataset can be limited to having one child per year.
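-
The first-cousin restriction on prospective parents can be implemented, for example, by checking whether two individuals share any ancestor within two generations (self, parents, or grandparents); such an overlap exists exactly when the pair is related at the level of first cousins or closer. The sketch below is a hypothetical Python illustration of this check together with the one-child-per-year limit; the field and function names are assumptions and do not represent the actual implementation.

```python
# Hypothetical check that a prospective couple is more distantly related than
# first cousins, plus the one-child-per-year limit.  Person objects are assumed
# to expose .parents (tuple of Person, empty for founders).

def ancestors_within_two_generations(person):
    """Return {person, parents, grandparents}; founders simply return {person}."""
    ancestors = {person}
    for parent in person.parents:
        ancestors.add(parent)
        ancestors.update(parent.parents)
    return ancestors

def more_distant_than_first_cousins(a, b):
    # First cousins (or any closer relationship) share at least one ancestor
    # within two generations, so disjoint ancestor sets mean the pair is allowed.
    return ancestors_within_two_generations(a).isdisjoint(
        ancestors_within_two_generations(b))

def may_conceive_this_year(mother, father, year):
    # Parents must be more distantly related than first cousins, and each
    # individual is limited to one child per year (last_birth_year is assumed).
    return (more_distant_than_first_cousins(mother, father)
            and getattr(mother, "last_birth_year", None) != year
            and getattr(father, "last_birth_year", None) != year)
```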
-
Further, the second population dataset can be modified by setting a death rate and/or migration based on the demographics or history of the geographical area or ancestral class or any other parameters (with their corresponding values and rates) that can influence such rates.
-
In some exemplary embodiments, the individuals in the second population dataset established can be allowed to pass away at a specified death rate and at specified mortality by age parameters as step 170. Further, the individuals in the second population dataset can also be allowed to migrate to and from the second population dataset as step 180. Such a migration however can maintain the population's age and sex distributions and the proportion of married fertile aged individuals in the second population dataset.
-
In some exemplary embodiments, the individuals in the established second population can be allowed to move within the second population dataset as step 190, whereby individuals from a sub-population are selected at random and assigned at random to another sub-population.
-
In some exemplary embodiments, one or more steps of mating, marriage, divorce, reproduction, emigration, death, or moving from one subpopulation to another in the second population dataset can be repeated (as step 200) at one-year intervals for a pre-determined number of years by applying the steps to the population dataset resulting from the previous iteration.
-
This framework is flexible enough to apply to modeling shallower ascertainment of more transient populations. Based on the first population dataset, parameters for the second population dataset can be modified to customize the prediction model for any particular geographic area or sub-population.
-
In some embodiments, the prediction model can ascertain individuals randomly from a population. Random ascertainment gives each individual in the population an equal chance of being ascertained without replacement.
-
In some exemplary embodiments, the prediction model can ascertain individuals in a clustered fashion from a population. Clustered sampling can enrich for close relatives by selecting an individual at random along with a number of their first- and second-degree relatives.
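-
By way of illustration only, the two ascertainment modes can be sketched as follows in Python. The function names and the relative-lookup helper are assumptions; individuals are represented by unique identifiers, and the default Poisson lambdas shown correspond to the exemplary values described in Example 4 (0.2 for first-degree relatives and 0.03 for second-degree relatives).

```python
# Hypothetical sketch of random versus clustered ascertainment from a simulated
# population.  `person_ids` is a list of unique IDs; `relatives_of(pid, degree)`
# stands in for a lookup against the simulated pedigree.
import random
import numpy as np

def random_ascertainment(person_ids, n):
    """Equal chance for every individual, sampled without replacement."""
    return random.sample(person_ids, n)

def clustered_ascertainment(person_ids, n, relatives_of, lam1=0.2, lam2=0.03):
    """Ascertain a random proband plus a Poisson-distributed number of that
    person's first- and second-degree relatives, until n people are chosen."""
    remaining = set(person_ids)
    ascertained = []
    while remaining and len(ascertained) < n:
        proband = random.choice(tuple(remaining))
        cluster = [proband]
        for degree, lam in ((1, lam1), (2, lam2)):
            k = np.random.poisson(lam)
            candidates = [r for r in relatives_of(proband, degree)
                          if r in remaining and r not in cluster]
            cluster.extend(random.sample(candidates, min(k, len(candidates))))
        for pid in cluster:
            remaining.discard(pid)
            ascertained.append(pid)
    return ascertained[:n]
```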
-
Any of the methods described or exemplified by the disclosure may be practiced as a computer-implemented method and/or as a system. Any suitable computer system known by the person having ordinary skill in the art may be used for this purpose.
-
FIG. 2 illustrates various aspects of an exemplary environment 201 in which the present methods and systems can operate. The present methods may be used in various types of networks and systems that employ both digital and analog equipment. Provided herein is a functional description, and the respective functions can be performed by software, hardware, or a combination of software and hardware.
-
The environment 201 can comprise a Local Data/Processing Center 210. The Local Data/Processing Center 210 can comprise one or more networks, such as local area networks, to facilitate communication between one or more computing devices. The one or more computing devices can be used to store, process, analyze, output, and/or visualize biological data. The environment 201 can, optionally, comprise a Medical Data Provider 220. The Medical Data Provider 220 can comprise one or more sources of biological data. For example, the Medical Data Provider 220 can comprise one or more health systems with access to medical information for one or more patients. The medical information can comprise, for example, medical history, medical professional observations and remarks, laboratory reports, diagnoses, doctors' orders, prescriptions, vital signs, fluid balance, respiratory function, blood parameters, electrocardiograms, x-rays, CT scans, MRI data, laboratory test results, diagnoses, prognoses, evaluations, admission and discharge notes, and patient registration information. The Medical Data Provider 220 can comprise one or more networks, such as local area networks, to facilitate communication between one or more computing devices. The one or more computing devices can be used to store, process, analyze, output, and/or visualize medical information. The Medical Data Provider 220 can de-identify the medical information and provide the de-identified medical information to the Local Data/Processing Center 210. The de-identified medical information can comprise a unique identifier for each patient so as to distinguish medical information of one patient from another patient, while maintaining the medical information in a de-identified state. The de-identified medical information prevents a patient's identity from being connected with his or her particular medical information. The Local Data/Processing Center 210 can analyze the de-identified medical information to assign one or more phenotypes to each patient (for example, by assigning International Classification of Diseases “ICD” and/or Current Procedural Terminology “CPT” codes).
-
The environment 201 can comprise a NGS Sequencing Facility 230. The NGS Sequencing Facility 230 can comprise one or more sequencers (e.g., Illumina HiSeq 2500, Pacific Biosciences PacBio RS II, and the like). The one or more sequencers can be configured for exome sequencing, whole exome sequencing, RNA-seq, whole-genome sequencing, targeted sequencing, and the like. In an exemplary aspect, the Medical Data Provider 220 can provide biological samples from the patients associated with the de-identified medical information. The unique identifier can be used to maintain an association between a biological sample and the de-identified medical information that corresponds to the biological sample. The NGS Sequencing Facility 230 can sequence each patient's exome based on the biological sample. To store biological samples prior to sequencing, the NGS Sequencing Facility 230 can comprise a biobank (for example, from Liconic Instruments). Biological samples can be received in tubes (each tube associated with a patient), each tube can comprise a barcode (or other identifier) that can be scanned to automatically log the samples into the Local Data/Processing Center 210. The NGS Sequencing Facility 230 can comprise one or more robots for use in one or more phases of sequencing to ensure uniform data and effectively non-stop operation. The NGS Sequencing Facility 230 can thus sequence tens of thousands of exomes per year. In one aspect, the NGS Sequencing Facility 230 has the functional capacity to sequence at least 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 11,000 or 12,000 whole exomes per month.
-
The biological data (e.g., raw sequencing data) generated by the NGS Sequencing Facility 230 can be transferred to the Local Data/Processing Center 210, which can then transfer the biological data to a Remote Data/Processing Center 240. The Remote Data/Processing Center 240 can comprise a cloud-based data storage and processing center comprising one or more computing devices. The Local Data/Processing Center 210 and the NGS Sequencing Facility 230 can communicate data to and from the Remote Data/Processing Center 240 directly via one or more high capacity fiber lines, although other data communication systems are contemplated (e.g., the Internet). In an exemplary aspect, the Remote Data/Processing Center 240 can comprise a third party system, for example Amazon Web Services (DNAnexus). The Remote Data/Processing Center 240 can facilitate the automation of analysis steps and allows data to be shared with one or more Collaborators 250 in a secure manner. Upon receiving biological data from the Local Data/Processing Center 210, the Remote Data/Processing Center 240 can perform an automated series of pipeline steps for primary and secondary data analysis using bioinformatic tools, resulting in annotated variant files for each sample. Results from such data analysis (e.g., genotype) can be communicated back to the Local Data/Processing Center 210 and, for example, integrated into a Laboratory Information Management System (LIMS) that can be configured to maintain the status of each biological sample.
-
The Local Data/Processing Center 210 can then utilize the biological data (e.g., genotype) obtained via the NGS Sequencing Facility 230 and the Remote Data/Processing Center 240 in combination with the de-identified medical information (including identified phenotypes) to identify associations between genotypes and phenotypes. For example, the Local Data/Processing Center 210 can apply a phenotype-first approach, where a phenotype is defined that may have therapeutic potential in a certain disease area, for example extremes of blood lipids for cardiovascular disease. Another example is the study of obese patients to identify individuals who appear to be protected from the typical range of comorbidities. Another approach is to start with a genotype and a hypothesis, for example that gene X is involved in causing, or protecting from, disease Y.
-
In an exemplary aspect, the one or more Collaborators 250 can access some or all of the biological data and/or the de-identified medical information via a network such as the Internet 260.
-
In an exemplary aspect, illustrated in FIG. 3, one or more of the Local Data/Processing Center 210 and/or the Remote Data/Processing Center 240 can comprise one or more computing devices that comprise one or more of a genetic data component 300, a phenotypic data component 310, a genetic variant-phenotype association data component 320, and/or a data analysis component 330. The genetic data component 300, the phenotypic data component 310, and/or the genetic variant-phenotype association data component 320 can be configured for one or more of, a quality assessment of sequence data, read alignment to a reference genome, variant identification, annotation of variants, phenotype identification, variant-phenotype association identification, data visualization, combinations thereof, and the like.
-
The consecutive labeling of method steps as provided herein with numbers and/or letters is not meant to limit the method or any embodiments thereof to the particular indicated order.
-
Various publications, including patents, patent applications, published patent applications, accession numbers, technical articles and scholarly articles are cited throughout the specification. Each of these cited references is incorporated by reference, in its entirety and for all purposes, herein.
-
The disclosure will be more fully understood by reference to the following Examples, which are provided to describe the disclosure in greater detail. They are intended to illustrate and should not be construed as limiting the scope of the disclosure.
EXAMPLES
Example 1
Patients and Samples
-
Two sets of data were collected by applying the prediction model to cohorts—(A) the DiscovEHR cohort with exomes of 61,720 de-identified patients and (B) the expanded DiscovEHR cohort with exomes of 92,455 de-identified patients.
-
All of the de-identified patient-participants in both cohorts obtained from the Geisinger Health System (GHS) were sequenced. All participants consented to participate in the MyCode® Community Health Initiative (Carey et al. (2016), Genet. Med. 18, 906-913) and contributed DNA samples for genomic analysis in the Regeneron-GHS DiscovEHR Study (Dewey et al. (2016), Science 354, aaf6814-aaf6814). All patients had their exomes linked to a corresponding de-identified electronic health record (EHR). A more detailed description of the first 50,726 sequenced individuals has been previously published (Dewey et al. (2016), Science 354, aaf6814-aaf6814; Abul-Husn et al. (2016), Science 354, aaf7000-aaf7000).
-
The study did not specifically target families for participation, but it was enriched for adults with chronic health problems who interact frequently with the healthcare system, as well as for participants from the Coronary Catheterization Laboratory and the Bariatric Service.
Example 2
Simulations with SimProgeny and Relatedness Projections
-
In an attempt to model, understand, and predict the growth of the relationship networks in the DiscovEHR and the expanded DiscovEHR dataset, a simulation framework (hereinafter “SimProgeny”) was developed, which could simulate lineages of millions of people over hundreds of years dispersed across multiple sub-populations. From these simulated populations, it can model various sampling approaches, and estimate the amount of relatedness researchers should expect to find for a given set of populations and sampling parameters (See Example 6).
-
SimProgeny was used to simulate the DiscovEHR and the expanded DiscovEHR population and the ascertainment of the first 61K and first 92K participants from them, respectively. The simulations show that DiscovEHR and the expanded DiscovEHR participants were not randomly sampled from the population, but rather that the dataset was enriched for close relatives. As shown in FIGS. 4A and 4B, the real data were calculated at periodic “freezes” indicated with the punctuation points connected by the faint line. Samples and relationships identified in the 61K-person freeze were also taken, and the ascertainment order was then shuffled, to demonstrate that the first half of the 61K DiscovEHR participants were enriched for first-degree relationships relative to the second half. Populations of various sizes were simulated using parameters similar to the real population from which DiscovEHR was ascertained. Random ascertainment from each of these populations was then performed to see which population size most closely fit the real data. A key takeaway is that none of these population sizes fit the real data, and the random ascertainment approach is a poor fit. A different ascertainment approach that enriches for first-degree relatives compared to random ascertainment could produce a better fit. FIG. 4A shows that an ascertainment of first-degree relative pairs in an effective sampling population of size 270K closely fit the shuffled version of the real data, but underestimates the number of relative pairs below 61K ascertained participants and dramatically overestimates the number of relative pairs above 61K participants. FIG. 4B shows that a population of 270K most closely fits the shuffled real data with respect to the number of individuals with one or more first-degree relatives, but is a poor fit to the real data.
-
A similar result was observed using the expanded DiscovEHR dataset (FIG. 5A and FIG. 5B). Samples and relationships identified in the 92K-person freeze were then shuffled to demonstrate that the first half of the 92K expanded DiscovEHR participants were enriched for first-degree relationships relative to the second half. Random ascertainment from each of these populations was then performed to see which population size most closely fit the real data. FIG. 5A shows that an ascertainment of first-degree relative pairs in an effective sampling population of size 403K closely fit the shuffled version of the real data, but underestimates the number of relative pairs below 92K ascertained participants and dramatically overestimates the number of relative pairs above 92K participants. FIG. 5B shows that a population of 403K most closely fits the shuffled real data with respect to the number of individuals with one or more first-degree relatives, but is a poor fit to the real data.
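-
The quantities compared in these figures can be computed directly from an ascertainment order and a list of first-degree relative pairs. The following Python sketch is illustrative only; the input representation (IDs in sequencing order, and relative pairs as two-element frozensets) is an assumption. For each prefix of the ascertainment order, it returns the cumulative number of first-degree pairs with both members ascertained and the number of ascertained individuals having at least one ascertained first-degree relative.

```python
# Hypothetical helper for computing the two accumulation curves: pairs of
# ascertained first-degree relatives, and individuals with >= 1 such relative.

def relatedness_accumulation(ascertainment_order, first_degree_pairs):
    # Index each pair by its members for quick lookup of a newcomer's relatives.
    by_person = {}
    for pair in first_degree_pairs:          # each pair: frozenset of two IDs
        for pid in pair:
            by_person.setdefault(pid, []).append(pair)

    ascertained, with_relative = set(), set()
    pair_count = 0
    pair_counts, individual_counts = [], []
    for pid in ascertainment_order:
        ascertained.add(pid)
        for pair in by_person.get(pid, []):
            other = next(iter(pair - {pid}))
            if other in ascertained:
                pair_count += 1
                with_relative.update((pid, other))
        pair_counts.append(pair_count)
        individual_counts.append(len(with_relative))
    return pair_counts, individual_counts
```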
-
The enrichment of close relatives was modeled by using a clustered ascertainment approach (See Example 6) that produced simulations that better fit the real data for the DiscovEHR (FIG. 6A and FIG. 6B) and the expanded DiscovEHR (FIG. 7A and FIG. 7B). For both FIG. 6 and FIG. 7, the real data was calculated at periodic “freezes” indicated with the punctuation points connected by the faint line. Most simulation parameters were set based on information about the real population demographics and the DiscovEHR ascertainment approach. However, two parameters were unavailable, and were thus inferred based on fit to the real data: 1) the effective population size from which samples were ascertained and 2) the increased chance that someone is ascertained given that a first-degree relative was previously ascertained, which is referred to as “clustered ascertainment”. All panels in FIG. 6 and FIG. 7 show the same three simulated population sizes spanning the estimated effective population size. Clustered ascertainment was simulated by randomly ascertaining an individual along with a Poisson-distributed random number of first-degree relatives (distributions' lambdas are indicated in the legends). These simulation results suggest that the effective sampling population size was ˜475K individuals and that a Poisson distribution with a lambda of 0.2 most closely matched the enrichment of first-degree relatives. This was consistent with the understanding that the majority of the current participants reside in a certain local geographical area, such as the Danville, Pa. area (˜500K individuals) in this example, rather than being evenly distributed across the entire GHS catchment area (>2.5 million individuals).
-
After simulation parameters were identified that reasonably fit the real data, SimProgeny was used to obtain a projection of the amount of first degree relationships that should be expected as the DiscovEHR and the expanded DiscovEHR study expands to the goal of 250K participants. The results indicated that if ascertainment of participants continued in the same way, obtaining ˜150K first-degree relationships should be expected for DiscovEHR (FIG. 6C) and expanded DiscovEHR (FIG. 7C), involving ˜60% of DiscovEHR participants (FIG. 6D) and involving ˜60% of the expanded DiscovEHR participants (FIG. 7D).
-
The simulation analysis was then expanded to include second-degree relationships, and the simulation results suggested that with 250K participants, well over 200K combined first- and second-degree relationships involving over 70% of the individuals in DiscovEHR (FIG. 8) and expanded DiscovEHR (FIG. 9) should be expected. For this analysis, the real data was calculated at periodic “freezes” indicated with the punctuation points connected by the faint line in the figures. Most simulation parameters were set based on information about the real population demographics and the DiscovEHR ascertainment approach. However, two parameters were unknown and selected based on fit to the real data: 1) the effective population size from which samples were ascertained and 2) the increased chance that someone is ascertained given that a first- or second-degree relative was previously ascertained, which is referred to as “clustered ascertainment”. All panels in FIG. 8 and FIG. 9 show the same three simulated population sizes spanning the estimated effective population size. Clustered ascertainment was simulated by randomly ascertaining an individual along with a Poisson-distributed random number of first-degree relatives and a separate Poisson-distributed random number of second-degree relatives (both Poisson distributions have lambdas indicated in the figure legends).
-
The simulation results demonstrated a clear enrichment of relatedness in the DiscovEHR HPG study as well as provided key insights into the tremendous amount of relatedness expected to be seen as ascertainment of additional participants was continued.
Example 3
Sample Preparation, Sequencing, Variant Calling, and Sample QC
-
Data sample preparation and sequencing have been previously described in Dewey et al. (Dewey et al. (2016), Science 354, aaf6814-aaf6814).
-
Upon completion of sequencing, raw data from each Illumina HiSeq 2500 run was gathered in local buffer storage and uploaded to the DNAnexus platform (Reid et al. (2014) BMC Bioinformatics 15, 30) for automated analysis. Sample-level read files were generated with CASAVA (Illumina Inc., San Diego, Calif.) and aligned to GRCh38 with BWA-mem (Li and Durbin (2009); Bioinformatics 25, 1754-1760; Li, H. (2013); arXiv q-bio.GN). The resultant BAM files were processed using GATK and Picard to sort, mark duplicates, and perform local realignment of reads around putative indels. Sequenced variants were annotated with snpEff (Cingolani et al. (2012); Fly (Austin) 6, 80-92) using Ensembl85 gene definitions to determine the functional impact on transcripts and genes. The gene definitions were restricted to 54,214 transcripts that are protein-coding with an annotated start and stop, corresponding to 19,467 genes.
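-
By way of a hedged illustration, the alignment and duplicate-marking steps described above can be chained together as in the following Python sketch. The file names, reference path, and tool invocations shown are generic assumptions rather than the exact commands, versions, or flags used in the DiscovEHR pipeline; the indel-realignment, variant-calling, and annotation steps are omitted for brevity.

```python
# Illustrative wrapper around generic bwa/samtools/Picard invocations; paths and
# file naming are assumptions, not the production pipeline configuration.
import subprocess

REF = "GRCh38.fa"  # assumed local path to the reference genome

def align_and_dedup(sample, fq1, fq2):
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    # Align paired-end reads with BWA-MEM (SAM written to stdout).
    with open(sam, "w") as out:
        subprocess.run(["bwa", "mem", REF, fq1, fq2], stdout=out, check=True)
    # Coordinate-sort and index with samtools.
    subprocess.run(["samtools", "sort", "-o", sorted_bam, sam], check=True)
    subprocess.run(["samtools", "index", sorted_bam], check=True)
    # Mark duplicates with Picard (classic I=/O=/M= argument style).
    subprocess.run(["java", "-jar", "picard.jar", "MarkDuplicates",
                    f"I={sorted_bam}", f"O={dedup_bam}",
                    f"M={sample}.dup_metrics.txt"], check=True)
    return dedup_bam
```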
-
Individuals with low-quality DNA sequence data indicated by high rates of homozygosity, low sequence data coverage, or genetically-identified duplicates that could not be verified to be real monozygotic twins were excluded; 61,019 exomes remained for analysis. Additional information on sample prep, sequencing, variant calling, and variant annotation is reported in Dewey et al. (2016), Science 354, aaf6814-1 to aaf6814-10.
Example 4
SimProgeny
-
SimProgeny was developed to simulate a large population as well as a variety of sample ascertainment methods from that population. SimProgeny can simulate populations of millions of people dispersed across one or more sub-populations and track their descendants over hundreds of years. To find a good balance between simplistic and realistic, several key population-level parameters were selected that can be adjusted by the user (See Table 1 below). These parameters were selected to provide a good approximation of a real population and familial pedigree structures while keeping the simulation tool relatively simple. Default values are based on US population statistics (US average birth rate from 1960: Department of Health and Human Services, National Center for Health Statistics; US average death rate from 1960: National Center for Health Statistics, U.S. Census Bureau; US average marriage rate from 1960: 100 years of marriage and divorce statistics United States, 1867-1967; US average divorce rate from 1960: 100 years of marriage and divorce statistics United States, 1867-1967; in- and out-migration rates for PA from 2000 reflecting both rural and urban migration; US fertility rates from 1970: Hamilton, B. E., Martin, J. A., Osterman, M. J. K., Curtin, S. C., & Mathews, T. J. (2015), Births: Final data for 2014. National Vital Statistics Reports, 64(12), and Hyattsville, Md.: National Center for Health Statistics; female mortality rates from 2005; death rates postcensal estimates based on the 2000 census, estimated as of Jul. 1, 2005; and male and female marriage rates by age from 2009). The default values are set to work for various cohorts, and these parameters could be easily customized to model different populations by modifying the configuration file included with the SimProgeny code (web resource). See Example 6 for a detailed description of the population simulation process.
-
TABLE 1
Simulation parameters and default values used in SimProgeny
Parameter | Description | Default value
Birth rate | Births per person per year | 0.0219
Death rate | Deaths per person per year | 0.0095
Marriage rate | Marriages per person per year | 0.01168
Divorce rate | Divorces per person per year | 0.0028
Full-sibling rate | Proportion of births to married couples | 0.88
Fertility start | Youngest age an individual can reproduce | 15
Fertility end | Oldest age an individual can reproduce | 49 or 50
In-migration rate | Proportion of the population in-migrating each year | 0.01
Out-migration rate | Proportion of the population out-migrating each year | 0.021
Fertility by age | Weighting vector for women ages 0 to 50 | ranges 0 to 1
Male mortality by age | Weighting vector for men ages 0 to 120 | ranges 0 to 1
Female mortality by age | Weighting vector for women ages 0 to 120 | ranges 0 to 1
Male marriage by age | Weighting vector for men ages 0 to 50 | ranges 0 to 1
Female marriage by age | Weighting vector for women ages 0 to 50 | ranges 0 to 1
For the framework developed for the DiscovEHR cohort, the fertility end was 49 years and for the framework developed for the expanded DiscovEHR cohort, the fertility end was 50 years.
-
In addition to modeling populations, SimProgeny simulates two ascertainment approaches to model selecting individuals from a population for a genetic study: random ascertainment and clustered sampling. Random ascertainment gives each individual in the population an equal chance of being ascertained without replacement. Clustered sampling is an approach to enrich for close relatives, and can be done by selecting an individual at random along with a number of their first- and second-degree relatives. The number of first-degree relatives is determined by sampling a value from a Poisson distribution with a user-specified first-degree ascertainment lambda (default is 0.2). The number of second-degree relatives is determined in the same way, and the default second-degree ascertainment lambda is 0.03. See Example 6 for additional information on SimProgeny's ascertainment options.
Example 5
Simulation of the Underlying DiscovEHR Population and its Ascertainment
-
In an effort to not overcomplicate the simulation model, the simulations contained individual populations with starting sizes of 200K, 300K, 400K, 450K, 500K, 550K, 600K, and 1000K. SimProgeny parameters (See Table 1 above) were tuned with publicly available country-, state-, and county-level data. The immigration and emigration rates were reduced from the Pennsylvania (PA) average since GHS primarily serves rural areas, which tend to have lower migration rates than more urban areas. Simulations were run with a burn-in period of 120 years and then progressed for 101 years. Simulated populations grew by ˜15%, which is similar to the growth of PA since the mid-20th century.
-
Both random and clustered ascertainment were performed. For both ascertainment approaches, the ascertainment order of the first 5% of the population (specified with the ordered_sampling_proportion parameter) was shuffled in order to model the random sequencing order of the individuals in the GHS biobank at the beginning of the collaboration. While the selection of this parameter has no effect on random ascertainment and a negligible effect on the accumulation of pairwise relationships in clustered ascertainment, it does affect the proportion of individuals with one or more relatives in the dataset that were ascertained with clustered sampling by causing an inflection point, which is more pronounced with higher lambda values. This inflection point could be made less pronounced by modeling the freeze process of the real data or modeling a smoother transition between sequencing samples from the GHS biobank and newly ascertained individuals.
Example 6
SimProgeny Population and Ascertainment Simulation Process
-
The simulation began by initializing the user-specified number of sub-populations and sizes. Ages were initially assigned between zero and the maximum fertile age (default of 49). Individuals in a population resided in one of three age-based pools: juveniles, fertile, or aged. Individuals were assigned to the sub-population's juvenile pool if they were under the minimum fertility age (default of 15) or assigned to the sub-population's mating pool if within the fertility age range (15 to 49 by default). Individuals were moved from the juvenile pool to the mating pool as they aged above the minimum fertile age. Similarly, they were moved from the mating pool to the aged pool once they aged beyond the maximum fertile age. Individuals were removed from all age pools if they emigrated or passed away. After establishing an initial population, the simulation performed a burn-in phase of 120 years to establish family relationships and an age distribution that more closely matched the input parameters while requiring equal numbers of births and deaths and a net migration rate of zero. After burn-in, the simulations ran for a specified number of years with the provided population growth and migration rates. The simulations progressed at one-year increments, and each year had the following steps that were performed within each sub-population, unless otherwise stated:
-
- 1. Age—move individuals who have aged out of their age pool to the next age pool.
- 2. Court—simulate a single man and a single woman entering into a monogamous marriage. This process is important for obtaining a realistic number of full-sibling relationships. Pairs of men and women are chosen at random from the pool of single reproductive-aged males and females, and they successfully marry based on their chances of getting married at their respective ages, which are specified by the male and female “marriage by age” parameters. Pairs are drawn until the number of successful marriages defined by the marriage rate is reached. Couples are restricted to being more distantly related than first cousins. During the burn-in phase, the marriage rate is doubled until the user-specified initial marriage rate is reached (default is 66% of the fertile pool being married).
- 3. Split—simulate a married man and woman dissolving their marriage at the specified divorce rate. Married couples are chosen at random, and both individuals are marked as single.
- 4. Mingle—simulate all of the reproduction that may take place within a population in one year. Mother/father pairs are chosen at random from either the single reproductive-aged pool or the married pool, in a ratio defined by the full-sibling rate (default is 88% of all births being to married couples). Pairs are drawn and reproduction attempts are made until the target number of successful conceptions is reached (default birth rate is 0.0219 births per person). The chance that a successful conception occurs is based on the prospective mother's age and the corresponding fertility rate. Parents are restricted to being more distantly related than first cousins, and all individuals are limited to having one child per year.
- 5. Cull—simulate individuals passing away. The death rate (default is 0.0095 deaths per person) is used to determine the expected number of deaths within the population in a given year. The male and female mortality-by-age parameters are used to weight the chance that a randomly selected individual will pass away. If a random number drawn between 0 and 1 exceeds the person's probability of dying at his/her age, then the individual is retained and another individual is selected. Individuals selected to pass away are added to the departed pool and removed from any other pools of the living. All individuals who are older than 120 are automatically added to the departed pool and count towards the target number of deaths for the year.
- 6. Migrate—simulate migration to and from the population. Emigration is performed by randomly selecting an individual from the population and removing him/her, along with his/her spouse if married. The proportion of juvenile and aged individuals leaving is recorded, along with the number of fertile-aged married couples. Immigration is performed so as to maintain the age distributions and the number of fertile-aged married couples. First, a juvenile is randomly selected from the existing population and a new individual of the same sex and age is added to the juvenile pool; this process is repeated until the appropriate proportion of juveniles has been added. Because the proportion of juveniles among the individuals removed during emigration is recorded, the same proportion of juveniles is added back among the immigrants. For example, if 100 people (including 20 juveniles) were removed, and only 10 people were added, then 2 of those 10 people would be juveniles. The same process is repeated for aged individuals. Next, two fertile-aged individuals are selected from the existing population, and two new individuals are added with corresponding ages; one is assigned to be male and the other female, and the two immigrants are then married. This step is repeated until the number of married couples has been replenished. Finally, fertile-aged individuals are added by the same process used to add new juveniles, and this is repeated until the target number of immigrants is achieved. This process helps maintain the population's age and sex distributions as well as the proportion of married fertile-aged individuals.
- 7. Transplant—simulate people moving between sub-populations. To simulate the lack of genetic isolation between sub-populations, individuals can move between sub-populations within the overall population. A single rate of movement is used across the entire population. Individuals from a sub-population are selected at random and assigned at random to one of the other sub-populations until the desired number of transplants is achieved. This step does not occur if there is only one sub-population or if the transplant rate is 0 (by default, 1% of the overall population transplants each year). The simulation then progresses for the specified length of time, keeping track of each founder and his/her descendants.
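The numbered steps above can be summarized as a single yearly update. The following condensed sketch assumes a Population object exposing age pools and helper methods; the method names on pop and the keys in params are illustrative placeholders rather than SimProgeny's actual interface, and only the numeric defaults referenced in the comments come from the text.

```python
import random

def simulate_year(pop, params, rng):
    """One simulated year for a single sub-population (illustrative sketch only)."""
    # 1. Age: everyone ages one year; pool membership is updated accordingly.
    for ind in pop.individuals():
        ind.age += 1
    pop.update_age_pools(params["min_fertile_age"], params["max_fertile_age"])

    # 2-3. Court and Split: form and dissolve marriages at the configured rates.
    pop.marry(params["marriage_rate"], params["marriage_by_age"], rng)
    pop.divorce(params["divorce_rate"], rng)

    # 4. Mingle: draw parent pairs until the target number of conceptions is met
    #    (default birth rate 0.0219 births per person per year).
    target_births = round(params["birth_rate"] * pop.size())
    births = 0
    while births < target_births:
        mother, father = pop.draw_parents(params["full_sibling_rate"], rng)
        if pop.closer_than_first_cousins(mother, father) or mother.gave_birth_this_year:
            continue
        if rng.random() < params["fertility_by_age"][mother.age]:
            pop.add_child(mother, father)
            births += 1

    # 5. Cull: rejection-sample deaths weighted by the mortality-by-age parameters
    #    (default death rate 0.0095 deaths per person per year).
    target_deaths = round(params["death_rate"] * pop.size())
    deaths = 0
    while deaths < target_deaths:
        ind = pop.random_living(rng)
        if ind.age > 120 or rng.random() < params["mortality_by_age"][ind.sex][ind.age]:
            pop.bury(ind)
            deaths += 1

    # 6-7. Migrate and Transplant would follow the same pattern: remove emigrants,
    # add immigrants that preserve the age/sex/marriage structure, and reassign a
    # fixed fraction of individuals to other sub-populations.
```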
-
[82] Both random and clustered ascertainment were performed. For both ascertainment approaches, the ascertainment order of the first 5% of the population (specified with the ordered_sampling_proportion parameter) was shuffled in order to model the random sequencing order of the individuals in the GHS biobank at the beginning of the collaboration. While the selection of this parameter had no effect on random ascertainment and a negligible effect on the accumulation of pairwise relationships in clustered ascertainment, it did affect the proportion of individuals with one or more relatives in the dataset that were ascertained with clustered sampling by causing an inflection point, which was more pronounced with higher lambda values. This inflection point could have been made less pronounced by modeling the freeze process of the real data or by modeling a smoother transition between sequencing samples from the biobank and newly ascertained individuals. Users could specify the sub-population ascertainment order in cases where they wanted to simulate ascertaining from one or more sub-populations before moving on to the next set of sub-populations; the default was to initially group all sub-populations and ascertain from them as if they were a single population. Users could also specify the initial proportion of a population that was ascertained before moving on to other sub-populations or to the overall population. The program output the entire population in ped file format, the list of ascertained samples in the order in which they were ascertained, and several results files summarizing useful population and ascertainment statistics.
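A brief sketch of the ordered_sampling_proportion behavior described above is shown below; the parameter name comes from the text, while the surrounding function and its defaults are illustrative assumptions.

```python
import random

def apply_ordered_sampling_proportion(ascertainment_order,
                                      ordered_sampling_proportion=0.05,
                                      seed=None):
    """Shuffle the first fraction of the ascertainment order to mimic the random
    sequencing order of samples already banked at the start of the collaboration;
    the remainder keeps its clustered (or random) ascertainment order."""
    rng = random.Random(seed)
    n_shuffled = int(len(ascertainment_order) * ordered_sampling_proportion)
    head = list(ascertainment_order[:n_shuffled])
    rng.shuffle(head)
    return head + list(ascertainment_order[n_shuffled:])
```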
-
Such a forward-simulation framework (SimProgeny) can be used to simulate a wide variety of populations, including a population served by a healthcare system such as GHS (as exemplified above). It can also simulate the sample ascertainment used by HPG studies. Several factors can drive the amount of relatedness in an ascertained dataset (FIG. 10).
-
Further, such a model can simulate populations of millions of people dispersed across one or more sub-populations on the basis of user-specified population parameters (see Table 1 above). Progressing year by year, the simulation creates couplings, births, separations, migrations, deaths, and movement between sub-populations on the basis of the specified parameters. This process generates realistic pedigree structures and populations that are representative of a wide variety of HPG studies. The default values have been tuned so that the simulated population models the DiscovEHR cohort and the expanded DiscovEHR cohort, but these parameters can easily be customized to model different populations by modifying the configuration file included with the SimProgeny code.
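The configuration file itself is not reproduced here; the fragment below is a hypothetical illustration, expressed as a Python dictionary, of how the defaults quoted in Examples 5 and 6 might be collected in one place. The names and structure are assumptions rather than the actual file shipped with SimProgeny.

```python
# Hypothetical parameter set mirroring the defaults quoted in Examples 5 and 6;
# the actual SimProgeny configuration file may use different names and format.
simprogeny_params = {
    "sub_population_sizes": [200_000],        # one of the starting sizes simulated above
    "burn_in_years": 120,
    "simulation_years": 101,
    "min_fertile_age": 15,
    "max_fertile_age": 49,
    "initial_marriage_rate": 0.66,            # 66% of the fertile pool married
    "full_sibling_rate": 0.88,                # 88% of births to married couples
    "birth_rate": 0.0219,                     # births per person per year
    "death_rate": 0.0095,                     # deaths per person per year
    "transplant_rate": 0.01,                  # 1% move between sub-populations yearly
    "ordered_sampling_proportion": 0.05,      # first 5% of ascertainment order shuffled
    "first_degree_ascertainment_lambda": 0.2,
    "second_degree_ascertainment_lambda": 0.03,
}
```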