US20210020266A1 - Phase-aware determination of identity-by-descent dna segments - Google Patents

Info

Publication number
US20210020266A1
Authority
US
United States
Prior art keywords
ibd
haplotype
segments
sites
potential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/947,107
Other languages
English (en)
Inventor
William A. Freyman
Kimberly F. McManus
Suyash S. Shringarpure
Ethan M. Jewett
Adam Auton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
23andMe Inc
Original Assignee
23andMe Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 23andMe Inc
Priority to US16/947,107
Assigned to 23ANDME, INC. (assignment of assignors' interest; see document for details). Assignors: FREYMAN, WILLIAM A.; JEWETT, ETHAN M.; MCMANUS, KIMBERLY F.; AUTON, ADAM; SHRINGARPURE, SUYASH S.
Publication of US20210020266A1
Priority to US17/249,520 (US20210193257A1)
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6827 Hybridisation assays for detection of mutation or polymorphism

Definitions

  • IBD estimates can best be exploited when they are made using phased haplotypes; this means each individual in the data set is represented by two sequences each of which consists of alleles co-located on the same chromosome and inherited from a different parent. IBD estimates that are phase aware can improve relationship and pedigree inference, allow health and trait inheritance to be traced, and make possible a range of other inferences regarding demographic history and ancestry that are not possible when IBD estimates are made using only unphased genotype data. Therefore, methods and systems that can improve performance of phase aware IBD estimates have significant value.
  • the disclosed implementations concern methods, apparatus, systems, and computer program products for processing haplotype data to accurately estimate IBD segments between individuals.
  • a first aspect of the disclosure provides computer-implemented methods for estimating IBD segments between individuals.
  • the system involves: a sequencer for sequencing nucleic acids of the test sample; a processor; and one or more computer-readable storage media having stored thereon instructions for execution on said processor to estimate IBD segments between individuals.
  • Another aspect of the disclosure provides a computer program product including a non-transitory machine readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement the methods above for estimating IBD segments.
  • a computer implemented method of processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes including: providing a first digital template including a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites; providing a second digital template including a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different; providing two or more haplotype strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype strings at unmasked sites produced by applying the second digital template to the two or more haplotype
  • the first and second templates each have a size of at least four consecutive haplotype sites.
  • identifying the first matches among alleles at unmasked sites includes sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings.
  • computationally identifying IBD segments between the two or more haplotype strings further includes: computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and wherein merging the first and second matches among alleles to produce a merged set of IBD segments further includes computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments.
  • computationally identifying additional matches among alleles at unmasked sites employs a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template.
  • the first through sixth digital templates each include two masked sites and two unmasked sites.
  • the first digital template and the second digital template each have a ratio of masked sites to unmasked sites of between about 2:1 to about 1:2.
  • the two or more haplotype strings include at least one thousand haplotype strings. In some embodiments, the two or more haplotype strings include at least one million haplotype strings.
  • computationally identifying IBD segments between the two or more haplotype strings includes performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings. In some embodiments, computationally merging the first and second matches among alleles is performed while considering individual polymorphic sites of the two or more haplotype strings using the PBWT. In some embodiments, the total number of digital templates is between 2 and k, where k is the number of haplotype sites in the window.
  • PBWT positional Burrows-Wheeler transform
  • the total number of digital templates is k!/(m!*(k - m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • applying the first digital template comprises a deterministic process employing the first arrangement of masked and unmasked sites.
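The template counts above can be illustrated with a short sketch. It enumerates every arrangement of m masked sites in a window of k sites (k = 4, m = 2 gives the six templates mentioned above) and shows which templates hide a single mismatching site. The function names `make_templates` and `apply_template` and the example haplotypes are hypothetical; the disclosed method applies templates inside a PBWT sweep and merges the results, which this sketch does not attempt.

```python
from itertools import combinations

def make_templates(k, m):
    """All arrangements of m masked sites in a window of k sites.

    True marks an unmasked (compared) site, False a masked (ignored) site.
    """
    templates = []
    for masked in combinations(range(k), m):
        templates.append(tuple(i not in masked for i in range(k)))
    return templates

def apply_template(haplotype, template):
    """Tile the template along the haplotype and keep only unmasked alleles."""
    k = len(template)
    return [allele for i, allele in enumerate(haplotype) if template[i % k]]

# k = 4 sites per window, m = 2 masked sites: 4!/(2! * 2!) = 6 templates.
templates = make_templates(4, 2)

# Two toy haplotype strings that differ only at site 5 (a stand-in for a
# genotyping error). Site 5 falls on window offset 5 % 4 = 1.
hap_a = [0, 1, 1, 0, 1, 0, 0, 1]
hap_b = [0, 1, 1, 0, 1, 1, 0, 1]

# Templates under which the two strings still match at every unmasked site.
tolerant = [t for t in templates
            if apply_template(hap_a, t) == apply_template(hap_b, t)]
```

In this toy case exactly the templates that mask window offset 1 tolerate the mismatch, which is why using several different arrangements lets at least one of them bridge any single erroneous site.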
  • a system for processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes including: (a) one or more processors and associated memory; (b) computer readable instructions for: providing a first digital template including a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites; providing a second digital template including a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different; providing two or more haplotype strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches
  • the first and second templates each have a size of at least four consecutive haplotype sites.
  • the instructions for identifying the first matches among alleles at unmasked sites include instructions for sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings.
  • the instructions for identifying IBD segments between the two or more haplotype strings further include instructions for: computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and wherein merging the first and second matches among alleles to produce a merged set of IBD segments further includes computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments.
  • the instructions for identifying additional matches among alleles at unmasked sites employ a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template.
  • the first through sixth digital templates each include two masked sites and two unmasked sites.
  • the first digital template and the second digital template each have a ratio of masked sites to unmasked sites of between about 2:1 to about 1:2.
  • the two or more haplotype strings include at least one thousand haplotype strings. In some embodiments, the two or more haplotype strings include at least one million haplotype strings.
  • the instructions for identifying IBD segments between the two or more haplotype strings include instructions for performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings. In some embodiments, the instructions for merging the first and second matches among alleles include instructions for performing the merging while considering individual polymorphic sites of the two or more haplotype strings using the PBWT.
  • the total number of digital templates is between 2 and k, where k is the number of haplotype sites in the window. In some embodiments, the total number of digital templates is k!/(m!*(k - m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • a method of identifying IBD segments between two or more haplotype strings, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism including: (a) computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of two or more haplotype strings at unmasked sites produced by applying a first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype strings at unmasked sites produced by applying a second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the first digital template includes a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the second digital template includes a second arrangement of masked and unmasked sites in the window of consecutive
  • a computer implemented method of determining identity by descent (IBD) segments including: determining first potential IBD segments among phased haplotype data for a plurality of individuals, wherein the first potential IBD segments have an end site; determining second potential IBD segments among haplotype data for the plurality of individuals, wherein the second potential IBD segments have a start site; determining that the end site of the first potential IBD segments and the start site of the second potential IBD segments are within a threshold number of sites of each other; and merging the first potential IBD segments and the second potential IBD segments to form a combined potential IBD segment.
  • IBD identity by descent
  • the first potential IBD segments and the second potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the method further includes: determining a phase switch error occurred at a site between the first potential IBD segment and the second potential IBD segment for the individual; and swapping the haplotypes for the individual from the position of the phase switch error.
  • the first potential IBD segments and the second potential IBD segments overlap for an individual of the plurality of individuals.
  • the first potential IBD segment and the second potential IBD segment each span at least the threshold number of sites. In some embodiments, the threshold number of sites is between about 0 and 500 SNPs.
  • the plurality of individuals do not share a parent-child relationship.
  • the method further includes: determining third potential IBD segments among phased haplotype data for a plurality of individuals, wherein the third potential IBD segments have a start site; determining that the end site of the combined potential IBD segments and the start site of the third potential IBD segments are within the threshold number of SNPs of each other; and merging the combined potential IBD segments and the third potential IBD segments.
  • the combined potential IBD segments and the third potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the method further includes: determining a phase switch error occurred at a site between the combined potential IBD segment and the third potential IBD segment for the individual; and swapping the haplotypes for the individual from the position of the phase switch error. In some embodiments, the method further includes determining that the combined potential IBD segments have a minimum length in centimorgans and storing the combined potential IBD segments as IBD segments for the plurality of individuals.
  • a computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes including: providing two or more paired haplotype strings for identification of IBD segments therebetween, each of the two or more paired haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and computationally iterating through the two or more paired haplotype strings by: (i) identifying a first potential IBD segment between the two or more haplotype strings by identifying matches among alleles of the haplotype strings; (ii) comparing the first site of the first potential IBD segment to the last site of a previously identified second potential IBD segment; (iii) determining that the last site of the second potential IBD segment and the first site of the first potential IBD segment are within a threshold number of sites of each other; and (iv) merging the first potential IBD segment and the second potential IBD segment to form a combined potential I
  • a computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes including: (a) computationally identifying initial IBD segments between two or more haplotype strings by identifying first matches among alleles of the haplotype strings using a plurality of templates, each including a unique arrangement of masked and unmasked sites in a window of consecutive haplotype sites; and (b) providing information characterizing the initial IBD segments to a hidden Markov model (HMM) which removes potential phase switch errors to produce final IBD segments, wherein the HMM analyzes the information characterizing the initial IBD segments using distances between consecutive haplotype sites on a chromosome, one or more rates of recombination based on meiosis, and one or more rates of phase switch error based on a phasing method employed to phase the haplotypes.
  • HMM hidden Markov model
  • the method further includes, after (a) and before (b), removing some initial IBD segments determined to belong to haplotypes having less than a threshold amount of initial IBD segments, wherein the initial IBD segments provided to the HMM in (b) have had some initial IBD segments removed.
  • the threshold amount of initial IBD segments is less than two initial IBD segments per chromosome.
  • a computer implemented method of determining identical-by-descent (IBD) segments including: (a) for each polymorphic site in a series of polymorphic sites of two individuals, obtaining an IBD state that indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and, if so, which of the two individuals' phased haplotypes are part of the IBD segment, wherein the series of polymorphic sites are included in one or more pairs of chromosomes; and (b) applying a hidden Markov model (HMM) to the IBD states to produce one or more error-corrected IBD segments, wherein the HMM model takes as input, in addition to the IBD states as observed IBD states, (i) a rate of recombination based on a number of meioses, and (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes; wherein applying the HMM model takes as input, in addition to the IBD states as
  • the HMM takes as input: (iii) genetic distances between consecutive sites on a chromosome.
  • transition rates of the HMM are based on a rate at which IBD segments start, which rate is modeled as a function of the number of meioses.
  • the rate at which IBD segments start ( ⁇ s ) is modeled as follows:
  • transition rates of the HMM are based on a rate at which IBD segments end.
  • the rate at which IBD segments end is modeled as a function of the number of meioses.
  • the rate at which IBD segments end ( ⁇ e ) is modeled as follows:
  • the IBD states include nine different IBD states, which indicate nine conditions of zero IBD, half IBD, and full IBD.
  • transition rates of the HMM are based on a transition matrix Q a in FIG. 8B .
  • transition rates of the HMM are weighted by a probability that full IBD between the two individuals is truly present.
  • the probability that full IBD between the two individuals is truly present is modeled as a logistic function of an amount of estimated full IBD.
  • the probability that full IBD between the two individuals is truly present ( ⁇ ) is modeled as follows:
  • the transition rates are weighted by weighting transitions into full IBD states with ⁇ , and weighting transitions out of full IBD states with 1/ ⁇ .
  • the IBD states include 9 different IBD states, and the transition rates are based on a transition matrix Q ⁇ in (Eq. 5).
  • transition rates of the HMM are based on the at least one rate of phase switch error.
  • the at least one rate of phase switch error includes a rate of phase switch error for each of the two individuals, there are 4 types of phase switch errors, the IBD states include 9 different IBD states, and the transition rates are based on a 36 × 36 transition matrix Q in (Eq. 6). In some embodiments, transition probabilities of the HMM are based on the genetic distances between consecutive sites on a chromosome.
  • the transition probabilities are obtained by exponentiating a transition matrix.
  • transition probabilities of hidden IBD states Y_{l+1} given hidden IBD states Y_l are modeled as: P(Y_{l+1} | Y_l, m, ⁇_0, ⁇_1, ⁇_2) = e^{Q d_l}, wherein m is the number of meioses, ⁇_0 is a phase switch error rate for a first individual of the two individuals, ⁇_1 is a phase switch error rate for a second individual of the two individuals, ⁇_2 is an amount of estimated full IBD, Q is a transition matrix described by Eq. (Q), and d_l is the genetic distance between sites l and l+1.
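The step of exponentiating a transition rate matrix over a genetic distance can be illustrated with a minimal sketch. It uses a toy two-state chain (not IBD, IBD) rather than the 9-state or 36-state matrices of the disclosure, and the rates `rate_start` and `rate_end`, the distance `d`, and the helper names are all hypothetical values chosen for illustration. The exponential is approximated in pure Python with a truncated Taylor series plus scaling and squaring.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, d, terms=30, squarings=10):
    """Approximate exp(q * d) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = d / (2 ** squarings)
    scaled = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for t in range(1, terms):
        # term becomes scaled**t / t!
        term = mat_mul(term, scaled)
        term = [[v / t for v in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical rates at which IBD segments start and end (per unit distance).
rate_start, rate_end = 0.2, 1.5
Q = [[-rate_start, rate_start],
     [rate_end, -rate_end]]

d = 0.5  # assumed genetic distance between two consecutive sites
P = mat_exp(Q, d)

# Closed form for a two-state chain, used here only as a cross-check.
total = rate_start + rate_end
p_enter_ibd = rate_start / total * (1.0 - math.exp(-total * d))
```

Each row of `P` is a probability distribution over the next site's hidden state, so widely spaced sites (larger `d`) allow more state changes than closely spaced ones.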
  • emission probabilities of the HMM are dependent on phase switch errors.
  • the emission probabilities are defined by a uniform error term that weights probabilities of observed IBD states based on four different ways the two individuals may be in phase switch errors.
  • (b) includes using transition probabilities and emission probabilities of the HMM to identify the most likely sequence of hidden IBD states given the observed states.
  • the most likely sequence of hidden IBD states is identified using a Viterbi dynamic programming process.
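A Viterbi pass of the kind referred to above can be sketched as follows. This is a generic log-space Viterbi decoder over a toy two-state model with fixed transition probabilities; the state names, probabilities, and observation encoding are assumptions for illustration and do not reproduce the disclosed 9-state model or its distance-dependent transitions.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path for a discrete HMM, computed in log space."""
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this site.
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

states = ("none", "ibd")
log_start = {s: math.log(0.5) for s in states}
# Hypothetical "sticky" transitions: states tend to persist between sites.
log_trans = {
    "none": {"none": math.log(0.9), "ibd": math.log(0.1)},
    "ibd": {"none": math.log(0.1), "ibd": math.log(0.9)},
}
# Observation 1 means the first-pass caller reported IBD at the site, 0 means not.
log_emit = {
    "none": {0: math.log(0.8), 1: math.log(0.2)},
    "ibd": {0: math.log(0.2), 1: math.log(0.8)},
}

# A single dropped site inside an otherwise continuous run of observed IBD.
observed = [1, 1, 1, 0, 1, 1, 1]
decoded = viterbi(observed, states, log_start, log_trans, log_emit)
```

With these toy parameters the decoder keeps the hidden state in IBD across the isolated gap, which is the stitching behavior the HMM is meant to provide.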
  • the method further includes: performing (a) and (b) for a plurality of iterations, each iteration using a different number of meioses for the rate of recombination, thereby producing a plurality of sets of error-corrected IBD segments; and selecting a set of error-corrected IBD segments having a highest likelihood or probability in the plurality of sets as a final estimate of one or more IBD segments.
  • the different numbers of meioses are in a range from 1 to 14.
  • the method is initiated when the two individuals' IBD segments including the series of polymorphic sites meet a criterion.
  • the two individuals' IBD segments include two or more IBD segments on a single chromosome.
  • the two individuals' IBD segments exceed a minimum total amount of shared IBD.
  • FIG. 1 presents possible phase switch errors in long IBD segments.
  • FIGS. 2A and 2B present example process flows according to various embodiments discussed herein.
  • FIG. 3 presents an example process for finding IBD segments.
  • FIG. 4 illustrates haplotype strings, a positional prefix array, and a divergence array in accordance with various embodiments discussed herein.
  • FIGS. 5A and 5B present process flows according to various embodiments herein.
  • FIG. 5C presents pseudocode for an algorithm according to various embodiments herein.
  • FIG. 5D presents an illustration of the various data structures used in accordance with a Templated PBWT process as described herein.
  • FIGS. 6A and 6C present process flows for a phase switch error correction heuristic according to various embodiments herein.
  • FIG. 6B presents various types of phase switch errors that may occur between two pairs of haplotypes.
  • FIG. 7 illustrates using a HMM to process four haplotypes of two individuals to correct phase switch errors.
  • FIG. 8 presents a flow diagram for correcting phase switch errors in IBD segments using a HMM according to various embodiments.
  • FIG. 9A illustrates the structure of an example HMM model.
  • FIG. 9B illustrates a transition matrix that may be used as part of a HMM according to various embodiments herein.
  • FIG. 10 presents a functional diagram of a computer system for performing various embodiments disclosed herein.
  • FIG. 11 presents a block diagram of an IBD-based personal genomics service according to various embodiments herein.
  • FIG. 12 presents a plot comparing the speed of various IBD inference methods.
  • FIGS. 13-16 present plots comparing the IBD estimate errors of various methods.
  • FIGS. 17 and 18 present plots of errors of various methods for various simulated pairs of haplotypes.
  • FIG. 19 presents plots of the false positive and false negative rates of various methods.
  • FIGS. 20A and 20B illustrate the runtime of various methods and the parameters used for assessing runtime.
  • FIGS. 21 and 22 present illustrations of IBD haplotype sharing across Mexican states as determined by a Templated PBWT method as described herein.
  • the disclosure concerns methods, apparatus, systems, and computer program products for estimating IBD segments between individuals using haplotype data.
  • nucleic acids are written left to right in 5′ to 3′ orientation and amino acid sequences are written left to right in amino to carboxy orientation, respectively.
  • plurality refers to more than one element.
  • the term is used herein in reference to a number of nucleic acid molecules or sequence reads that is sufficient to identify significant differences in repeat expansions in test samples and control samples using the methods disclosed herein.
  • a DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment.
  • An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals.
  • DNA segments that are IBD are IBS by definition, but segments that are not IBD can still be IBS due to the same mutations in different individuals or recombinations that do not alter the segment.
  • nucleic acid refers to a covalently linked sequence of nucleotides (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3′ position of the pentose of one nucleotide is joined by a phosphodiester group to the 5′ position of the pentose of the next.
  • the nucleotides include sequences of any form of nucleic acid, including, but not limited to RNA and DNA molecules such as cell-free DNA (cfDNA) molecules.
  • cfDNA cell-free DNA
  • polynucleotide includes, without limitation, single- and double-stranded polynucleotides.
  • parameter herein refers to a numerical value that characterizes a physical property. Frequently, a parameter numerically characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, a ratio (or function of a ratio) between the number of sequence tags mapped to a chromosome and the length of the chromosome to which the tags are mapped, is a parameter.
  • a site refers to a unique position (i.e. chromosome ID, chromosome position and orientation) on a reference genome.
  • a site may be a residue, a sequence tag, or a segment's position on a sequence.
  • the specific quantitative value may be based on multiple other quantities, not just the one identified.
  • chromosome refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized individual human genome chromosome numbering system is employed herein.
  • IBD identical-by-descent
  • Estimating phase aware IBD segments is challenging not only because of the large sizes of the genomic data sets but also because of two types of error that break up IBD segments: genotyping error and phase switch (phasing) error. Because IBD segments are broken up by meiotic recombination events, they are expected to be longer for close relatives. However, long IBD segments are more likely than short segments to be affected by genotyping and phasing errors. Errors are therefore particularly problematic when detecting IBD among closely related individuals (e.g., first, second, and third degree relatives), whose long IBD segments are more likely to be fragmented, which makes accurate inference of phase aware IBD among close relatives especially difficult.
  • Genotyping error is error introduced via genotyping (e.g., sequencing) in which an allele that is actually of one type (e.g., a T) gets called a different allele (e.g., an A). This commonly has the effect of prematurely terminating a sequence of matches that would otherwise be long enough to be considered an IBD segment.
  • Phase switch error is error introduced by phasing: maternal and paternal copies are reversed. See FIG. 1 . In some typical cases, statistical phasing processes introduce a phase switch error, on average, about every 12 centimorgans.
  • FIG. 1 illustrates how phase switch errors fragment long IBD segments.
  • two IBD segments are shared on a single chromosome in two related individuals (top and bottom).
  • phase switch errors (illustrated by vertical dashed lines) occur at different positions along the chromosome in the two individuals, fragmenting the true IBD segments into many erroneous false IBD segments.
  • the individual represented by the top two haplotypes has six phase switch errors and the individual represented by the bottom two haplotypes has four phase switch errors.
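The fragmentation described for FIG. 1 can be reproduced on toy data. The sketch below swaps one individual's two haplotypes from each phase switch site onward and then lists the runs of matching alleles against a relative's haplotype. The haplotype values, switch positions, and function names are hypothetical; the second haplotype is chosen as the exact complement of the shared one so that no matches arise by chance.

```python
def apply_phase_switches(hap0, hap1, switch_sites):
    """Swap the two haplotypes of one individual from each switch site onward."""
    out0, out1 = list(hap0), list(hap1)
    for site in sorted(switch_sites):
        out0[site:], out1[site:] = out1[site:], out0[site:]
    return out0, out1

def matching_runs(x, y):
    """Maximal runs of consecutive matching sites, as (start, end) with end exclusive."""
    runs, start = [], None
    for i, (a, b) in enumerate(zip(x, y)):
        if a == b and start is None:
            start = i
        elif a != b and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(x)))
    return runs

# The relative carries `shared`; the individual truly carries `shared` and `other`.
shared = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
other = [1 - allele for allele in shared]

# Two hypothetical phase switch errors, at sites 4 and 7.
phased0, phased1 = apply_phase_switches(shared, other, [4, 7])

runs_on_hap0 = matching_runs(phased0, shared)
runs_on_hap1 = matching_runs(phased1, shared)
```

One true segment spanning all ten sites appears as three shorter pieces split across the individual's two reported haplotypes, each of which may fall below a length threshold on its own.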
  • phased haplotype data is processed using a method based on the positional Burrows-Wheeler transform (PBWT) and a probabilistic hidden Markov model (HMM).
  • phased haplotype data is processed using a method based on the PBWT and a heuristic to correct phase switch errors.
  • IBD segment finding methods may minimize genotyping and phasing error using one or more of the following techniques:
  • an IBD segment computation procedure passes multiple digital templates over phased haplotypes to differentially mask haplotype sites, thereby temporarily ignoring different sites of potential genotype error.
  • the IBD segments being generated by the different templates are combined to effectively remove fragmentation caused by genotyping error.
  • the procedure is a templated positional Burrows-Wheeler transform. This is the positional Burrows-Wheeler transform (PBWT; Durbin 2014) with substantial modifications to handle genotyping errors and missing data.
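For orientation, the untemplated PBWT sweep of Durbin (2014) that the procedure builds on can be sketched as below. It maintains a positional prefix array (haplotypes ordered by their reversed prefixes) and a divergence array (where each haplotype's match with its neighbor in that order begins). This is the standard published algorithm for binary alleles, written here as an illustration only; it does not include the templating, error handling, or missing-data modifications described in this disclosure, and `match_start` is a naive helper added purely for cross-checking.

```python
def pbwt_arrays(haps):
    """Return the positional prefix array and divergence array after all sites.

    haps is a list of equal-length lists of 0/1 alleles.
    """
    n_haps, n_sites = len(haps), len(haps[0])
    prefix = list(range(n_haps))
    diverge = [0] * n_haps
    for k in range(n_sites):
        p = q = k + 1
        zeros, ones, d_zeros, d_ones = [], [], [], []
        for idx, start in zip(prefix, diverge):
            # Track the latest match start seen since the last 0 / last 1 allele.
            if start > p:
                p = start
            if start > q:
                q = start
            if haps[idx][k] == 0:
                zeros.append(idx)
                d_zeros.append(p)
                p = 0
            else:
                ones.append(idx)
                d_ones.append(q)
                q = 0
        # Stable partition: haplotypes with allele 0 at site k sort first.
        prefix = zeros + ones
        diverge = d_zeros + d_ones
    return prefix, diverge

def match_start(x, y, end):
    """Naive reference: smallest s such that x[s:end] == y[s:end]."""
    s = end
    while s > 0 and x[s - 1] == y[s - 1]:
        s -= 1
    return s

haps = [
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
]
prefix, diverge = pbwt_arrays(haps)
```

Because haplotypes that share a long match ending at the current site sit next to each other in the prefix array, candidate IBD matches can be read off neighbors instead of comparing every pair.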
  • IBD segments that are likely fragmented by phase switch errors introduced during statistical phasing are addressed using a heuristic that recognizes likely occurrences of phase switch errors and/or a probabilistic model that accounts for error rates of the phasing techniques.
  • the model is applied to haplotype/IBD segment data generated using the templating approach described in 1.
  • Some implementations apply a hidden Markov model (HMM) that accounts for both recombination and the phase switching process to reduce these errors.
  • The HMM passes along the chromosomes of the two individuals, “stitching” fractured IBD segments together.
  • a heuristic is applied that identifies fractured IBD segments based on potential IBD segments that start or end within a distance of the start or end of another IBD segment.
  • the heuristic may stitch the IBD segments together by, for example, swapping haplotype segments in a target individual.
  • the heuristic may be applied as potential IBD segments are generated using the templating approach described in 1 without a separate iteration over the IBD segment data.
  • the heuristic assumes that IBD segments within a threshold distance of each other are likely to be a single IBD segment fractured by one or more phase switch errors.
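  • The distance-based stitching heuristic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `stitch_segments`, the `(start, end)` tuple layout, and the `max_gap` parameter are hypothetical names introduced here, and positions are taken to be in any consistent unit (e.g., site index or centimorgans).

```python
def stitch_segments(segments, max_gap):
    """Join potential IBD segments separated by at most max_gap.

    segments: list of (start, end) tuples for one pair of individuals
              (hypothetical layout chosen for this sketch).
    """
    stitched = []
    for start, end in sorted(segments):
        if stitched and start - stitched[-1][1] <= max_gap:
            # Close enough to the previous segment: treat the break as a
            # likely phase switch error and extend the previous segment.
            prev_start, prev_end = stitched[-1]
            stitched[-1] = (prev_start, max(prev_end, end))
        else:
            stitched.append((start, end))
    return stitched
```

  • The sketch only joins the segment records; it does not decide which individual carries the phase switch error or swap any haplotype data.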
  • phased sequence data for two or more individuals who are to be compared to detect IBD segments. See block 203 .
  • phased data for many more individuals is obtained.
  • hundreds, thousands, or even millions of individuals have their phased haplotype data compared using methods disclosed herein.
  • phased haplotype data for more than about 10 million individuals are compared.
  • the operation 203 involves identifying individuals or haplotypes on which an IBD comparison is to be run.
  • the phased data includes two strings of haplotype data for each individual (one per chromosome). In other words, there are four haplotypes to be considered for two individuals.
  • Phased haplotype data may be obtained from various sources including statistical techniques such as BEAGLE, FINCH, EAGLE, and other known techniques.
  • An example discussion of phasing techniques is presented in U.S. Pat. No. 9,836,576, filed Mar. 13, 2013, and incorporated herein by reference in its entirety.
  • the haplotype data may be represented as strings of allele values (e.g., 1s and 0s) for sites in the haplotype, each of which is the site of a polymorphism. Each such site may be referred to as an index in the haplotype string.
  • the process assumes that each haplotype site is a biallelic site on a chromosome. It may be given a value of, e.g., 0 for one allele and a value of 1 for the other allele.
  • a typical chromosome may provide hundreds of thousands of sites.
  • Each haplotype may be given its own unique identifier, which may be arbitrarily set.
  • the phased haplotype data is provided to a first processing block as illustrated by a block 205 .
  • This processing may reduce the fragmenting impact of genotyping error in IBD segment finding.
  • multiple operations are performed in parallel, sequentially, or in some combination thereof.
  • significant computational efficiency is realized by performing these operations together for a given haplotype (e.g., in an inner loop of a software routine as illustrated in the sample code below).
  • the operations performed in the first phase include the following: (a) applying digital templates with masked and unmasked positions to exclude certain haplotype sites along the length of the haplotype, (b) identifying matching allele values at unmasked positions along the haplotypes to identify putative IBD segments for the various digital templates, and (c) merging the resulting IBD segments (e.g., as they are being generated) from the various digital templates.
  • the digital templates are constructed as small windows that can be “slid” or “ratcheted” along the length of the haplotype strings, considering consecutive sub-segments of the haplotype sites as they go. Criteria to consider in choosing template structures are the length of the template, the number of masked or null sites in the template, and the arrangement of masked and unmasked sites in the template. Typically, full sets of templates are used in the process that contain all possible arrangements of masked and unmasked sites in a template length. An example set of four site digital templates, each employing two masked and two unmasked positions is described below. Of course, the process may alternatively employ larger (or smaller) templates and/or use templates having a higher or lower proportion of null positions per template.
  • the output of the first processing block implemented in operation 205 is a set of IBD segments or other haplotype matching data for combinations of the various individuals whose phased haplotype data is processed. This data is then passed to a second processing block for processing as illustrated in an operation 207 .
  • a goal of this second operation may involve reducing the fragmenting impact of phase switch error in IBD segment finding.
  • the haplotype/IBD data is subjected to a probabilistic model that accounts for recombination rates based on meioses, which vary based on the degree of relatedness of any two individuals, and rates of phase switch errors introduced by the phasing technique(s) employed to generate the phased haplotype data.
  • the model may also account for other inputs such as the genetic distance between adjacent sites on the haplotype and/or the probability of having full IBD state.
  • a hidden Markov model may be used to implement the probabilistic model.
  • An optional last operation of the process 201 involves presenting the processed IBD information in a way that can show the degree of relatedness of the two or more haplotypes that are compared. See the operation represented in block 209 .
  • a process 251 begins by initially obtaining phased sequence data for two or more individuals who are to be compared for identifying IBD segments. See block 253 .
  • phased data for many more individuals may be obtained, e.g., hundreds to millions of individuals.
  • the operation of block 253 involves identifying individuals and/or haplotypes on which an IBD comparison is to be run.
  • Computational aspects of process 251 include sequentially considering haplotype positions for genotype errors and phase switch errors within individual haplotypes while keeping track of match segments between haplotypes. Processing each new haplotype position is initiated at a process operation 254 , which selects the next position in the haplotypes under consideration.
  • a first processing operation 255 considers possible errors in the individual haplotypes using multiple templates such as those described elsewhere herein. Haplotype position matches are determined using these templates, and, from these results, an overall decision on matching segments is made. In some implementations, the resulting match segments have reduced genotyping error.
  • the haplotype position under consideration is then analyzed in a processing operation as illustrated in block 257 .
  • a result of this operation may involve reducing the fragmenting impact of genotyping error in IBD segment finding.
  • the haplotype/IBD data may be analyzed by one or more phase switch heuristics and/or models that identify situations where phase switch errors are likely to have occurred.
  • a heuristic may identify situations where one or more IBD segments between individuals end at a first position and then new IBD segments begin at a second position within a threshold distance from the first position.
  • An identified likely phase switch error may be corrected by joining the IBD segments in an individual identified to possess the likely phase switch error. In some cases, error is corrected by swapping haplotype segments within the identified individual.
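  • Swapping haplotype segments within one individual can be illustrated with a short sketch. The function name `swap_haplotypes_from` and the use of Python strings for haplotypes are assumptions made for illustration; the text does not specify how the swap is carried out.

```python
def swap_haplotypes_from(hap_a, hap_b, position):
    """Exchange the two haplotypes of one individual from `position` onward.

    hap_a, hap_b: equal-length allele strings (e.g., "0101...") for the two
    phased haplotypes of the individual suspected of a phase switch error.
    """
    new_a = hap_a[:position] + hap_b[position:]
    new_b = hap_b[:position] + hap_a[position:]
    return new_a, new_b
```

  • Applying the swap at the position of a suspected phase switch restores a continuous haplotype on each side, so a match that was split across the individual's two haplotypes becomes a single run.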
  • a result of the operation may involve reducing the fragmenting impact of phase switch error in IBD segment finding.
  • An optional last operation of the process 251 involves presenting the processed IBD information in a way that can show the degree of relatedness of the two or more haplotypes that are compared. See the operation represented in block 259 .
  • a reference sequence may be the sequence of a whole genome, the sequence of a chromosome, the sequence of a sub-chromosomal region, etc. From a computational perspective, repeats create ambiguities in alignment, which, in turn, can produce biases and errors even at the whole chromosome counting level. Paired end reads coupled with adjustable insert length in various embodiments can help to eliminate ambiguity in alignment of repeating sequences and detection of repeat expansion.
  • a goal of the process is to use alignment of multiple haplotypes to determine genetic relationship(s) between two or more individuals, or in some cases that potentially involve inbreeding, within a single individual.
  • the process determines relationships between two haplotypes. IBD may be used for this purpose.
  • IBD can be understood in the context of meiosis and recombinable DNA. Because of recombination and independent assortment of chromosomes, the autosomal DNA and X chromosome DNA (collectively referred to as recombinable DNA) from the parents is shuffled at the next generation, with small amounts of mutation. Thus, only relatives will share long stretches of genomic regions where their recombinable DNA is completely or nearly identical. Such regions are referred to as “identical-by-descent” (IBD) regions because they arose from the same DNA sequences in an earlier generation/common ancestor.
  • locating IBD regions includes sequencing the entire genomes of the individuals and comparing the genome sequences. In some embodiments, locating IBD regions includes assaying a large number of markers that tend to vary in different individuals and comparing the markers. Examples of such markers include Single Nucleotide Polymorphisms (SNPs), which are points along the genome with two or more variations; Short Tandem Repeats (STRs), which are repeated patterns of two or more repeated nucleotide sequences adjacent to each other; and Copy-Number Variants (CNVs), which include longer sequences of DNA that could be present in varying numbers in different individuals. Long stretches of DNA sequences from different individuals' genomes in which markers in the same locations are the same indicate that the rest of the sequences, although not assayed directly, are also likely identical.
  • a PBWT process is implemented according to the following description. Initially, each haplotype under consideration is given its own unique identifier, which may be arbitrarily set. Then, during execution, the method steps through the sites of all haplotypes under consideration, position-by-position, starting at a first position, which may be identified as position 0. As the method steps through the haplotype sites, it keeps track of two arrays, which are updated for every position (index) in the haplotypes. Also, during a pass through the haplotype sites, a templated PBWT process may apply one, some, or all of the digital templates at each position.
  • the first array is a “positional prefix array” that contains a list of all haplotypes under consideration. It is populated with IDs of all the haplotypes. A separate instance of the positional prefix array is produced each time a new site is encountered while traversing the haplotype string. Over the course of the process, and while certain haplotypes have identical allele values from one position to the next, these haplotypes are grouped together in the positional prefix array. In other words, haplotypes having matching allele values, remain together (in the same block) within the positional prefix array for as long as their alleles match. By keeping the haplotypes together while alleles match, the positional prefix array contains information about putative IBD segments.
  • the second array is a “divergence array” that indicates where matches between any two haplotypes under consideration began. It reflects how many positions/sites back in the haplotype string until there was a difference. In other words, this array keeps track of the last time that two haplotypes did not match by, e.g., providing the index value of the last mismatch for any two haplotypes.
  • An example of a general IBD segment finding process 301 is depicted in FIG. 3 . As illustrated, the process begins by receiving haplotype strings representing allele values across all haplotypes to be considered in the process. See block 303 . This may correspond to block 203 in FIG. 2A and/or block 253 of FIG. 2B .
  • the process lists all haplotypes in the positional prefix array. It may do this randomly or in some order, but typically it does not yet account for the allele values at any haplotype site.
  • the individual haplotypes may be listed by unique identifiers.
  • the values in the divergence array are all set to 0 because there are no previous sites that have been considered.
  • the array initializations are illustrated by operation 305 in process 301 of FIG. 3 .
  • the process goes through all haplotypes in the order of the positional prefix array (which may initially be random or otherwise arbitrary) and orders the haplotypes such that those that have a first allele value (e.g., 0) in the current position are grouped together at the top, and all that have a second allele value (e.g., 1) in the position are grouped together at the bottom. See operation 309 .
  • this operation produces a new positional prefix array in which all haplotype indexes that have a 0 at the current position are grouped together in the array, and all haplotype indexes that have a 1 at the position are grouped together.
  • By “grouped together” it is meant that haplotype identifiers are provided in adjacent positions in the positional prefix array. This is illustrated in FIG. 4 which shows a group of six haplotype strings and the associated positional prefix array at a few haplotype sites. Obviously, a typical haplotype has many more sites than illustrated for the haplotypes in FIG. 4 . Further, a typical IBD segment finding process may simultaneously evaluate many more haplotype strings (e.g., hundreds, thousands, or millions).
  • the process notes that all potential IBD segments begin at site 0 and therefore they effectively have a mismatch at position 0. Therefore, the first entry in the divergence array is all zeros. See operation 311 in process 301 and the first column in the divergence array of FIG. 4 .
  • the order of haplotypes in the divergence array is the same as in the positional prefix array.
  • the values in the divergence array are, for currently matching haplotypes, the sites (index values) of the first matching position between the two adjacent haplotypes within the array.
  • the value in the divergence array is the first matching position of the current segment.
  • the method assigns the next site to the new div array even though it has not peeked ahead and checked if the two haplotypes actually match at the next site. If, in the next iteration, the method learns that the segments still do not match, the relevant value in the divergence array simply gets updated again.
  • the process again goes through the haplotypes and again rearranges the haplotype identifiers in the positional prefix array so that those having the same allele value at the current position are grouped together, e.g., all haplotypes having a 0 allele value in the current position are grouped at the top of the array and all those having a 1 allele value are grouped at the bottom.
  • Haplotype strings that have the same alleles over multiple consecutive positions stay near one another in the positional prefix array.
  • the divergence array uses the new arrangement of haplotypes (from the positional prefix array), flags any mismatches between adjacent haplotypes at the current position, and inserts the next haplotype site number for mismatching pairs.
  • the next site number is the location of the next possible start position for a new match segment.
  • element i in the divergence array indicates when a current segment match began between the haplotype at ppa[i] and the haplotype at ppa[i − 1].
  • the positional prefix array has the following values:
  • haplotypes 2 and 3 have a match that extends from the beginning of the alignment (position 0) to the current position (position 5).
  • Haplotype 1 matches with haplotype 3 from position 3 to position 5 (which also implies haplotype 1 and haplotype 2 have the same match).
  • haplotype 4 matches with haplotype 1 from position 2 to position 5 (which also implies haplotype 4 matches haplotypes 1, 2, and 3 between positions 3 and 5).
  • the routines use the alleles at the current position in the alignment to construct the divergence array for the next position.
  • the routine inserts 6 into the position of the haplotypes in the divergence array. Note that the method does not check whether the two haplotypes actually match at position 6, which is why the divergence array does not always contain the beginning position of matches. Once the method actually visits position 6, this value will be overwritten with 7 if the haplotypes do not match at position 6. The method continues in this fashion (overwriting values in the divergence array) until actual matches are found.
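  • The per-site update of the positional prefix array and divergence array described above can be sketched as follows. This is a minimal, untemplated sketch that follows the general shape of the published PBWT update (Durbin 2014); the function names `pbwt_update` and `run_pbwt` and the list-of-lists input format are assumptions made here, not part of the disclosure.

```python
def pbwt_update(ppa, div, alleles, k):
    """Advance the positional prefix and divergence arrays past site k.

    ppa: haplotype IDs in their current sorted order.
    div: div[i] is the first site of the current match between ppa[i]
         and ppa[i - 1]; a value of k + 1 after the update means the
         pair does not match at site k.
    alleles: alleles[h] is the 0/1 allele of haplotype h at site k.
    """
    zeros, ones = [], []          # haplotype IDs grouped by allele at site k
    div_zeros, div_ones = [], []  # divergence values for each group
    p = q = k + 1                 # "next possible start" for each group
    for hap, start in zip(ppa, div):
        p = max(p, start)
        q = max(q, start)
        if alleles[hap] == 0:
            zeros.append(hap)
            div_zeros.append(p)
            p = 0
        else:
            ones.append(hap)
            div_ones.append(q)
            q = 0
    # Haplotypes with allele 0 go to the top, allele 1 to the bottom.
    return zeros + ones, div_zeros + div_ones


def run_pbwt(haplotypes):
    """Sweep all sites once and return the final (ppa, div) arrays."""
    n_haps = len(haplotypes)
    ppa, div = list(range(n_haps)), [0] * n_haps
    for k in range(len(haplotypes[0])):
        column = [hap[k] for hap in haplotypes]
        ppa, div = pbwt_update(ppa, div, column, k)
    return ppa, div
```

  • Because the grouping is a stable partition, haplotypes that keep matching stay adjacent from one site to the next, and a mismatching pair receives the next site number (k + 1) as its provisional match start.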
  • the process flags the two haplotypes as having a potential IBD segment. In the example of process 301 , it does this by creating new match segment records when two haplotypes have a number of consecutive shared matches that is greater than a threshold number of consecutive sites. See operation 313 .
  • the threshold value may be chosen to balance speed and sensitivity. In certain examples, the threshold number is between about 50 and 1000 sites (e.g., about 200 matching sites).
  • the process may complete a match segment record when two matching haplotypes finally diverge in allele values, thereby ending the match segment. See operation 315 .
  • the process may maintain a separate report populated with matches of greater than the threshold length.
  • the matches may be identified by start position and end position (indexes) and the haplotypes involved in the match (e.g., haplotype ID #11 and haplotype ID #5).
  • the match segment includes both the starting and ending sites of the match segment.
  • the process may still flag the match segment for further consideration.
  • the match is identified by the two matching haplotypes and their starting index for the match segment.
  • the ending index for the match region is the site at the end of the haplotype.
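  • The match segment records described above (start index, end index, and a minimum number of consecutive matching sites) can be illustrated with a direct pairwise scan. This sketch deliberately does not use the PBWT; it only shows the record-keeping for one pair of haplotypes. The function name `match_segments` and the `threshold` parameter are hypothetical.

```python
def match_segments(hap_x, hap_y, threshold):
    """Return (start, end) index pairs, inclusive, for runs of matching
    sites longer than `threshold` between two haplotype strings."""
    records, start = [], None
    n_sites = min(len(hap_x), len(hap_y))
    for k in range(n_sites):
        if hap_x[k] == hap_y[k]:
            if start is None:
                start = k          # a new run of matches begins here
        else:
            # The haplotypes diverge: complete the record if long enough.
            if start is not None and k - start > threshold:
                records.append((start, k - 1))
            start = None
    # A run still open at the end of the haplotype ends at the last site.
    if start is not None and n_sites - start > threshold:
        records.append((start, n_sites - 1))
    return records
```

  • A production method would obtain the same records for all pairs at once from the positional prefix and divergence arrays rather than comparing pairs one at a time.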
  • FIG. 3 does not illustrate treatment of the haplotype strings by different digital templates. This will be discussed below.
  • the process 301 proceeds through operations 307 - 315 for each successive haplotype site and reorders the haplotype IDs in the positional prefix array based on matches (the haplotypes having a 0 at the current position are grouped together and those having a 1 are grouped together).
  • the haplotypes that have long stretches of matched sites stay together in the positional prefix array for long durations. This is because all matching haplotypes stay together in the positional prefix array until one of them has a different allele value at a particular haplotype position. At that point, the one or more haplotypes that diverge from the larger group are moved to a different position in the positional prefix array.
  • the method keeps sufficient information to reconstruct all IBD segments for any two haplotypes. This includes all haplotypes under consideration, including the first and last haplotypes in the positional prefix array.
  • FIG. 4 presents an example of positional prefix array values and divergence array values for a few sites on an alignment of haplotype strings.
  • When there are no further haplotype sites to consider, as indicated by process block 317 , potential IBD segments have been identified and these may be processed in various ways such as, optionally, being used in a relatedness analysis of individuals whose haplotypes were considered in the analysis. See operation 319 .
  • the potential IBD segments may be further processed by a model such as an HMM to correct for phase switch errors.
  • this additional processing is optional, particularly in situations where phased haplotype data can be expected to have relatively few phase switch errors.
  • potential IBD segments that are flagged by the PBWT process, but are shorter than a defined genetic length are excluded from further consideration.
  • An example of a threshold genetic length is between about 1.5 and 3 centimorgans. Thus, for example, if a segment extends over 200 sites but does not extend the full required genetic distance, the method discards the segment.
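  • The genetic-length filter can be sketched as below, assuming a per-site genetic map is available. The function name `filter_by_genetic_length`, the `cm_positions` list (cumulative centimorgan position of each site), and the `min_cm` parameter are assumptions introduced for this example.

```python
def filter_by_genetic_length(segments, cm_positions, min_cm):
    """Keep only segments spanning at least min_cm centimorgans.

    segments: list of (start_site, end_site) index pairs, inclusive.
    cm_positions: cm_positions[k] is the genetic position of site k in cM
                  (hypothetical input; any genetic map could supply it).
    """
    kept = []
    for start, end in segments:
        genetic_length = cm_positions[end] - cm_positions[start]
        if genetic_length >= min_cm:
            kept.append((start, end))
    return kept
```

  • A segment covering many sites in a region of low recombination can still fall short of the centimorgan threshold, which is why the filter is applied on genetic rather than physical or site-count length.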
  • the PBWT process assumes that there are no errors. If there is in fact an error, it may prematurely truncate a sequence of matches and/or artificially prolong a sequence. Typically, long matches that in fact exist (e.g., between close relatives) are prematurely broken due to genotyping and/or phase switch errors.
  • integrating digital templates into an IBD segment finding process can mitigate the impact of some errors, particularly genotyping errors.
  • One approach employs a digital template that shifts over the haplotype strings and masks certain haplotype sites from consideration as it goes. This approach takes a normal haplotype alignment but applies the template to skip over some sites that would otherwise be considered. With the excluded sites removed from consideration, the process identifies putative IBD segment matches using a general approach such as PBWT. By masking some sites from comparison, sites of erroneous calls may be ignored. Some templates may consider the erroneously called alleles while others exclude them. By considering putative IBD segments created using all the templates, the process can remove breaks and more accurately identify complete IBD segments.
  • the template provides a sliding window of consecutive sites having, in some embodiments, a fixed mask pattern.
  • the template is moved successively along the haplotype string, typically with no overlap of sites between one application of the template and the next.
  • the process is similar to a generic IBD segment finding method such as the PBWT process. That is, in some embodiments, the process generates a positional prefix array and a divergence array for haplotype strings modified by the template.
  • the computational system flags and records matching segments as before. But the matching segments produced by single templates have some sites excluded.
  • the templating function employs a probabilistic function to pick mask sites.
  • the mask pattern is deterministic based on the template.
  • the masking of sites may follow a specific pattern, based on each template, rather than a random selection or masking of sites.
  • the mask pattern may remain fixed as the process moves from one haplotype site to the next. In some cases, however, the mask pattern may vary as the process moves over haplotype sites, but such variation may be deterministic rather than random.
  • the process employs multiple fixed templates for a given matching problem.
  • templates include □h□h (all odd sites), h□h□ (all even sites), □□hh, hh□□, □hh□, and h□□h, where sites at □ will be masked out and only sites at h will be used to construct the method.
  • the set of templates used together in a process may be chosen so that, for the fixed length of the templates (e.g., the four site templates exemplified here), if there are up to a given number of errors (e.g., two errors) within the window, at least one of the templates is guaranteed to correctly report a match. For example, if there were errors at sites 2 and 4 in one application of a four site template, only the h□h□ template would give an error-free read.
  • the total number of digital templates may be between 1 and k, where k is the number of haplotype sites in the window. In some implementations, the total number of digital templates is k!/(m!(k − m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • the templates are characterized by a ratio of masked to unmasked sites which ranges between about 1/w and (w − 1)/w, where w is the length of the template window.
  • the templates are characterized by a length equivalent to the total size of the haplotype alignment. As examples, a range of template lengths is between about three and ten consecutive sites.
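  • The template count k!/(m!(k − m)!) can be checked by enumerating every arrangement of masked sites in a window. In this sketch a template is written as a string in which "1" marks an unmasked site and "0" a masked site; that encoding and the function name `enumerate_templates` are assumptions chosen for illustration.

```python
from itertools import combinations
from math import comb


def enumerate_templates(window, masked):
    """List every template of length `window` with `masked` masked sites.

    Each template is a string: "1" = unmasked site, "0" = masked site.
    """
    templates = []
    for masked_sites in combinations(range(window), masked):
        pattern = "".join(
            "0" if k in masked_sites else "1" for k in range(window)
        )
        templates.append(pattern)
    return templates


four_site_templates = enumerate_templates(4, 2)
# The number of templates equals the binomial coefficient C(4, 2) = 6.
assert len(four_site_templates) == comb(4, 2)
```

  • For a four site window with two masked sites this yields the six arrangements corresponding to the example templates listed above.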
  • a templating function can be tuned to alter sensitivity to error.
  • one templating function may be implemented as a decision tree that uses a window size of 4 haplotype sites and 6 templates, and so guarantees any matches within that 4 site window as long as there are no more than 2 errors. If i is the current template (range 0 to 5) and k is the current position within a template window (range 0 to 3), then this templating function T(i, k) may be represented as:
  • FIG. 5A illustrates application of templates to an IBD segment finding method.
  • the depicted process 501 begins by receiving data and setting up parameters for the routine. This may involve receiving phased haplotypes to be used in the templated matching routine (block 503 ), defining templates to apply (block 505 ), and setting up any needed matrices and arrays such as a positional prefix array and a divergence array.
  • the depicted process loops over the various sites of the haplotypes, and at each site it loops over the available templates. This is depicted as follows.
  • the process increments to the next haplotype site at an operation 507 , and while at the current site, it iterates over the various templates, starting by incrementing to the next template at an operation 509 .
  • While the routine is fixed at a particular template, the process identifies matches and mismatches among the haplotype strings (block 511 ) and merges match segments for the various templates (block 513 ).
  • Operation 511 identifies matches only if the current haplotype site is unmasked in the current template. Assuming that the current site is unmasked, operation 511 may be implemented in various ways such as by updating positional prefix and divergence arrays. Note that each template may have its own match segment information. Using this information, operation 513 may merge currently pending segments (at the current haplotype site) from among the various templates.
  • Operation 515 serves to iterate over all the templates while the process is fixed at a given haplotype site and operation 517 serves to iterate over all haplotype sites. Ultimately all haplotype sites are considered and the error-corrected IBD segments are completed. See operation 519 .
  • FIG. 5B illustrates another application of templating to an IBD segment finding method.
  • the depicted process 551 begins by receiving data and setting up parameters for the routine. This may involve receiving phased haplotypes (block 553 ) to be used in the templated matching routine, defining templates to apply (block 555 ), and setting up any needed matrices and arrays such as a positional prefix array and a divergence array.
  • the depicted process loops over the various sites of the haplotypes, and at each site it loops over the available templates. This is depicted in the figure as follows.
  • the process increments to the next haplotype site (the current haplotype site) at an operation 557 , and while at the current site, it iterates over the various templates, starting by incrementing to the next template at an operation 559 .
  • While the routine is fixed at a particular template, the process identifies matches and mismatches among the haplotype strings (block 561 ) and merges match segments for the various templates (block 563 ).
  • An operation 561 identifies matches only if the current haplotype site is unmasked in the current template.
  • operation 561 may be implemented in various ways such as by updating positional prefix and divergence arrays.
  • each template may have its own match segment information.
  • operation 563 may merge currently pending segments (at the current haplotype site) from among the various templates.
  • Operation 565 serves to iterate over all the templates while the process is fixed at a given haplotype site.
  • phase switch errors may be addressed using a heuristic that recognizes typical phase switch errors.
  • An operation 567 serves to iterate over all haplotype sites. If any of the templates indicates a continuous sequence of matching sites including the current site or sites adjacent to the current site, the match sequence is deemed to continue, even if one or more of the templates indicates a gap in the match sequence. Ultimately all haplotype sites are considered and the error-corrected IBD segments are completed. See operation 569 .
  • Merging may involve aligning the putative IBD segments from each templated result, and then scanning through the template-specific segments for pairs of haplotypes. During this process, as long as one of the six templates (or however many are used) still shows a continuing segment, the method keeps a merged IBD segment intact.
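  • The merging rule (keep a merged segment intact as long as at least one template still shows a continuing segment) can be sketched for a single pair of haplotypes. This is a simplified pairwise illustration, not the templated PBWT itself: the names `template_runs` and `merge_runs`, the "1"/"0" template encoding ("1" = unmasked, "0" = masked), and the treatment of masked sites as extending a run are all assumptions of this sketch.

```python
TEMPLATES = ["0101", "1010", "0011", "1100", "0110", "1001"]


def template_runs(hap_x, hap_y, template):
    """Runs of sites with no mismatch at any unmasked site of `template`."""
    runs, start = [], None
    n_sites = min(len(hap_x), len(hap_y))
    for k in range(n_sites):
        used = template[k % len(template)] == "1"
        if used and hap_x[k] != hap_y[k]:
            # A mismatch at an unmasked site ends the current run.
            if start is not None:
                runs.append((start, k - 1))
            start = None
        elif start is None:
            start = k
    if start is not None:
        runs.append((start, n_sites - 1))
    return runs


def merge_runs(run_lists):
    """Union of overlapping or adjacent runs from all templates."""
    merged = []
    for start, end in sorted(r for runs in run_lists for r in runs):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

  • With a single miscalled allele, templates that leave the erroneous site unmasked report two fragments, while templates that mask it report one continuous run, so the merged result is unbroken. The sketch omits the minimum-length thresholds that a full method would apply before reporting.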
  • the methods assume that any IBD start or end points within an otherwise continuous IBD segment are caused by errors. This is a reasonable assumption because the comparison is made between two individuals. There is a very low probability that two haplotypes will match, for greater than a threshold length, by chance.
  • an additional filtering operation to remove some putative IBD segments is performed after one of the above-described processes such as process 301 or process 501 .
  • the filter may operate by discarding putative IBD segments of size below three centimorgans.
  • Given M haplotypes with N bi-allelic sites, the PBWT algorithm can identify identical subsequences of the haplotypes in O(NM) time.
  • a limitation of PBWT is that it requires exact subsequence matches with no errors or missing data.
  • a templated PBWT may be used.
  • a templated PBWT may be designed or configured to identify matching subsequences of the haplotypes despite missing data and genotyping errors with only a small linear increase in computational time compared to the PBWT.
  • One approach for extending PBWT to report matching haplotypes that include some errors involves constructing multiple replicates of the PBWT data structure. Each of these PBWTs is built by masking the haplotype alignment using a different repeating template. Each PBWT may then be individually swept through identifying exact subsequence matches. The matching subsequences from all PBWTs (each from a different template) may then be merged to produce all matching subsequences within the full (unmasked) haplotype alignment.
  • One example uses different repeating templates: for example □h□h, h□h□, □□hh, hh□□, □hh□, and h□□h, where sites at □ will be masked out and only sites at h are used to construct the IBD segments using, e.g., PBWT.
  • These example templates address haplotype data with no more than two errors per four site window.
  • the design of these six specific templates guarantees that all matches across any given four site window will be found as long as there are no more than two errors within the window. This is because given any possible arrangement of two errors across four sites in the original haplotype alignment at least one of the PBWT replicates will have those errors masked out and therefore still deliver the match correctly.
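  • The guarantee stated above can be verified exhaustively for a four site window: for every possible placement of up to two errors, at least one template must mask all of the erroneous sites. The "1"/"0" template encoding ("1" = unmasked, "0" = masked) and the function names below are assumptions made for this check.

```python
from itertools import combinations

TEMPLATES = ["0101", "1010", "0011", "1100", "0110", "1001"]


def masks_all(template, error_sites):
    """True if every erroneous site is masked ("0") in the template."""
    return all(template[k] == "0" for k in error_sites)


def tolerates_errors(templates, window, max_errors):
    """True if, for every placement of up to max_errors errors in the
    window, at least one template masks all of the erroneous sites."""
    for n_errors in range(max_errors + 1):
        for error_sites in combinations(range(window), n_errors):
            if not any(masks_all(t, error_sites) for t in templates):
                return False
    return True
```

  • The same check fails for three errors, because no template in this set masks more than two of the four sites.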
  • This method's sensitivity to errors may be modified by changing the arrangement and number of templates. For example, more templates could be utilized to ensure matches across longer windows; indeed, C(n, k) ("n choose k") templates are required to ensure all matches across windows of size n with no more than k errors per window. In practice genotyping error rates are often low enough that six templates would be adequate (templates of length 4 with two sites masked); even with a genotyping error rate as high as 0.001, the probability of three errors within a four site window is 3.996 × 10^−9. Running the templated PBWT replicates can be easily parallelized.
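  • The template count and the error probability quoted above can be reproduced with a short calculation. This sketch assumes that genotyping errors occur independently at each site with the same rate, which is a simplification adopted for the example; the function name is hypothetical.

```python
from math import comb


def prob_exact_errors(n_sites, n_errors, error_rate):
    """Binomial probability of exactly n_errors errors among n_sites independent sites."""
    return (
        comb(n_sites, n_errors)
        * error_rate ** n_errors
        * (1 - error_rate) ** (n_sites - n_errors)
    )


# Templates needed for windows of 4 sites with at most 2 errors: 4 choose 2.
n_templates = comb(4, 2)

# Probability of exactly three errors in a four site window at error rate 0.001.
p_three = prob_exact_errors(4, 3, 0.001)
```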
  • Templating the PBWT as described above to handle errors and return subsequence matches can be executed in linear time by passing through the data only once and avoiding the need for a post-hoc merging algorithm.
  • two arrays are constructed: ppa_k, the positional prefix array, and div_k, the divergence array (Durbin 2014).
  • ppa_k is a list of the haplotypes sorted so that their reversed prefixes (from k − 1 to 0) are ordered. This ordering ensures that haplotypes that match through position k − 1 will end up adjacent to one another in ppa_k.
  • the divergence array div_k keeps track of where those matches began: the ith element in div_k represents the beginning of the match between the ith element in ppa_k and the (i − 1)th element in ppa_k.
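  • For orientation, a minimal sketch of the standard single-site PBWT update (without templating) is given below. It is an illustration written for this passage, not the pseudocode of Algorithm 1; the function names and the toy haplotypes are assumptions. After sites 0 through k − 1 have been processed, the arrays correspond to ppa_k and div_k as described above.

```python
def pbwt_update(ppa, div, alleles, k):
    """Advance the positional prefix and divergence arrays across site k.

    alleles[h] is the 0/1 allele of haplotype h at site k.
    """
    zeros, ones = [], []
    zeros_div, ones_div = [], []
    p = q = k + 1  # match start for the next 0-allele / 1-allele haplotype
    for hap, start in zip(ppa, div):
        p = max(p, start)
        q = max(q, start)
        if alleles[hap] == 0:
            zeros.append(hap)
            zeros_div.append(p)
            p = 0
        else:
            ones.append(hap)
            ones_div.append(q)
            q = 0
    # Stable partition: haplotypes carrying allele 0 come first.
    return zeros + ones, zeros_div + ones_div


def build_arrays(haplotypes):
    """Sweep all sites once and return the final ppa and div arrays."""
    m, n = len(haplotypes), len(haplotypes[0])
    ppa, div = list(range(m)), [0] * m
    for k in range(n):
        column = [h[k] for h in haplotypes]
        ppa, div = pbwt_update(ppa, div, column, k)
    return ppa, div


# Toy alignment: 4 haplotypes by 4 bi-allelic sites.
haplotypes = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
]
ppa, div = build_arrays(haplotypes)
```

  • In this toy example the final ordering places haplotypes with the longest shared suffixes next to one another, and each entry of `div` (other than the first) gives the site at which the match with the preceding haplotype in `ppa` began.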
  • the method constructs a separate ppa_{j,k} and div_{j,k} for each template j used at site k.
  • a set of templates (as described above) may be formalized as an indicator function T (j, k) with the value 0 when the template j skips over site k and 1 if template j processes site k.
  • T (j, k) is called for each template j; if T (j, k) is 1 then ppa_{j,k} and div_{j,k} are assembled accordingly.
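  • One way the indicator function T (j, k) could be realized for repeating templates is sketched below. The string encoding, in which `-` marks a masked site and `h` a processed site, and the helper name `templates_processing_site` are assumptions made for this example.

```python
# Six repeating templates of length 4 with two sites masked ('-') each.
TEMPLATES = ["-h-h", "h-h-", "--hh", "hh--", "-hh-", "h--h"]


def T(j, k):
    """Return 1 if template j processes site k and 0 if it skips over site k."""
    pattern = TEMPLATES[j]
    return 1 if pattern[k % len(pattern)] == "h" else 0


def templates_processing_site(k):
    """Indices j of the templates whose ppa and div arrays are updated at site k."""
    return [j for j in range(len(TEMPLATES)) if T(j, k) == 1]
```

  • With these six templates, exactly three templates process any given site, so each site triggers three array updates rather than six.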
  • auxiliary data structures P_s and P_e are both M by M two-dimensional arrays in which the position x, y holds the start/end positions of the match between haplotype x and haplotype y. If another subsegment has already been stored, the routine checks to see if the new matching subsegment overlaps and possibly extends the existing subsegment. If they do not overlap, the routine checks if the old matching segment has a genetic length (in cM) of at least Lf and then reports it. The new matching subsegment is then stored in its place.
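  • The merge-and-extend bookkeeping described above may be sketched as follows. For brevity this example keeps the current match of each haplotype pair in a dictionary rather than in two M by M arrays, and it expresses positions directly in cM; both choices, and all names, are assumptions of the sketch.

```python
def store_fragment(current, reported, pair, start_cm, end_cm, min_len_cm):
    """Merge a new matching fragment into the current match for a haplotype pair.

    current maps pair -> (start, end) of the match being built for that pair.
    reported collects finished segments of length at least min_len_cm (Lf).
    """
    previous = current.get(pair)
    if previous is not None:
        prev_start, prev_end = previous
        if start_cm <= prev_end:
            # Overlap: extend the stored match instead of replacing it.
            current[pair] = (min(prev_start, start_cm), max(prev_end, end_cm))
            return
        if prev_end - prev_start >= min_len_cm:
            # No overlap: the old match is finished and long enough to report.
            reported.append((pair, prev_start, prev_end))
    current[pair] = (start_cm, end_cm)


current, reported = {}, []
store_fragment(current, reported, (0, 1), 1.0, 3.0, min_len_cm=3.0)
store_fragment(current, reported, (0, 1), 2.5, 6.0, min_len_cm=3.0)    # overlaps, extends
store_fragment(current, reported, (0, 1), 10.0, 11.0, min_len_cm=3.0)  # new match begins
```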
  • the “templating” of the haplotype alignment is performed within this modified form of the PBWT itself, and matching subsegments from each template are merged and extended directly as the haplotype alignment is passed through.
  • the templated PBWT has a worst-case time complexity of O(NMt) where t represents the number of templates defined within T (j, k); thus the method represents a linear tradeoff between the speed of PBWT and sensitivity to error.
  • An example templated PBWT is further detailed as pseudocode in Algorithm 1.
  • the algorithm employs 2 parameters: (1) Lm is the minimum number of sites that a sub-segment must span within the haplotype alignment to be merged and extend other sub-segments, and (2) Lf is the final minimum length (in cM) that a segment must have to be reported by the algorithm.
  • the algorithm handles missing data by extending the current longest match.
  • the longest matching haplotype to haplotype ppa_{j,k} will be either ppa_{j−1,k} or ppa_{j+1,k}, so if missing data in ppa_{j,k} is encountered, it is simply assumed that the haplotype continues to extend the longest match.
  • FIG. 5D illustrates the Templated PBWT data structures.
  • the TPBWT passes once through an M by N by t three-dimensional structure where M is the number of haplotypes, N is the number of bi-allelic sites, and t is the number of templates.
  • Each template is a pattern at which sites are masked out (shaded out in the figure).
  • two arrays are updated.
  • the positional prefix array ppa and the divergence array div are both two dimensional arrays of size M by t.
  • each of the t columns of ppa and div are updated for the templates that are not masked out.
  • Each of the t columns in ppa contains the haplotypes sorted in order of their reversed prefixes.
  • each of the t columns in div contains the position at which matches began between haplotypes adjacent to one another in the sorted order of ppa.
  • short fragments of IBD shared between haplotypes i and j, broken up by errors, are identified by each of the t templates (green arrows). As these fragments are identified, they are merged and extended with one another in the current match arrays P_s and P_e. While merging and extending IBD fragments, a heuristic may be used to scan for and fix putative phase switch errors, as will be discussed further herein.
  • FIG. 5C shows pseudocode of the algorithm.
  • long IBD segments may be fractured by phase switch errors introduced by phasing techniques used to phase the haplotype data of the individuals. The locations and frequencies of such fractures may occur in predictable ways.
  • a heuristic is employed to correct phase switch errors as IBD segments are identified. As noted herein in FIG. 5B and operation 566 , a heuristic may be applied to identify and correct potential phase switch errors by analysis of sequential haplotype sites. In certain embodiments, a heuristic involves merging adjacent potential IBD segments (matching segments) that end within a threshold distance and then joining or swapping haplotype segments of a given individual, as necessary, to correct phase switch errors.
  • this heuristic may improve IBD determination because it is biologically unlikely that two IBD segments on the same or opposite haplotypes of an individual both end within a particular distance (e.g., about 500 or fewer SNPs on the same or opposite haplotypes).
  • the phase switch heuristic is turned off between closely related pairs of individuals, e.g., between parent and child. For example, if an individual is trio-phased (phasing a child's genotype compared to the parent's genotype), the phasing is considered highly accurate and there are few to no phase switch errors. While the phase switch heuristic is discussed in the context of a Templated PBWT process, the heuristic may be used alone or in conjunction with any of various other algorithms that identify IBD segments for phased haplotype data. Such other algorithms may or may not include analyses that identify and/or correct genotyping and similar errors.
  • the start position of the new IBD segment is compared to the end position of an adjacent IBD segment.
  • the start position of the new IBD segment and the end position of the adjacent IBD segment are on the same haplotype with a gap between them.
  • the start position of the new IBD segment and the end position of the adjacent IBD segment are on opposing haplotypes with either a gap between them or an overlap.
  • the two IBD segments may be merged to form a single IBD segment.
  • the threshold value is between about 0-500 SNPs, about 0-300 SNPs, about 200-300 SNPs, or about 0-100 SNPs.
  • the threshold value for merging adjacent IBD segments is the same threshold value for determining that two haplotypes have a minimum number of sites that a sub-segment spans to be considered a potential IBD segment (Lm). If the two IBD segments are on opposite haplotypes, portions of the haplotypes (i.e., haplotype segments) may be swapped starting at the location of a break in the IBD segments.
  • the haplotypes remain swapped unless/until the heuristic determines another phase switch error has occurred and swaps the haplotypes.
  • the haplotypes used to identify potential IBD segments remain swapped.
  • the heuristic is used to correct the actual haplotypes for phase switch errors in addition to correcting IBD segments.
  • the merged potential IBD segment must have a minimum length Lf to be deemed an IBD segment.
  • FIG. 6A illustrates via a series of diagrams how the heuristic may be applied for four individuals that share IBD segments with a focal person.
  • Each pair of haplotypes (0 and 1; dotted lines) represents a copy of the haplotypes of the focal person, while the grey bars represent the IBD segments the focal person shares with four other individuals.
  • the top pair of haplotypes shows IBD segments of the focal person and another individual
  • the second pair of haplotypes shows IBD segments of the same focal person but with a second other individual, and so on.
  • While a focal person is provided for purposes of illustrating the heuristic, in some embodiments the heuristic is applied to correct phase switch errors in multiple or all individuals simultaneously.
  • a focal person may be a new user or customer that is added to the database of, e.g., a personal genetics platform.
  • the process runs using the new user or customer as the focal individual against the entire database.
  • Panels B through F represent the TPBWT's sweep along the chromosome from left to right, with the black arrow labeled TPBWT representing the current position.
  • As the TPBWT sweeps along the haplotypes identifying IBD matches, it uses a heuristic to identify and fix putative phase switch errors.
  • In diagram A, two haplotypes (0 and 1; dotted lines) of the focal person and the IBD segments they share with the four other individuals in the haplotype alignment are plotted.
  • the focal person has two phase switch errors (red dashed lines) that break up long IBD segments.
  • the Templated PBWT scans left to right along the chromosome, keeping track of IBD segments shared among all pairs of individuals.
  • If a phase switch error is inferred within one of the other individuals, then that other individual's haplotypes are swapped, and the focal person's haplotypes remain unswitched.
  • In diagram E, when the arrangement of IBD segments on the complementary haplotypes again suggests that another phase switch error has been encountered, the algorithm swaps the focal person's haplotypes again, but this time at the location of the other phase switch error.
  • In diagram F, the Templated PBWT continues to the end of the haplotypes after successfully identifying phase switch errors and “stitching” IBD fragments back into correct long IBD segments.
  • the heuristic is applied to correct phase switch errors when a new potential IBD segment is identified.
  • a potential IBD segment is identified when the Templated PBWT reaches the rightmost end of the potential IBD segment.
  • In diagrams C and E, only a single new potential IBD segment is identified because the TPBWT has not yet reached the ends of the other IBD segments, which is what triggers their identification as potential IBD segments and the application of the heuristic.
  • In panel E, the rightmost fragment in the second from top haplotype pair has not yet been identified since the TPBWT operation has not reached the fragment's rightmost end.
  • In panel F, the TPBWT has scanned further right along the chromosome, identified that fragment, and applied the heuristic to it (which merged it into the long IBD segment).
  • the heuristic is applied as the Templated PBWT iterates through successive sites along all chromosomes. As potential IBD segments are identified, the proximity of the identified IBD segment to a prior IBD segment on either haplotype is determined to infer whether there is a phase switch error and whether the IBD segments should be ‘stitched’ together to form a single IBD segment.
  • FIG. 6B illustrates the possible scenarios considered by the Templated PBWT for adjacent IBD segments.
  • Diagram A shows a first IBD segment shared by P and Q.
  • Diagrams B-E show the various combinations of second IBD segments between P and Q that may be considered by the heuristic (the grey box indicates the two potential IBD segments are within a threshold length of each other). The second IBD segments are within a threshold number of SNPs of the first IBD segments.
  • In diagram B, all IBD segments are on the same haplotype, but are separated by a gap. This may result from an even number of phase switch errors, causing the haplotypes to be swapped at both ends of the gap such that the potential IBD segments are on the same haplotype.
  • The phase switch errors within the gap may be unknown, and it is not necessary to infer which individual harbors the phase switch error(s).
  • phase switch errors may have occurred in P, Q, or P and Q, and the heuristic may be applied as a result of two potential IBD segments being identified within a threshold range of each other, as discussed herein.
  • While the Templated PBWT may be used to correct for short gaps, e.g., 1-3 SNPs, the gap illustrated here may be larger, for example up to about 100 SNPs, or about 300 SNPs, or about 500 SNPs. Such a gap may be caused by various errors, including multiple phase switch errors within the gap, such that the matching sites are insufficiently long to be considered potential IBD segments.
  • the heuristic as described herein infers that two segments within the threshold distance are likely to be a single segment broken up by errors, and thus merges them despite the gap.
  • Diagram C illustrates the second IBD segments being on opposite haplotypes for both P and Q, which may be the result of a phase switch error in both individuals. In such cases, the haplotypes may be swapped in both individuals.
  • Diagrams D and E illustrate either Q or P, respectively, having second IBD segments on the opposite haplotype. In these scenarios, if the second IBD segments are within a threshold distance of the first IBD segments but on the opposite haplotype, a phase switch error is inferred and the haplotypes from the second IBD segments forward may be swapped and the first and second IBD merged.
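  • The scenarios of diagrams B through E can be summarized as a small decision function. The sketch below is an assumption-laden illustration: segments are dictionaries holding site indices (`start`, `end`) and the haplotype (0 or 1) on which the segment lies in individuals P and Q, and the threshold of 300 SNPs is one of the example values mentioned herein.

```python
def classify_adjacent(first, second, threshold_snps):
    """Decide whether two potential IBD segments between P and Q should be merged.

    Returns None when the second segment does not start within threshold_snps
    of the end of the first (gap or overlap); otherwise returns which
    individuals, if any, should have their haplotypes swapped before merging.
    """
    gap = second["start"] - first["end"]  # negative values denote an overlap
    if abs(gap) > threshold_snps:
        return None
    return {
        "merge": True,
        "swap_p": first["hap_p"] != second["hap_p"],  # switch error inferred in P
        "swap_q": first["hap_q"] != second["hap_q"],  # switch error inferred in Q
    }


first = {"start": 100, "end": 900, "hap_p": 0, "hap_q": 0}
second_b = {"start": 950, "end": 1800, "hap_p": 0, "hap_q": 0}  # diagram B
second_c = {"start": 950, "end": 1800, "hap_p": 1, "hap_q": 1}  # diagram C
second_d = {"start": 950, "end": 1800, "hap_p": 0, "hap_q": 1}  # diagram D
second_e = {"start": 950, "end": 1800, "hap_p": 1, "hap_q": 0}  # diagram E
far_away = {"start": 2000, "end": 2900, "hap_p": 1, "hap_q": 0}
```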
  • the Templated PBWT handles haplotype error (miscalls) and missing data. It is also robust to “blip” phase switch errors in which the phase at a single site is swapped. However, phase switch errors spaced out along the chromosome will cause long regions of the haplotypes to be swapped and fragment IBD segments as illustrated in FIG. 1 . To handle these errors the Templated PBWT may apply a phase correction heuristic that scans for certain patterns of haplotype sharing to identify and correct phase switch errors. Note that for haploid data sets such as human male sex chromosomes this heuristic can be turned off. Large cohorts of samples have patterns of haplotype sharing that are highly informative regarding the location of phase switch errors.
  • phase switch errors in an individual will fragment all IBD segments shared with that individual at the position of the switch error.
  • Each IBD segment that spans the switch error will be broken into two fragments at the position of the error: these fragments will be on complementary haplotypes within the individual with the error and yet may remain on the same haplotype within the other individual.
  • this pattern of haplotype sharing may be the result of actual recombination patterns; however, for the majority of more distantly related individuals the pattern can be used to identify phase switch errors.
  • If the new segment begins near the end of the existing segment and is on the same haplotype as the existing segment in individual P but on the complementary haplotype in individual Q, then possibly there was a phase switch error in individual Q. The opposite pattern could also exist, suggesting a phase switch error in individual P.
  • the TPBWT will swap the haplotypes for the individuals containing the error (See FIG. 6B ). Now the new IBD segments merge and extend the fragments on the complementary haplotype that were broken up by the phase switch error.
  • the algorithm stops swapping the individual's haplotypes. This simple heuristic continues to the end of haplotypes “stitching” short stretches of IBD fragmented by errors back into the correct long IBD segments.
  • FIG. 6C illustrates a process for using a phase switching error correction heuristic as described herein.
  • the process 600 begins by receiving haplotype data for a plurality of individuals.
  • the process 600 may be performed while scanning the haplotype data using a method such as a Templated PBWT (e.g., during operation 566 in FIG. 5B ) or may be performed as a standalone process.
  • the process begins by receiving phased haplotype data to be considered.
  • the process loops over multiple haplotype sites, considering each one separately, but also considering adjacent and/or near sites that contain IBD segments, particularly nearly terminated IBD segments along with newly started IBD segments.
  • An operation 602 sets the next haplotype site for consideration.
  • this site incrementing operation may already have been performed.
  • An operation 603 determines that phased haplotypes of at least two individuals have first potential IBD segments that terminate at a first location.
  • first potential IBD segments may be identified after a matching subsequence having at least Lm sites terminates. The subsequences must possess more than a threshold length Lm to be considered possible IBD segments.
  • the first potential IBD segments terminate at a first location which may be stored for later reference.
  • An operation 605 determines that the at least two individuals have second IBD segments that start at a second location within a threshold distance of where the first IBD segment ended.
  • the threshold distance may be as described above. In some implementations, the distance may be either a gap or an overlap between the first and second potential IBD segments.
  • An operation 607 identifies or infers which individual, from among those having second IBD segments that start at a location within the threshold distance of where the first IBD segment ended, likely has a phase switch error.
  • the second potential IBD segments may be between any combination of haplotypes of the at least two individuals. See FIG. 6B . When the second potential IBD segment begins on the opposite haplotype as the first potential IBD segment in at least one of the individuals, a phase switch error is implied in the individual that has the first and second IBD segments on opposite haplotypes.
  • the operation infers that that individual has the phase switch error, and only that individual's haplotypes need correction for the phase switch error.
  • the operation may infer that both individuals have phase switch errors at the same or proximate positions.
  • operation 607 may be skipped. For example, as shown in diagram B of FIG. 6B , above, the potential IBD segments to be merged are on the same haplotype. In such embodiments it may be difficult to determine which individual had a phase switch error and also unnecessary to properly merge the potential IBD segments (as swapping the haplotypes is not necessary to have the potential IBD segments on the same haplotype).
  • the first potential IBD segments and the second potential IBD segments are merged. If the first potential IBD segments and the second potential IBD segments are on opposite haplotypes for any of the at least two individuals (i.e., a phase switch error occurred for those individuals), the haplotypes may be swapped for those individuals. The swap may occur at the location of the phase switch error.
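  • The haplotype swap itself is a simple operation on an individual's two phased haplotypes. The following sketch assumes each haplotype is a list of alleles indexed by site and that `position` is the site index at which the phase switch error was inferred; the function name is hypothetical.

```python
def swap_from(hap0, hap1, position):
    """Exchange the two haplotypes of one individual from position to the end."""
    new_hap0 = hap0[:position] + hap1[position:]
    new_hap1 = hap1[:position] + hap0[position:]
    return new_hap0, new_hap1


# Toy haplotypes in which every allele is labeled by its haplotype of origin.
hap0, hap1 = [0, 0, 0, 0], [1, 1, 1, 1]

# A first inferred switch error at site 2 swaps everything from site 2 onward.
once = swap_from(hap0, hap1, 2)

# A second inferred switch error at site 3 swaps again from site 3 onward,
# so only site 2 remains exchanged relative to the input.
twice = swap_from(once[0], once[1], 3)
```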
  • Operation 611 is an optional operation to determine whether each potential IBD segment is sufficiently long and/or meets other criteria to be considered a true IBD (e.g., a minimum length Lf). If the criteria are met, the potential IBD segments are determined to be actual IBD segments.
  • Operation 613 is an optional operation to correct for potential genotyping errors. See e.g., the discussion of the Templated PBWT.
  • the current haplotype site is checked for whether it is the last haplotype site. If it is the last haplotype site, the process finishes. If it is not the last haplotype site, the process returns to operation 602 to select the next haplotype site and continue scanning for IBD segments.
  • When process 600 is part of another method to identify IBD segments, e.g., a Templated PBWT, the loop may also allow the Templated PBWT algorithm to continue scanning the next haplotype site.
  • Hidden Markov Model (HMM)
  • FIG. 7 schematically illustrates using a HMM to process four haplotypes of two individuals (individual 1 and individual 2 , two haplotypes for each individual on chromosome 5) to correct phase switch errors, “stitching” IBD segments fractured by phase switch errors.
  • the HMM process covers the full span of the four haplotypes from left to right shown sequentially in the top panel, lower left panel, and lower right panel.
  • FIG. 8 shows a flow diagram illustrating process 800 for correcting phase switch errors in IBD segments using a hidden Markov model (HMM) according to some implementations.
  • the error correction optionally is initiated or triggered when IBD segments of the two individuals being compared meet one or more criteria. See decision box 802 .
  • This conditional trigger can avoid processing IBD segments that may not need corrections.
  • the HMM error correction process is triggered when the two individuals' IBD segments include two or more IBD segments on a single chromosome. This can avoid applying error correction when there is only a single IBD segment on a single chromosome where no phase switch errors have occurred.
  • the criterion is met when the two individuals' IBD segments exceed a minimum total amount of shared IBD.
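  • The optional trigger of decision box 802 might be expressed as below. The text presents the two criteria as alternative implementations; combining them with a logical OR, the record layout, and the `min_total_cm` parameter (for which no value is given herein) are all assumptions of this sketch.

```python
from collections import Counter


def should_run_hmm(segments, min_total_cm=None):
    """Return True if the pair's IBD segments meet a triggering criterion."""
    per_chromosome = Counter(seg["chrom"] for seg in segments)
    # Criterion 1: two or more IBD segments on a single chromosome.
    if any(count >= 2 for count in per_chromosome.values()):
        return True
    # Criterion 2 (optional): total shared IBD exceeds a minimum amount.
    if min_total_cm is not None:
        total = sum(seg["end_cm"] - seg["start_cm"] for seg in segments)
        return total > min_total_cm
    return False


one_segment = [{"chrom": 5, "start_cm": 10.0, "end_cm": 30.0}]
two_segments = one_segment + [{"chrom": 5, "start_cm": 35.0, "end_cm": 60.0}]
```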
  • process 800 proceeds to obtain an IBD state for each polymorphic site of a series of polymorphic sites of the two individuals. See the box 802 , “Yes” branch and box 804 .
  • the IBD state indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and if so, which of the two individuals' phased haplotypes are part of the IBD segment.
  • the series of polymorphic sites are located in one or more pairs of chromosomes of each individual.
  • the polymorphic sites are biallelic sites. In other implementations, more than two alleles may be implemented at a site.
  • the IBD states indicate different conditions of zero IBD, half IBD, and full IBD. In some implementations when the polymorphic site is a biallelic site, the IBD states include nine different IBD states corresponding to nine conditions of zero IBD, half IBD, and full IBD as further described in examples hereinafter.
  • Process 800 then involves applying the HMM to the IBD states. Box 806 .
  • the HMM model takes the IBD states as inputs and uses them as observed states of the model.
  • the HMM model also takes as input (i) a rate of recombination based on a number of meioses (m), (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes, and, optionally, (iii) genetic distances between consecutive sites on a chromosome. In some implementations, genetic distances between consecutive sites on a chromosome may be omitted.
  • model input herein refers to both variables and parameters.
  • the HMM model's transition rates or probabilities depend on (i) and (ii), and optionally (iii).
  • the application of the HMM model removes likely phase switch errors and produces error corrected IBD segments based on a most likely sequence of hidden IBD states given the observed IBD states. See block 808 .
  • Applying the HMM involves using transition probabilities and emission probabilities of the HMM to identify the most likely sequence of hidden IBD states given the observed IBD states.
  • the most likely sequence of hidden IBD states is identified using the Viterbi dynamic programming process.
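  • A generic Viterbi decoder is sketched below on a two-state toy model. It is not the 36-state model of the disclosure: the states, the probabilities, and the use of a single transition matrix for all sites are assumptions made to keep the example small. Log probabilities are used to avoid numerical underflow over long sequences.

```python
import math


def _log(x):
    return math.log(x) if x > 0 else float("-inf")


def viterbi(observed, initial, transition, emission):
    """Return the most likely hidden state sequence for the observed sequence."""
    n_states = len(initial)
    score = [_log(initial[s]) + _log(emission[s][observed[0]]) for s in range(n_states)]
    backpointers = []
    for obs in observed[1:]:
        new_score, pointers = [], []
        for cur in range(n_states):
            best_prev = max(
                range(n_states),
                key=lambda prev: score[prev] + _log(transition[prev][cur]),
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + _log(transition[best_prev][cur]) + _log(emission[cur][obs])
            )
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy model: states tend to persist and are usually observed correctly.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [[0.9, 0.1], [0.1, 0.9]]

smoothed = viterbi([0, 0, 1, 0, 0], initial, transition, emission)
switched = viterbi([0, 0, 0, 1, 1, 1], initial, transition, emission)
```

  • In the first toy sequence the isolated observation of state 1 is treated as noise and smoothed away, while in the second the sustained change is decoded as a true transition; this is loosely analogous to how the model distinguishes spurious breaks from genuine changes in IBD state.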
  • Process 800 is implemented using a computer. It is not practical or feasible to apply the model without a computer due to the complexity of the model. For example, applying the HMM requires using a 36 × 36 transition matrix and a 36 × 36 emission matrix for each polymorphic site, often at hundreds of thousands of polymorphic sites, to calculate a most likely sequence. Calculating even a single Viterbi sequence by hand could take a person many years and would be prone to errors.
  • the error correction process involves only the operations illustrated in boxes 804 , 806 , and 808 .
  • Such implementations include: (a) for each polymorphic site in a series of polymorphic sites of two individuals, obtaining an IBD state that indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and, if so, which of the two individuals' phased haplotypes are part of the IBD segment, wherein the series of polymorphic sites are comprised in or lie along one or more pairs of chromosomes; and (b) applying a hidden Markov model (HMM) to the IBD states to produce one or more error-corrected IBD segments, wherein the HMM model takes as input, in addition to the IBD states as observed IBD states, (i) a rate of recombination based on a number of meioses, (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes, and (ii
  • Some implementations of the disclosure include multiple iterations of applying the HMM to test different numbers of meioses (m). As illustrated in FIG. 8 , process 800 determines whether there are additional values of m that need to be tested. If so, the process loops back to box 804 to obtain IBD states and apply the HMM using a different value of m. In some implementations, the different numbers of meioses are in the range from 0.1 to 14 crossovers. In some implementations, the values of m are in the range from 1 to 14. See block 810 , “Yes” branch. If there are no additional values of m to be tested, process 800 proceeds to use the set of error corrected IBD segments having the highest probability as a final estimate of IBD segments for the two individuals. See decision block 810 , “No” branch, and block 812 . Thereafter, process 800 ends at block 814 .
  • FIG. 9A schematically illustrates the structure of the HMM model. It includes a series of hidden states (illustrated as circles on top) representing the ground-truth IBD states at a series of polymorphic sites and a series of observed states (illustrated as circles at the bottom) representing the observed IBD states based on phased haplotype data of the two individuals.
  • the arrows in the diagram denote conditional dependencies.
  • the hidden states obey the Markov property, such that the hidden state at any site depends on only the hidden state at the immediately previous site. In other words, H_l depends only on H_{l−1}. Moreover, the observed state at a particular site depends only on the hidden state at the particular site. In other words, O_l depends on only H_l.
  • the state space of the hidden variable is discrete.
  • the parameters of a HMM are of two types, transition probabilities and emission probabilities.
  • the transition probabilities between site l − 1 and site l determine the probability of H_l given H_{l−1}.
  • the emission probabilities at site l determine the probability of O_l given H_l.
  • $$\Pr(H_1, H_2, H_3, \ldots, O_1, O_2, O_3, \ldots) = \Pr(H_1)\,\Pr(O_1 \mid H_1)\,\Pr(H_2 \mid H_1)\,\Pr(O_2 \mid H_2)\,\Pr(H_3 \mid H_2)\,\Pr(O_3 \mid H_3) \cdots \qquad \text{(Eq. 1)}$$
  • where Pr(O_i | H_i) are emission probabilities/parameters
  • and Pr(H_i | H_{i−1}) are the transition probabilities/parameters.
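  • The factorization of Eq. 1 translates directly into code. The sketch below evaluates the joint probability of one hidden path and one observed sequence for a two-state toy model; the numerical values are illustrative assumptions and have no connection to the parameters of the disclosed model.

```python
def joint_probability(hidden, observed, initial, transition, emission):
    """Joint probability of a hidden path and observations, factored as in Eq. 1."""
    prob = initial[hidden[0]] * emission[hidden[0]][observed[0]]
    for prev, cur, obs in zip(hidden, hidden[1:], observed[1:]):
        prob *= transition[prev][cur] * emission[cur][obs]
    return prob


# Toy two-state model (illustrative values only).
initial = [0.9, 0.1]
transition = [[0.95, 0.05], [0.10, 0.90]]
emission = [[0.8, 0.2], [0.3, 0.7]]

# Pr(H1=0) Pr(O1=0|H1=0) Pr(H2=0|H1=0) Pr(O2=1|H2=0) Pr(H3=1|H2=0) Pr(O3=1|H3=1)
p = joint_probability([0, 0, 1], [0, 1, 1], initial, transition, emission)
```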
  • the hidden state space assumes one of N possible values, modeled as a discrete distribution. For each of the N possible states that a hidden variable at point l can be in, there is a transition probability from this state to each of the N possible states of the hidden variable at point l+1, for a total of N^2 transition probabilities. Note that the set of transition probabilities for transitions from any given state must sum to 1. As such, the N × N matrix of transition probabilities is a Markov matrix.
  • In addition, there is a set of emission probabilities governing the distribution of the observed variable at a particular point given the state of the hidden variable at that point.
  • the size of this set depends on the nature of the observed variable. For example, if the observed variable is discrete with M possible values, governed by a discrete distribution, there will be a total of N × M emission probabilities.
  • each polymorphic site is biallelic, and the IBD states at any site can include nine different IBD states, indicating nine conditions of zero IBD, half IBD, and full IBD.
  • site l can be observed as IBD between the two individuals.
  • the IBD state at site l, notated as c*_l, is represented by a string of 4 integers, each corresponding to one of the 4 haplotypes. The first two integers refer to the maternal and paternal haplotypes in individual 0 and the last two integers refer to the maternal and paternal haplotypes in individual 1.
  • If a haplotype at site l is not IBD, practitioners represent it as a 0.
  • the IBD states are expanded by multiplying these 9 different conditions of IBD by four types of phase switch errors. If one disregards the phase switch error types, there would be 9 × 9 transition rates between hidden states of two consecutive sites.
  • transition rates of the HMM are based upon a rate at which IBD segments start.
  • the rate at which IBD segments start is modeled as a function of the number of meioses. See box 706 , input (i).
  • the rate at which IBD segments start ( ⁇ s ) is modeled as follows.
  • m is the number of meioses, and the recombination rate is assumed to be 1 crossover per 100 cM.
  • transition rates of hidden IBD states are based on a rate at which IBD segments end.
  • the rate at which IBD segments end is modeled as a function of the number of meioses.
  • the rate at which IBD segments end ( ⁇ e ) is modeled as follows.
  • m is the number of meioses.
  • the IBD states include nine different IBD states, and transition rates are based on a transition matrix Q a in FIG. 9B .
  • each row includes rates for transitioning from the IBD state denoted by the four-letter string at site l to nine IBD states at l+1.
  • the transition rates of hidden IBD states are weighted by a probability that full IBD between the two individuals is truly present.
  • the probability that the full IBD between the two individuals is truly present is modeled as a logistic function of an amount of estimated full IBD.
  • the probability that full IBD between the two individuals is truly present ( ⁇ ) is modeled as follows.
  • ⁇ 2 is the amount of estimated full IBD
  • is an empirical parameter defining the steepness of the logistic function
  • the transition rates of hidden IBD states are weighted by weighting transitions into full IBD states with ⁇ , and weighting transitions out of full IBD states with 1/ ⁇ .
  • the IBD states include nine different IBD states, and the transition rates of hidden IBD states are based on a transition matrix as follows.
  • the transition rates of hidden IBD states are based on the at least one rate of phase switch error. See block 706 , model input (ii).
  • the IBD states include nine different IBD states as described herein.
  • the at least one rate of phase switch error includes a rate of phase switch error for each of the two individuals, ⁇ 1 and ⁇ 2 , respectively.
  • the phase switch error rates for the two individuals are the same when the same phasing method is used for both individuals.
  • the transition rates are based on the 36 × 36 transition matrix described as follows.
  • transition probabilities of hidden IBD states are based upon genetic distances between consecutive sites on a chromosome. See box 706 , model input (iii). In some implementations, transition probabilities of hidden IBD states are obtained by exponentiating a transition matrix. In some implementations, transition probabilities of hidden IBD states Y l+1 given hidden IBD states Y l are modeled as:
  • ⁇ 0 is a phase switch error rate for a first individual of the two individuals
  • ⁇ 1 is a phase switch error rate for a second individual of the two individuals
  • ⁇ 2 is an amount of estimated full IBD
  • Q is a transition matrix described by Eq. 6
  • d l is the genetic distance between sites l and l+1.
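  • As a minimal sketch of obtaining transition probabilities by exponentiating a transition matrix scaled by the genetic distance d l, the following uses a toy two-state rate matrix (not IBD, IBD) rather than the 9-state or 36-state matrices of the disclosure. The rates `LAMBDA_S` and `LAMBDA_E`, their values, and all function names are assumptions made only for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=30, squarings=10):
    """Approximate exp(m) with a truncated Taylor series plus scaling and squaring."""
    n = len(m)
    scale = 2 ** squarings
    scaled = [[x / scale for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probabilities(rate_matrix, genetic_distance):
    """Return exp(Q * d) for a rate matrix Q and a genetic distance d."""
    return mat_exp([[q * genetic_distance for q in row] for row in rate_matrix])


# Hypothetical rates: segments begin at LAMBDA_S and end at LAMBDA_E.
LAMBDA_S, LAMBDA_E = 0.02, 0.5
Q_TOY = [[-LAMBDA_S, LAMBDA_S],
         [LAMBDA_E, -LAMBDA_E]]

P = transition_probabilities(Q_TOY, genetic_distance=0.1)
```

Each row of `P` sums to 1, and a genetic distance of zero yields the identity matrix, so closely spaced sites rarely change hidden state.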
  • the emission probabilities of the HMM are dependent on phase switch errors. In some implementations, the emission probabilities are defined by a uniform error term that weights probabilities of observed IBD states based on the four different ways the two individuals may be in phase switch errors.
  • IBD segments shared between two related individuals are generated by passing along the four haplotypes of the two individuals. IBD segments begin and end following a Poisson process with rates that are determined by the number of meioses m that occurred on the pedigree between the two individuals. Phase switch errors occur following a Poisson process with a rate ⁇ determined by empirically testing statistical phasing methods.
  • Y l represents the different ways site l could be observed as IBD plus the different ways the two individuals may be in a phase switch error.
  • site l can be observed as IBD between the two individuals.
  • Practitioners notate the IBD state at site l as c* l , which is represented by a string of 4 integers each corresponding to the 4 haplotypes. The first two integers refer to the maternal and paternal haplotypes in individual 0 and the last two integers refer to the maternal and paternal haplotypes in individual 1.
  • if a haplotype at site l is not IBD, it is represented as a 0.
  • c* l = 0000 indicates that the two individuals are not IBD (zero IBD) at site l. There are 4 different ways the two individuals could be half IBD: 0101 is when the individuals are IBD through their paternal haplotypes, 1001 is when individual 0's maternal haplotype is IBD with individual 1's paternal haplotype, 0110 is when individual 0's paternal haplotype is IBD with individual 1's maternal haplotype, and 1010 is when the individuals are IBD through their maternal haplotypes.
  • the hidden states Y l represent the different ways site l could be observed as IBD and also include information about the different ways in which the two individuals may or may not be in a switch error.
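  • The four-integer notation can be decoded mechanically. The sketch below assumes that two haplotypes are IBD when they carry the same nonzero integer, which is consistent with the zero IBD and half IBD examples given here; the function and constant names are hypothetical, and the encoding of the remaining states is not covered.

```python
# Position order in the state string: individual 0 maternal, individual 0
# paternal, individual 1 maternal, individual 1 paternal.
HAPLOTYPES = (
    ("individual 0", "maternal"),
    ("individual 0", "paternal"),
    ("individual 1", "maternal"),
    ("individual 1", "paternal"),
)


def shared_haplotype_pairs(state):
    """List the cross-individual haplotype pairs marked IBD in a state string."""
    if len(state) != 4 or not state.isdigit():
        raise ValueError("state must be a string of 4 digits, e.g. '0101'")
    pairs = []
    for i in range(2):            # haplotypes of individual 0
        for j in range(2, 4):     # haplotypes of individual 1
            if state[i] != "0" and state[i] == state[j]:
                pairs.append((HAPLOTYPES[i], HAPLOTYPES[j]))
    return pairs


for code in ("0000", "0101", "1001", "0110", "1010"):
    print(code, shared_haplotype_pairs(code))
```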
  • The transitions among hidden states Y l are modeled with an instantaneous transition rate matrix. Setting aside, for the moment, transitions in which phase switch errors may occur and considering only transitions among the 9 IBD states that can be observed, the transition matrix Q a shown in FIG. 9A can be defined.
  • the matrix Q a defines the way the model moves between zero, half, and full IBD states. As the model passes along the chromosome ⁇ s is the rate at which IBD segments begin
  • ⁇ e represents the length of the IBD segments shared between individuals 0 and 1.
  • ⁇ s represents the length of segments with no IBD shared between the two individuals.
  • Phase switch errors break up half IBD segments into shorter adjacent half IBD segments on different haplotypes. Because the templated PBWTs procedure described above imperfectly estimates the start and end positions of IBD segments, overestimating the lengths of two adjacent half IBD segments can result in a short region of erroneous full IBD. Since full IBD is not expected for most pairs of relatives, the error in the observed proportion of full IBD is modeled using a simple logistic function. The probability of full IBD truly being present is denoted ⁇ , which is defined as
  • ⁇ 2 is the amount of full IBD estimated by the templated PBWTs.
  • ⁇ 2 ⁇ 25% the amount expected for full siblings
  • ⁇ 2 approaches zero
  • also approaches zero.
  • the probability of observing the IBD state c* l given the possible phase switch errors at site l is P(c* l
  • Phased IBD is used in the experiments described hereinafter. It has two stages: first the templated PBWT, and then the phase-correcting HMM.
  • the templated PBWT stage generates the IBD segments among all haplotypes very quickly and efficiently.
  • the second stage of the algorithm the HMM
  • the HMM the second stage of the algorithm
  • phase switch errors have not broken up their observed IBD segments and so the HMM does not apply.
  • the HMM, the slow stage of the two-part algorithm, is thus only applied to the small number of individuals within the dataset that are closely related. A pair of individuals is required to have at least 2 observed IBD segments on a single chromosome before being run through the phase-correcting HMM; additionally, a minimum total amount of shared IBD (in cM) can be required to increase the speed of the entire algorithm.
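  • The gating rule for the second stage can be sketched as a simple filter. The segment representation (chromosome, start in cM, end in cM), the function name, and the default thresholds other than the 2-segment requirement are assumptions for illustration.

```python
from collections import defaultdict


def needs_phase_correcting_hmm(segments, min_segments_per_chromosome=2,
                               min_total_cm=0.0):
    """Decide whether a pair's observed IBD segments warrant the HMM stage.

    segments: iterable of (chromosome, start_cm, end_cm) tuples for one pair.
    """
    per_chromosome = defaultdict(int)
    total_cm = 0.0
    for chromosome, start_cm, end_cm in segments:
        per_chromosome[chromosome] += 1
        total_cm += end_cm - start_cm
    # At least one chromosome must carry multiple observed segments.
    fragmented = any(count >= min_segments_per_chromosome
                     for count in per_chromosome.values())
    return fragmented and total_cm >= min_total_cm


pair_segments = [(1, 10.0, 35.0), (1, 35.5, 60.0), (7, 5.0, 12.0)]
print(needs_phase_correcting_hmm(pair_segments, min_total_cm=50.0))
```

Raising `min_total_cm` trades sensitivity for speed, since fewer pairs reach the slower HMM stage.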
  • IBD segments can be used for a wide range of purposes. For instance, the amount (length and number) of IBD sharing depends on the familial relationships between the tested individuals. Therefore, one application of IBD segment detection is to quantify relatedness. For example, methods for using IBD segments to quantify relatedness are described in U.S. Pat. No. 8,463,554, issued Jul. 11, 2013, which is incorporated by reference in its entirety for all purposes.
  • the number of shared IBD segments and the amount of DNA shared by two users are computed based on the IBD segments obtained as described above. In some implementations, the longest IBD segment is determined. In some implementations, the amount of DNA shared includes the sum of the lengths of IBD regions and/or percentage of DNA shared. The sum is referred to as IBDhalf or half IBD because the individuals share DNA identical by descent for at least one of the homologous chromosomes. The predicted relationship between the users, the range of possible relationships, or both, is determined using the IBDhalf and number of segments, based on the distribution pattern of IBDhalf and shared segments for different types of relationships.
  • the individuals have IBDhalf that is 100% the total length of all the autosomal chromosomes and 22 shared autosomal chromosome segments; in a second degree grandparent/grandchild relationship, the individuals have IBDhalf that is approximately half the total length of all the autosomal chromosomes and many more shared segments; in each subsequent degree of relationship, the percentage of IBDhalf of the total length is about 50% of the previous degree. Also, for more distant relationships, in each subsequent degree of relationship, the number of shared segments is approximately half of the previous number.
  • the distribution patterns are determined empirically based on survey of real populations. Different population groups may exhibit different distribution patterns. For example, the level of homozygosity within endogamous populations is found to be higher than in populations receiving gene flow from other groups.
  • the bounds of particular relationships are estimated using simulations of IBD using generated family trees. Based at least in part on the distribution patterns, the IBDhalf, and shared number of segments, the degree of relationship between two individuals can be estimated.
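  • The halving pattern of IBDhalf across degrees of relationship can be illustrated with a toy calculation. This sketch assumes an idealized expectation of 100% at the first degree and exactly half at each subsequent degree; it ignores the number of shared segments and the empirical distributions that the described estimation relies on, and the function names are hypothetical.

```python
import math


def expected_ibdhalf_fraction(degree):
    """Idealized fraction of the autosomal length that is IBDhalf at a degree."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return 0.5 ** (degree - 1)


def nearest_degree(ibdhalf_fraction, max_degree=10):
    """Pick the degree whose idealized expectation is closest on a log2 scale."""
    if not 0.0 < ibdhalf_fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    observed = math.log2(ibdhalf_fraction)
    return min(range(1, max_degree + 1),
               key=lambda d: abs(observed - math.log2(expected_ibdhalf_fraction(d))))


print(nearest_degree(0.26))
```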
  • IBD segments can also be used to determine ethnicity or ancestry. See, e.g., U.S. patent application Ser. No. 15/664,619, filed Jul. 31, 2017, which is incorporated by reference in its entirety for all purposes.
  • IBD can be used to perform genotype imputation.
  • Genotype imputation refers to the statistical inference of genotype information not directly assayed. This is especially helpful because many individuals only have sparsely assayed genotype data, usually targeting a limited number of genetic markers in the genome.
  • when IBD segments are determined between two individuals, it can be inferred that the genotypes of the two individuals are the same in the IBD segments.
  • the known genotype information of an IBD segment of one of the two individuals can be “imputed” into that of the other individual.
  • This further allows association study between phenotypes and genotypes even using individuals that have only the phenotype data collected but not the genotype data assayed. See, e.g., U.S. patent application Ser. No. 15/256,388, filed Sep. 2, 2016, which is incorporated by reference in its entirety for all purposes.
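  • The imputation idea can be sketched on a single shared haplotype. The example assumes alleles are stored per site with `None` marking sites that were not assayed, and that segments are given as half-open ranges of site indices; these representations and the function name are illustrative assumptions.

```python
def impute_from_ibd(target, donor, segments):
    """Fill unassayed sites of `target` from `donor` inside shared IBD segments.

    target, donor: lists of alleles (0, 1, or None for not assayed).
    segments: list of (start_site, end_site) index ranges, end exclusive.
    """
    imputed = list(target)
    for start, end in segments:
        for site in range(start, end):
            # Only fill gaps; never overwrite an assayed allele.
            if imputed[site] is None and donor[site] is not None:
                imputed[site] = donor[site]
    return imputed


target = [0, None, None, 1, None]
donor = [0, 1, 0, 1, 1]
print(impute_from_ibd(target, donor, [(1, 3)]))
```

Sites outside the IBD segment stay missing, because nothing can be inferred about them from this donor.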
  • FIG. 10 is a functional diagram illustrating a programmed computer system for performing the pipelined ancestry prediction process in accordance with some implementations.
  • Computer system 100 , which includes various subsystems as described below, includes at least one microprocessor subsystem (also referred to as a processor or a central processing unit (CPU)) 102 .
  • processor 102 can be implemented by a single-chip processor or by multiple processors.
  • processor 102 is a general purpose digital processor that controls the operation of the computer system 100 . Using instructions retrieved from memory 110 , the processor 102 controls the reception and manipulation of input data, and the output and display of data on output devices (e.g., display 118 ).
  • processor 102 includes and/or is used to provide phasing, genotype error correction, and/or phasing error correction, etc. as described herein.
  • Processor 102 is coupled bi-directionally with memory 110 , which can include a first primary storage area, typically a random access memory (RAM), and a second primary storage area, typically a read-only memory (ROM).
  • primary storage can be used as a general storage area and as scratch-pad memory, and can also be used to store input data and processed data.
  • Primary storage can also store programming instructions and data, in the form of data objects and text objects, in addition to other data and instructions for processes operating on processor 102 .
  • primary storage typically includes basic operating instructions, program code, data, and objects used by the processor 102 to perform its functions (e.g., programmed instructions).
  • memory 110 can include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bi-directional or uni-directional.
  • processor 102 can also directly and very rapidly retrieve and store frequently needed data in a cache memory (not shown).
  • a removable mass storage device 112 provides additional data storage capacity for the computer system 100 , and is coupled either bi-directionally (read/write) or uni-directionally (read only) to processor 102 .
  • storage 112 can also include computer-readable media such as magnetic tape, flash memory, PC-CARDS, portable mass storage devices, holographic storage devices, and other storage devices.
  • a fixed mass storage 120 can also, for example, provide additional data storage capacity. The most common example of mass storage 120 is a hard disk drive.
  • Mass storage 112 , 120 generally store additional programming instructions, data, and the like that typically are not in active use by the processor 102 . It will be appreciated that the information retained within mass storage 112 and 120 can be incorporated, if needed, in standard fashion as part of memory 110 (e.g., RAM) as virtual memory.
  • bus 114 can also be used to provide access to other subsystems and devices. As shown, these can include a display monitor 118 , a network interface 116 , a keyboard 104 , and a pointing device 106 , as well as an auxiliary input/output device interface, a sound card, speakers, and other subsystems as needed.
  • the pointing device 106 can be a mouse, stylus, track ball, or tablet, and is useful for interacting with a graphical user interface.
  • the network interface 116 allows processor 102 to be coupled to another computer, computer network, or telecommunications network using a network connection as shown.
  • the processor 102 can receive information (e.g., data objects or program instructions) from another network or output information to another network in the course of performing method/process steps.
  • Information, often represented as a sequence of instructions to be executed on a processor, can be received from and outputted to another network.
  • An interface card or similar device and appropriate software implemented by (e.g., executed/performed on) processor 102 can be used to connect the computer system 100 to an external network and transfer data according to standard protocols.
  • various process implementations disclosed herein can be executed on processor 102 , or can be performed across a network such as the Internet, intranet networks, or local area networks, in conjunction with a remote processor that shares a portion of the processing.
  • Additional mass storage devices can also be connected to processor 102 through network interface 116 .
  • auxiliary I/O device interface can be used in conjunction with computer system 100 .
  • the auxiliary I/O device interface can include general and customized interfaces that allow the processor 102 to send and, more typically, receive data from other devices such as microphones, touch-sensitive displays, transducer card readers, tape readers, voice or handwriting recognizers, biometrics readers, cameras, portable mass storage devices, and other computers.
  • various implementations disclosed herein further relate to computer storage products with a computer readable medium that includes program code for performing various computer-implemented operations.
  • the computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system.
  • Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices.
  • Examples of program code include both machine code, as produced, for example, by a compiler, or files containing higher level code (e.g., script) that can be executed using an interpreter.
  • the computer system shown in FIG. 10 is but an example of a computer system suitable for use with the various implementations disclosed herein.
  • Other computer systems suitable for such use can include additional or fewer subsystems.
  • bus 114 is illustrative of any interconnection scheme serving to link the subsystems.
  • Other computer architectures having different configurations of subsystems can also be utilized.
  • FIG. 11 is a block diagram illustrating an implementation of an IBD-based personal genomics services system that provides services based on IBD information, which include but are not limited to relatedness estimation, relative detection, ancestry determination, and genotype-phenotype association.
  • a user uses a client device 1102 to communicate with an IBD-based personal genomics services system 1106 via a network 1104 .
  • Examples of device 1102 include a laptop computer, a desktop computer, a smart phone, a mobile device, a tablet device or any other computing device.
  • IBD-based personal genomics services system 1106 is used to perform a pipelined process to predict ancestry based on a user's IBD information.
  • IBD-based personal genomics services system 1106 can be implemented on a networked platform (e.g., a server or cloud-based platform, a peer-to-peer platform, etc.) that supports various applications.
  • implementations of the platform perform ancestry prediction and provide users with access (e.g., via appropriate user interfaces) to their personal genetic information (e.g., genetic sequence information and/or genotype information obtained by assaying genetic materials such as blood or saliva samples) and predicted ancestry information.
  • the platform also allows users to connect with each other and share information.
  • Device 100 can be used to implement 1102 or 1106 .
  • DNA samples e.g., saliva, blood, etc.
  • the genotype information is obtained (e.g., from genotyping chips directly or from genotyping services that provide assayed results) and stored in database 1108 and is used by system 1106 to make ancestry predictions.
  • Reference data including genotype data of reference individuals, simulated data (e.g., results of machine-based processes that simulate biological processes such as recombination of parents' DNA), pre-computed data (e.g., a precomputed reference haplotype data used in phasing and model training) and the like can also be stored in database 1108 or any other appropriate storage unit.
  • This experiment compares a method according to some implementations described above with other computer-implemented methods known in the art. IBD accuracy and computational performance are compared among the methods.
  • Phased IBD includes techniques as described in the templated PBWT and the HMM examples above.
  • PBWT Burrows-Wheeler transform
  • RaPID RaPID
  • A. Naseri, X. Liu, S. Zhang, and D. Zhi. Ultra-fast identity by descent detection in biobank-scale cohorts using positional Burrows-Wheeler transform. bioRxiv, page 103325, 2017.
  • The method described by Browning is labeled Refined IBD. See B. L. Browning and S. R. Browning. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics, 194(2):459-471, 2013.
  • A method described by Zhou is labeled hap-IBD. See Y. Zhou, S. R. Browning, and B. L. Browning. A fast and simple method for detecting identity by descent segments in large-scale data. bioRxiv, 2019.
  • A method described by Shemirani is labeled iLASH. See R. Shemirani, G. M. Belbin, C. L. Avery, E. E. Kenny, C. R. Gignoux, and J. L. Ambite. Rapid detection of identity-by-descent tracts for mega-scale datasets. bioRxiv, page 749507, 2019.
  • FIG. 12 shows results comparing the speed of different IBD inference methods.
  • IBD segments were computed for 50781 SNPs from human chromosome 1.
  • the x-axis shows the number of haplotypes and the y-axis shows the time in seconds to infer IBD.
  • Phased IBD used a minimum IBD segment length of 200 sites and 1.5 cM.
  • Durbin's PBWT used a 200 site minimum.
  • RaPID version 1.2.3 used 10 runs, 2 successes, window size of 35, and minimum length of 1.5 cM.
  • Refined IBD used a minimum length of 1.5 cM and all default parameter value settings.
  • the results of FIG. 12 show that Phased IBD and PBWT are the fastest among the four methods and similar to each other. RaPID is the slowest.
  • Given that Phased IBD can correct various errors (including genotyping errors and phase switch errors) that cannot be addressed by PBWT, it is noteworthy that Phased IBD achieves computational speed similar to that of PBWT.
  • Although RaPID and Refined IBD can correct errors, albeit to a lesser extent than Phased IBD as shown in FIGS. 14 and 15 , they require significantly longer computer run time.
  • FIGS. 13-16 compare the IBD estimate errors (or the opposite of accuracy) of various methods.
  • FIG. 13 compares the absolute error in number of IBD segments between the Templated PBWT method (x-axis) and Phased IBD (y-axis) that includes both Templated PBWT component and the HMM component.
  • the results of FIG. 13 show that the HMM process greatly reduces the error, reducing the maximum number of erroneous segments about threefold, from about 300 to about 100.
  • FIG. 14 shows that Phased IBD is more accurate than PBWT.
  • Each axis shows the proportion of the genome with incorrect IBD estimates.
  • PBWT x-axis
  • y-axis is sensitive to both genotyping and phasing errors compared to Phased IBD
  • FIG. 15 shows that Phased IBD is more accurate than Refined IBD.
  • Refined IBD x-axis
  • Phased IBD y-axis
  • Refined IBD outperforms both PBWT and RaPID.
  • FIG. 16 shows that Phased IBD is more accurate than RaPID (version 1.2.3).
  • RaPID x-axis
  • PBWT PBWT
  • Phased IBD “stitches” together long IBD segments highly fragmented by phasing and genotyping errors.
  • Recombination was simulated using a Poisson model with a rate of 1 expected crossover per 100 cM. This resulted in simulated haplotypes for 2000 closely related pairs of individuals with perfectly known IBD segments, 400 pairs of each relationship type: parent-child, grandparent-grandchild, aunt-niece, first cousins, and siblings.
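  • A Poisson model with 1 expected crossover per 100 cM can be sketched by drawing exponentially distributed gaps between crossover positions. The chromosome length, the seed, and the function name below are assumptions; the pedigree simulation itself is not reproduced.

```python
import random


def simulate_crossovers(chromosome_length_cm, rng=None, rate_per_cm=1 / 100):
    """Draw crossover positions (in cM) from a Poisson process along a chromosome."""
    rng = rng or random.Random()
    positions = []
    # Gaps between events of a Poisson process are exponentially distributed.
    position = rng.expovariate(rate_per_cm)
    while position < chromosome_length_cm:
        positions.append(position)
        position += rng.expovariate(rate_per_cm)
    return positions


rng = random.Random(0)
crossovers = simulate_crossovers(200.0, rng)
print(len(crossovers), [round(p, 1) for p in crossovers])
```

On a 200 cM chromosome this yields about two crossovers per meiosis on average.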
  • Genotyping errors were introduced into the simulated data set using a simple model. At each position along the simulated chromosomes an error in the genotype call was introduced with a probability of 0.001. When a site was selected for an error, half of the genotype call would be flipped with equal probability (e.g., a 0/0 genotype would be converted to a 1/0 or a 0/1 with equal probability).
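  • The genotyping error model can be sketched as follows, assuming biallelic sites with alleles coded 0 and 1 and genotypes stored as pairs of alleles; the function name and data layout are illustrative assumptions.

```python
import random


def add_genotyping_errors(genotypes, error_rate=0.001, rng=None):
    """Flip one allele of a genotype call with probability `error_rate` per site.

    genotypes: list of (allele_a, allele_b) tuples with alleles coded 0 or 1.
    """
    rng = rng or random.Random()
    noisy = []
    for allele_a, allele_b in genotypes:
        if rng.random() < error_rate:
            # Choose which half of the call to flip with equal probability.
            if rng.random() < 0.5:
                allele_a = 1 - allele_a
            else:
                allele_b = 1 - allele_b
        noisy.append((allele_a, allele_b))
    return noisy


calls = [(0, 0), (0, 1), (1, 1), (0, 0)]
print(add_genotyping_errors(calls, error_rate=0.001, rng=random.Random(42)))
```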
  • Statistical phasing errors were also introduced into the simulated haplotype datasets. All of the simulated haplotypes were converted into their respective diploid genotypes and then the statistical haplotype phasing method Eagle2 was used. For the phasing reference panel a phasing panel that included about 200000 non-Europeans and about 300000 Europeans was used.
  • FIG. 17 shows that Templated PBWT had less error in the estimated number of IBD segments shared between relatives than all other methods analyzed.
  • the y-axis represents the number of erroneous IBD segments estimated for a simulated pair of relatives. Error was highest in closely related pairs that shared long IBD segments, particularly parent-child and siblings.
  • FIG. 18 shows that Templated PBWT had less error in the estimated percentage of the genome that is IBD in simulated relatives than other methods.
  • PBWT had less error than other methods except Templated PBWT, while hap-IBD and Refined IBD had the largest error. Error was higher in simulated pairs that shared long IBD segments, such as parent-child, compared to more distant relative pairs such as first cousins. Compared to Templated PBWT, the other methods were more sensitive to phasing and genotyping errors in estimated IBD segments.
  • FIG. 19 shows false negative (charts 1901 and 1905 ) and false positive rates (charts 1903 a - b and 1907 a - b ) of inferring IBD by various methods. Rates were calculated for bins of IBD segment lengths. False negative rate by segment is the proportion of true segments in a size bin that do not overlap any segment compared to the total number of true segments in the size bin. False negative rate by segment coverage is the proportion of the length of true segments in a size bin not covered by any estimated segment compared to the total length of true segments in the size bin. False positive rate by segment is the proportion of estimated segments in a size bin that do not overlap any true segment compared to the total number of estimated segments in the size bin.
  • False positive rate by segment coverage is the proportion of the length of estimated segments in a size bin not covered by any true segment compared to the total length of estimated segments in the size bin.
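  • The four rates defined above share one computation with the roles of true and estimated segments swapped. The sketch below assumes segments on a single chromosome are given as (start, end) intervals in cM and that the query list has already been restricted to one size bin; the function names are hypothetical.

```python
def overlap_length(segment, others):
    """Length of `segment` covered by the union of the intervals in `others`."""
    start, end = segment
    clipped = sorted((max(start, s), min(end, e))
                     for s, e in others if min(end, e) > max(start, s))
    covered, cursor = 0.0, start
    for s, e in clipped:
        if e > cursor:
            covered += e - max(s, cursor)
            cursor = e
    return covered


def miss_rates(query, reference):
    """Return (rate by segment, rate by segment coverage) for `query` segments
    that are not supported by `reference` segments."""
    if not query:
        return 0.0, 0.0
    covered = [overlap_length(segment, reference) for segment in query]
    unmatched = sum(1 for c in covered if c == 0)
    total_length = sum(e - s for s, e in query)
    return unmatched / len(query), (total_length - sum(covered)) / total_length


true_segments = [(0, 10), (20, 30)]
estimated_segments = [(0, 5)]
false_negative = miss_rates(true_segments, estimated_segments)
false_positive = miss_rates(estimated_segments, true_segments)
print(false_negative, false_positive)
```

Passing true segments as the query gives the false negative rates, and passing estimated segments as the query gives the false positive rates.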
  • Plots 1903 b and 1907 b present the false positive rate with a smaller y-axis scale than plots 1903 a and 1907 a, respectively.
  • IBD segments ⁇ 4 cM all methods had low false positive rates.
  • IBD segments greater than ⁇ 6 cM the Templated PBWT outperformed all other methods.
  • FIG. 20A shows IBD computation runtimes for various methods. All methods were run using 1 CPU core. Templated PBWT was faster than all other methods except Durbin's PBWT. The relative time shows the runtime to compute IBD for each haplotype in sample sizes of 400 to 20000 haplotypes relative to the time needed to compute IBD for each haplotype in a sample size of 400. A slope near zero indicates linear time complexity, while a positive slope indicates super-linear time complexity. Templated PBWT shows a near linear time complexity. FIG. 20B provides additional compute times for parallelized IBD analyses with large sample sizes.
  • Times are shown for in-sample IBD computes on 1 million individuals, out-of-sample IBD computes on 10 k individuals against 1 million, and out-of-sample IBD computes on 10 k individuals against 10 million.
  • the first two rows show the compute times measured when IBD was estimated over 42927 sites of human chromosome 1.
  • the last three rows show those compute times extrapolated to 23 chromosomes with a total of 600 k sites.
  • the last row additionally extrapolates the time for an out-of-sample analysis on 1 million to 10 million individuals.
  • CPU time is the sum of the computation time for all compute cores.
  • Wall clock time is the “real” time that the entire analysis took to run.
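  • The relative per-haplotype runtime described for FIG. 20A can be computed as sketched below. The runtimes used here are made-up numbers chosen only to contrast linear and quadratic scaling; they are not measurements from the experiments, and the function name is hypothetical.

```python
def relative_per_haplotype_time(sample_sizes, runtimes_seconds):
    """Per-haplotype runtime at each sample size relative to the smallest size."""
    per_haplotype = [t / n for n, t in zip(sample_sizes, runtimes_seconds)]
    baseline = per_haplotype[0]
    return [p / baseline for p in per_haplotype]


sizes = [400, 4000, 20000]
linear_runtimes = [4.0, 40.0, 200.0]          # hypothetical, grows with n
quadratic_runtimes = [4.0, 400.0, 10000.0]    # hypothetical, grows with n squared

print(relative_per_haplotype_time(sizes, linear_runtimes))
print(relative_per_haplotype_time(sizes, quadratic_runtimes))
```

A flat series corresponds to the near-zero slope of a linear-time method, while a rising series indicates super-linear scaling.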
  • the v4 platform had 453065 SNPs and v5 platform had 544042 SNPs.
  • Haplotypes were phased using Eagle2 as described in Loh et al., Reference-based phasing using the haplotype reference consortium panel. Nature genetics, 48(11):1443, 2016. Individuals on the v4 platform were phased with a reference panel containing 691759 samples. Individuals on the v5 platform were phased with a reference panel containing 286305 samples.
  • IBD sharing among the 9517 individuals was computed using the Templated PBWT with the parameters described in Table 1. IBD estimates among individuals on the same genotyping platform were made using the in-sample method described above, and estimates among individuals on different platforms were made using the out-of-sample approach described above over the intersection of platform SNPs (only the SNPs present in both the v4 and v5 genotyping platforms).
  • Hierarchical clustering of the mean pairwise IBD haplotype sharing across Mexican states was performed using Ward's method (Ward Jr 1963) in R. To remove close relatives, any pair of individuals that shared more than 20 cM was excluded. Geographic maps of the mean pairwise IBD shared across Mexican states were made using the R packages mxmaps, ggplot2, and viridis (Valle-Jones 2019; Wickham 2016; Garnier 2018).
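  • The input to the clustering, a table of mean pairwise IBD sharing between states with close relatives removed, can be sketched as follows. The analysis above was carried out in R; this Python sketch covers only the averaging step, and the individuals, sharing values, and function name are hypothetical.

```python
from collections import defaultdict


def mean_pairwise_sharing(pair_sharing_cm, state_of, max_cm=20.0):
    """Average IBD sharing (cM) for each pair of states, skipping close relatives.

    pair_sharing_cm: dict mapping (individual_a, individual_b) to shared cM.
        Pairs sharing 0 cM must be present to count toward the mean.
    state_of: dict mapping each individual to a state name.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (a, b), shared_cm in pair_sharing_cm.items():
        if shared_cm > max_cm:
            continue  # close relatives are excluded
        key = tuple(sorted((state_of[a], state_of[b])))
        totals[key] += shared_cm
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical individuals and sharing values, for illustration only.
state_of = {"a": "Nuevo Leon", "b": "Nuevo Leon", "c": "Yucatan", "d": "Nuevo Leon"}
sharing = {("a", "b"): 12.0, ("a", "c"): 4.0, ("b", "c"): 6.0, ("a", "d"): 1500.0}
print(mean_pairwise_sharing(sharing, state_of))
```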
  • FIGS. 21 and 22 show IBD haplotype sharing across Mexican states as determined by a Templated PBWT method.
  • Hierarchical clustering of IBD sharing across Mexican states identified geographic clusters with elevated levels of haplotype sharing, as shown in FIG. 21 . There were two large clusters: one cluster in the Yucatan peninsula and the southern Mexican states, and another cluster representing Mexico City and the central and northern states. The clusters were further subdivided into individual states.
  • Mean pairwise IBD haplotype sharing was highest within states and among geographically neighboring states, as shown in FIG. 22 .
  • mean IBD shared among individuals with all 4 grandparents from Nuevo Leon was over 12 cM
  • the mean pairwise IBD shared between individuals with all 4 grandparents from Nuevo Leon and individuals with all 4 grandparents from neighboring Coahuila and Tamaulipas was over 10 cM
  • mean pairwise sharing between individuals with all 4 grandparents from Nuevo Leon and individuals with all 4 grandparents from Yucatan was less than 6 cM. Similar geographically stratified IBD sharing was found throughout Mexico, as shown in FIG. 22 .

US16/947,107 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments Abandoned US20210020266A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/947,107 US20210020266A1 (en) 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments
US17/249,520 US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962876497P 2019-07-19 2019-07-19
US16/947,107 US20210020266A1 (en) 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/249,520 Continuation US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments

Publications (1)

Publication Number Publication Date
US20210020266A1 true US20210020266A1 (en) 2021-01-21

Family

ID=74192449

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/947,107 Abandoned US20210020266A1 (en) 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments
US17/249,520 Abandoned US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/249,520 Abandoned US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments

Country Status (4)

Country Link
US (2) US20210020266A1 (de)
EP (1) EP4000070A4 (de)
CA (1) CA3147888A1 (de)
WO (1) WO2021016114A1 (de)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200321073A1 (en) * 2019-04-03 2020-10-08 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US11049589B2 (en) 2008-12-31 2021-06-29 23Andme, Inc. Finding relatives in a database
US11170047B2 (en) 2012-06-06 2021-11-09 23Andme, Inc. Determining family connections of individuals in a database
US11171962B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genome sharing
US11170873B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genetic comparisons between grandparents and grandchildren
US11514627B2 (en) 2019-09-13 2022-11-29 23Andme, Inc. Methods and systems for determining and displaying pedigrees
US11521708B1 (en) 2012-11-08 2022-12-06 23Andme, Inc. Scalable pipeline for local ancestry inference
US11531445B1 (en) 2008-03-19 2022-12-20 23Andme, Inc. Ancestry painting
US20230019141A1 (en) * 2021-07-07 2023-01-19 Mars, Incorporated System, method, and apparatus for predicting genetic ancestry
US11783919B2 (en) 2020-10-09 2023-10-10 23Andme, Inc. Formatting and storage of genetic markers
US11817176B2 (en) 2020-08-13 2023-11-14 23Andme, Inc. Ancestry composition determination

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050191731A1 (en) * 1999-06-25 2005-09-01 Judson Richard S. Methods for obtaining and using haplotype data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070277267A1 (en) * 1999-10-15 2007-11-29 Byrum Joseph R Nucleic acid molecules and other molecules associated with plants
US8463554B2 (en) 2008-12-31 2013-06-11 23Andme, Inc. Finding relatives in a database
US10777302B2 (en) 2012-06-04 2020-09-15 23Andme, Inc. Identifying variants of interest by imputation
US9977708B1 (en) 2012-11-08 2018-05-22 23Andme, Inc. Error correction in ancestry classification
EP3207483A4 (en) * 2014-10-17 2018-04-04 Ancestry.com DNA, LLC Ancestral human genomes

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Centimorgan." CentiMorgan - ISOGG Wiki, 10 July 2010, https://isogg.org/wiki/CentiMorgan. (Year: 2010) *
Browning, Brian L., and Sharon R. Browning. "Improving the accuracy and efficiency of identity-by-descent detection in population data." Genetics 194.2 (2013): 459-471. (Year: 2013) *
Browning, Sharon R., and Brian L. Browning. "High-resolution detection of identity by descent in unrelated individuals." The American Journal of Human Genetics 86.4 (2010): 526-539. (Year: 2011) *
Li, Hong, et al. "Relationship estimation from whole-genome sequence data." PLoS genetics 10.1 (2014): e1004144. (Year: 2014) *
Upton, Alex, et al. "High-performance computing to detect epistasis in genome scale data sets." Briefings in bioinformatics 17.3 (2016): 368-379. (Year: 2016) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11683315B2 (en) 2007-10-15 2023-06-20 23Andme, Inc. Genome sharing
US11170873B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genetic comparisons between grandparents and grandchildren
US11171962B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genome sharing
US11531445B1 (en) 2008-03-19 2022-12-20 23Andme, Inc. Ancestry painting
US11468971B2 (en) 2008-12-31 2022-10-11 23Andme, Inc. Ancestry finder
US11322227B2 (en) 2008-12-31 2022-05-03 23Andme, Inc. Finding relatives in a database
US11049589B2 (en) 2008-12-31 2021-06-29 23Andme, Inc. Finding relatives in a database
US11508461B2 (en) 2008-12-31 2022-11-22 23Andme, Inc. Finding relatives in a database
US11170047B2 (en) 2012-06-06 2021-11-09 23Andme, Inc. Determining family connections of individuals in a database
US11521708B1 (en) 2012-11-08 2022-12-06 23Andme, Inc. Scalable pipeline for local ancestry inference
US20200321073A1 (en) * 2019-04-03 2020-10-08 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US11848073B2 (en) * 2019-04-03 2023-12-19 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US11514627B2 (en) 2019-09-13 2022-11-29 23Andme, Inc. Methods and systems for determining and displaying pedigrees
US11817176B2 (en) 2020-08-13 2023-11-14 23Andme, Inc. Ancestry composition determination
US11783919B2 (en) 2020-10-09 2023-10-10 23Andme, Inc. Formatting and storage of genetic markers
US20230019141A1 (en) * 2021-07-07 2023-01-19 Mars, Incorporated System, method, and apparatus for predicting genetic ancestry

Also Published As

Publication number Publication date
CA3147888A1 (en) 2021-01-28
WO2021016114A1 (en) 2021-01-28
EP4000070A1 (en) 2022-05-25
US20210193257A1 (en) 2021-06-24
EP4000070A4 (en) 2023-08-09

Similar Documents

Publication Publication Date Title
US20210020266A1 (en) Phase-aware determination of identity-by-descent dna segments
US20230386611A1 (en) Deep learning-based variant classifier
US20230402132A1 (en) Error Correction in Ancestry Classification
US10600217B2 (en) Methods for the graphical representation of genomic sequence data
CA2964902C (en) Ancestral human genomes
US10699803B1 (en) Ancestry painting with local ancestry inference
US10192026B2 (en) Systems and methods for genomic pattern analysis
US10847248B2 (en) Techniques for determining haplotype by population genotype and sequence data
Freyman et al. Fast and robust identity-by-descent inference with the templated positional burrows–wheeler transform
US20080172209A1 (en) Identifying associations using graphical models
Llinares-López et al. Genome-wide genetic heterogeneity discovery with categorical covariates
Alkan et al. RedNemo: topology-based PPI network reconstruction via repeated diffusion with neighborhood modifications
Stram et al. SNP Imputation for Association Studies
Wu et al. A practical algorithm based on particle swarm optimization for haplotype reconstruction
Choudhury et al. HAPI-Gen: Highly accurate phasing and imputation of genotype data
Yang et al. Improved detection algorithm for copy number variations based on hidden Markov model
Zhang Biobank-scale ancestral recombination graphs: inference and applications to the analysis of complex traits
Lavenier Genome Assembly
Wang Missing SNP Genotype Imputation
Huang Computational Methods Using Large-Scale Population Whole-Genome Sequencing Data
Parbhoo Multilink Clustering: An Alternative Approach
NZ789499A (en) Deep learning-based variant classifier
NZ791625A (en) Variant classifier based on deep neural networks
Kretzschmar Methods for phasing and imputation of very low coverage sequencing data

Legal Events

Date Code Title Description
AS Assignment Owner name: 23ANDME, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FREYMAN, WILLIAM A.;MCMANUS, KIMBERLY F.;SHRINGARPURE, SUYASH S.;AND OTHERS;SIGNING DATES FROM 20200806 TO 20200807;REEL/FRAME:053556/0220
STPP Information on status: patent application and granting procedure in general Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED
STPP Information on status: patent application and granting procedure in general Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
STPP Information on status: patent application and granting procedure in general Free format text: FINAL REJECTION MAILED
STCB Information on status: application discontinuation Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION