NZ731808B2 - Reducing error in predicted genetic relationships - Google Patents
- Publication number
- NZ731808B2
- Authority
- NZ
- New Zealand
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6881—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for tissue or cell typing, e.g. human leukocyte antigen [HLA] probes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/048—Fuzzy inferencing
-
- G06N7/005—
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B10/00—ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/40—Population genetics; Linkage disequilibrium
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
Systems, computer program products, and methods are disclosed for estimating a degree of ancestral relatedness between two individuals. The haplotype data for a population of individuals is divided into segment windows based on genetic markers, and matched segments for the haplotype data are generated. Each matched segment having a first cM width that exceeds a threshold cM width is included in counting the matched segments in each segment window. A weight associated with each segment window is estimated based on the count of matched segments in the associated segment window. A weighted sum of per-window cM widths for each matched segment is calculated based on the first cM width and the weights associated with the segment windows of the matched segment. The weighted sums of per-window cM widths are used to estimate a degree of ancestral relatedness between two individuals.
Description
REDUCING ERROR IN PREDICTED GENETIC RELATIONSHIPS
CROSS REFERENCE TO RELATED APPLICATIONS
The application claims the benefit of U.S. Provisional Application No.
62/063,849, filed on October 14, 2014, the contents of which are incorporated herein by
reference.
BACKGROUND
1. FIELD
The disclosed embodiments relate to computer program products, systems, and
methods used to identify individuals in a population who are ancestrally related based on the
individuals’ genetic data.
2. DESCRIPTION OF THE RELATED ART
Although humans are, genetically speaking, almost entirely identical, small
differences in our DNA are responsible for much of the variation between individuals.
Stretches of DNA that are determined to be relevant for some purpose are referred to as
haplotypes. Haplotypes are identified based on consecutive single nucleotide polymorphisms
(SNPs) of varying length. Certain haplotypes shared by individuals suggest a familial
relationship between those individuals based on a principle known as identity-by-descent
(IBD).
Because identifying segments of IBD DNA between pairs of genotyped
individuals is useful in many applications, numerous methods have been developed to
perform IBD analysis (Purcell et al. 2007, Gusev A. et al., The Architecture of Long-Range
Haplotypes Shared within and across Populations, Mol. Biol. Evol., 29(2):473-86, 2012;
Browning S.R. and Browning B.L., Rapid and accurate haplotype phasing and missing-data
inference for whole-genome association studies by use of localized haplotype clustering,
American Journal of Human Genetics, 91:1084-96, 2007; Browning S.R. and Browning B.L.,
Identity by descent between distant relatives: detection and applications, Annu. Rev. Genet.,
46:617-33, 2012). However, these approaches do not scale for continuously growing very
large datasets. For example, the existing GERMLINE implementation is designed to take a
single input file containing all individuals to be compared against one another. While
appropriate for the case in which all samples are genotyped and analyzed simultaneously, this
approach is not practical when samples are collected incrementally. The program suite
GERMLINE (Gusev A. et al., Whole population, genome-wide mapping of hidden
relatedness, Genome Res., 19:318-26, 2009) offers an “ibs filter”, which removes highly
frequent matches (defined by chromosome, as well as the start and end position on the
chromosome). Like GERMLINE’s matched segment discovery approach, the “ibs filter”
approach is designed to be fast, and is relatively simplistic as a consequence. The more
accurate of these methods, such as Refined IBD, are much more accurate than the
GERMLINE “ibs filter”, but they do not scale computationally and would be difficult to
integrate into an analytical pipeline even if they did. There are many existing methods that
assess evidence for a matched segment not just by centimorgan width, as is done within
GERMLINE. Examples include Refined IBD, fast IBD, SLRP, and PARENTE. The
differences between these approaches reflect a tradeoff between model
complexity and computational speed (and feasibility).
SUMMARY
Methods, systems, and computer program products are disclosed for estimating a
degree of ancestral relatedness between two individuals. The computer program products
include TIMBER. To estimate the ancestral relatedness of two individuals, the methods include
receiving haplotype data from a population of individuals. The haplotype data includes a
plurality of genetic markers that are shared among the individuals in the population. The
haplotype data is then divided into segment windows based on the genetic markers. In some
embodiments, the genetic markers include single nucleotide polymorphisms (SNPs) and the
haplotype data is divided into K segment windows including an equal number d of SNPs. In
some embodiments, the haplotype data is divided into 4105 segment windows of 96 SNPs.
For each individual, the method includes matching segments of the haplotype data
that are identical between the individual and any other individual in the population, wherein
the matching is based on the genetic markers. Each matched segment has a first centimorgan
(cM) width that exceeds a threshold cM width. In some embodiments, the threshold cM width
is 5 cM. Each matched segment is part of one or more of the segment windows. The matched
segments in each segment window are then counted. The count of matched segments in a
segment window is also referred to as a per-window match count.
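The counting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `MatchedSegment` record, its field names, and the `per_window_match_counts` helper are hypothetical, and each segment is assumed to already carry the inclusive range of segment-window indices it spans.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedSegment:
    """Hypothetical record for one matched segment of a given individual."""
    other_id: str       # the individual matched against
    first_window: int   # first segment window spanned (inclusive)
    last_window: int    # last segment window spanned (inclusive)
    cm_width: float     # first cM width of the whole segment


def per_window_match_counts(segments, threshold_cm=5.0):
    """Count, for one individual, the matched segments covering each window.

    Only segments whose cM width exceeds threshold_cm are counted.
    """
    counts = Counter()
    for seg in segments:
        if seg.cm_width <= threshold_cm:
            continue  # does not exceed the threshold cM width
        for window in range(seg.first_window, seg.last_window + 1):
            counts[window] += 1
    return counts


segments = [
    MatchedSegment("B", 0, 2, 7.5),
    MatchedSegment("C", 2, 3, 6.0),
    MatchedSegment("D", 1, 1, 4.0),  # below the 5 cM threshold, ignored
]
counts = per_window_match_counts(segments)
```

Here window 2 is covered by two qualifying segments, so its per-window match count is 2, while windows 0, 1, and 3 each have a count of 1.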
For each individual, the method includes estimating a weight associated with each
segment window based on the count of matched segments in the associated segment window.
In some embodiments, the weight associated with a segment window is decreased as the
count of matched segments increases. The benefit of decreasing weights for increasing
per-window match counts includes reducing the effect of matched segments that are likely not
from the recent genealogical history (RGH) of the individuals, but rather from a more distant
common ancestry at the human, ethnicity, or sub-ethnicity level.
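To make the idea of a weight that decreases with the per-window match count concrete, the sketch below uses a simple stand-in function. The functional form, the `window_weight` name, and the `baseline` parameter are assumptions chosen only for illustration; the text above states only that the weight decreases as the count increases.

```python
def window_weight(match_count, baseline=10):
    """Hypothetical stand-in weight for one segment window.

    Returns 1.0 while the count is at most `baseline`, then decays as
    baseline / match_count, so busier windows contribute less.
    """
    if match_count < 0:
        raise ValueError("match_count must be non-negative")
    if match_count <= baseline:
        return 1.0
    return baseline / match_count


# Weights for a few example per-window match counts.
weights = {count: window_weight(count) for count in (0, 5, 10, 20, 1000)}
```

With these assumed settings, a window matched by 20 individuals gets half the weight of a quiet window, and a window matched by 1,000 individuals is almost entirely discounted.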
For each individual, the method includes calculating a weighted sum of per-window
cM widths for each matched segment based on the first cM width and the weights
associated with the segment windows of the matched segment. A degree of ancestral
relatedness between two individuals is estimated based on the weighted sum of per-window
cM widths of each matched segment between the two individuals. In some embodiments, the
degree equals the weighted sum of per-window cM widths. In some embodiments, the
weighted sum of per-window cM widths is the sum, over the segment windows of a matched
segment between the two individuals, of the first cM width in each segment window
multiplied by the two weights, one for each individual, associated with that segment window.
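The weighted sum described in the last sentence above can be sketched in a few lines. This is an illustrative reading of the text, not the disclosed code: the `weighted_cm_sum` helper, the dictionary layout, and the default weight of 1.0 for windows without an entry are assumptions.

```python
def weighted_cm_sum(per_window_cm, weights_a, weights_b):
    """Weighted sum of per-window cM widths for one matched segment.

    per_window_cm: {window index: cM width of the segment inside that window}
    weights_a, weights_b: {window index: weight} for each of the two individuals.
    Windows missing from a weight map are assumed to have weight 1.0.
    """
    return sum(
        cm * weights_a.get(window, 1.0) * weights_b.get(window, 1.0)
        for window, cm in per_window_cm.items()
    )


# A 6 cM matched segment spanning three segment windows.
per_window_cm = {10: 2.0, 11: 3.0, 12: 1.0}
weights_a = {10: 1.0, 11: 0.5, 12: 1.0}
weights_b = {10: 1.0, 11: 0.5, 12: 0.25}
score = weighted_cm_sum(per_window_cm, weights_a, weights_b)
```

In this example the unweighted width is 6 cM, while the weighted sum is 2.0 + 3.0 × 0.25 + 1.0 × 0.25 = 3.0 cM, because windows 11 and 12 are down-weighted for one or both individuals.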
[0008a] In a particular aspect, the present invention provides a method for estimating a
degree of ancestral relatedness corresponding to biological samples of two target individuals,
the method comprising:
genotyping the biological samples of the two target individuals;
extracting haplotype data of a population of individuals, the haplotype data including
a plurality of genetic markers shared among the individuals;
dividing the haplotype data into segment windows based on the genetic markers;
for each individual in the population:
based on the genetic markers, matching segments of the haplotype data that
are identical between the individual and any other individual in the
population, each matched segment having a first cM width exceeding a
threshold cM width and being part of one or more of the segment
windows;
counting the matched segments in each segment window;
estimating a weight associated with each segment window based on the count
of matched segments in the associated segment window;
calculating a weighted sum of per-window cM widths for each matched
segment based on the first cM width and the weights associated with
the segment windows of the matched segment; and
estimating a degree of ancestral relatedness between the two target individuals based
on the weighted sum of per-window cM widths of each matched segment
between the two target individuals;
wherein the degree of relatedness between the two target individuals comprises a
probability that the two target individuals are ancestrally related; and
outputting the probability indicative of the degree of relatedness of the two target
individuals that is determined based on analysis of the biological samples.
[0008b] In another particular aspect, the present invention provides a system for
estimating a degree of ancestral relatedness corresponding to biological samples of two target
individuals, the system comprising one or more processors configured to execute a set of
steps and at least one memory configured to store the set of steps, the set of steps comprising:
genotyping the biological samples of the two target individuals;
extracting haplotype data of a population of individuals, the haplotype data including
a plurality of genetic markers shared among the individuals;
dividing the haplotype data into segment windows based on the genetic markers;
for each individual in the population:
based on the genetic markers, matching segments of the haplotype data that
are identical between the individual and any other individual in the
population, each matched segment having a first cM width exceeding a
threshold cM width and being part of one or more of the segment
windows;
counting the matched segments in each segment window;
estimating a weight associated with each segment window based on the count
of matched segments in the associated segment window;
calculating a weighted sum of per-window cM widths for each matched
segment based on the first cM width and the weights associated with
the segment windows of the matched segment; and
estimating a degree of ancestral relatedness between two target individuals based on
the weighted sum of per-window cM widths of each matched segment
between the two target individuals;
wherein the degree of relatedness between two individuals comprises a probability
that the two individuals are ancestrally related; and
outputting the probability indicative of the degree of relatedness of the two target
individuals that is determined based on analysis of the biological samples.
[0008c] In a yet further particular aspect, the present invention provides a non-transitory
computer readable medium for storing computer code comprising instructions, the
instructions, when executed by one or more processors, cause the one or more processors to:
genotype biological samples of two target individuals;
extract haplotype data of a population of individuals, the haplotype data including a
plurality of genetic markers shared among the individuals;
divide the haplotype data into segment windows based on the genetic markers;
for each individual in the population:
based on the genetic markers, match segments of the haplotype data that are
identical between the individual and any other individual in the
population, each matched segment having a first cM width exceeding a
threshold cM width and being part of one or more of the segment
windows;
count the matched segments in each segment window;
estimate a weight associated with each segment window based on the count of
matched segments in the associated segment window;
calculate a weighted sum of per-window cM widths for each matched segment
based on the first cM width and the weights associated with the
segment windows of the matched segment; and
estimate a degree of ancestral relatedness between two target individuals based on the
weighted sum of per-window cM widths of each matched segment between
the two target individuals
wherein the degree of relatedness between two individuals comprises a probability
that the two individuals are ancestrally related; and
output the probability indicative of the degree of relatedness of the two target
individuals that is determined based on analysis of the biological samples.
In some embodiments, TIMBER, an ancestry prediction machine matching
genetic markers, is a procedure operating on a computer for refining each individual’s list of
matched segments and prioritizing the matched segments that are most likely to be from the
individuals’ recent genealogical history. TIMBER uses the matched segments to remove the
effect of “noisy” segment windows within the haplotype data that display an “excessive”
count of matched segments between numerous individuals. In some embodiments, a count is
excessive if the count is larger than 10 or 20. It is less likely that a matched segment is from
recent genealogical history if the matched segment is mainly part of “noisy” segment windows.
TIMBER calculates weights from the matched segment data and estimates a weighted sum of
per-window cM widths of a matched segment based on discounting “noisy” segment
windows. TIMBER is computationally efficient and scalable, allowing reevaluation of an entire
population of individuals each time new individuals are added to the population.
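As a rough illustration of what “noisy” segment windows mean in practice, the sketch below flags windows whose match count is excessive and measures how much of a matched segment lies in such windows. The helper names and the example counts are hypothetical; the cutoff of 20 is one of the two example values mentioned above.

```python
def noisy_windows(match_counts, excessive_count=20):
    """Return the sorted indices of windows whose match count exceeds the cutoff."""
    return sorted(w for w, c in match_counts.items() if c > excessive_count)


def noisy_fraction(first_window, last_window, noisy):
    """Fraction of the windows spanned by a segment (inclusive range) that are noisy."""
    spanned = range(first_window, last_window + 1)
    noisy_set = set(noisy)
    return sum(1 for w in spanned if w in noisy_set) / len(spanned)


# Hypothetical per-window match counts for one individual.
match_counts = {0: 3, 1: 25, 2: 140, 3: 8, 4: 21}
noisy = noisy_windows(match_counts)
fraction = noisy_fraction(1, 4, noisy)
```

A segment spanning windows 1 through 4 lies mostly in noisy windows here (three of four), which is the situation in which a match is less likely to reflect recent genealogical history.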
BRIEF DESCRIPTION OF THE DRAWINGS
Figure (Fig.) 1A illustrates a flowchart of a method for estimating a degree of
ancestral relatedness between two individuals, according to some embodiments.
Fig. 1B is a block diagram of a computing environment for estimating a degree of
ancestral relatedness between two individuals, according to one embodiment.
Fig. 2 illustrates an example of per-window match counts on a chunk of the
genome for one individual, according to some embodiments.
Fig. 3 illustrates an example of the histogram of per-window match counts for all
of the windows in the genome for one individual, according to some embodiments.
Fig. 4 is an example of the histogram of per-window match count for all the non-
zero count windows where the maximum viewable per-window count is 40, according to
some embodiments.
Fig. 5 is an example of the histogram of per-window match counts for all the non-
zero and also low match count windows, according to some embodiments.
Fig. 6 is an example of the estimated per-window weight (in the y-axis) as a
function of the possible per-window match count (in the x-axis), according to some
embodiments.
Fig. 7 is an example of the estimated per-window weight (in the y-axis) as a
function of the possible per-window match count (in the x-axis), according to some
embodiments.
Fig. 8 illustrates an example for the weights (solid line) given the per-window
counts (dashed line) from the original example, according to some embodiments.
Fig. 9 is an example of per-window match counts on a chunk of the genome for
one individual both pre-TIMBER (dashed line) and post-TIMBER (solid line), according to
some embodiments.
Fig. 10 illustrates results from TIMBER using different unweighted cM width
scores, i.e., first cM width filters, and weighted cM sum scores, showing the matched
percentages of segments kept for the known and unknown meiosis, according to some
embodiments.
The figures depict an embodiment for purposes of illustration only. One skilled in
the art will readily recognize from the following description that alternative embodiments of
the structures and methods illustrated herein may be employed without departing from the
principles described herein.
DETAILED DESCRIPTION
I. OVERVIEW
Methods, systems, and computer program products are disclosed for estimating a
degree of ancestral relatedness between two individuals. Estimating the ancestral relatedness
of individuals includes identifying and scoring identical-by-descent (IBD) matched segments
among the haplotype data of these individuals. To identify IBD segments, the method
compares genetic markers among the individuals’ haplotypes. In some embodiments, genetic
markers include single-nucleotide polymorphisms (SNPs). Segments from two individuals
are considered identical by state (IBS) if the genetic markers along the individuals’ haplotype
sequences in these segments are identical at the same loci along the haplotypes. Throughout
the disclosure unless otherwise stated, “matched haplotype segments” or “matched segments”
refer to identical haplotype segments shared between two or more individuals. Generally, an
IBS segment shared between two individuals is identical by descent (IBD) if the individuals
inherited the IBS segment from a common ancestor, having the same ancestral origin. Thus,
any IBD segment by definition also represents an IBS segment, while the reverse is typically
not true, i.e., an IBS segment might not represent an IBD segment. Moreover, many IBD
segments are not from the recent genealogical history (RGH) of individuals, but rather from a
more distant common ancestry at the human, ethnicity, or sub-ethnicity level. The disclosed
method allows for prioritizing matched segments that are more likely to be from the
individuals’ RGH over those segments that are from a more distant common ancestry, thus
belonging to their distant ancestral past, i.e., non-recent genealogical history (non-RGH).
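The identical-by-state definition above amounts to an allele-by-allele comparison over the same loci. The following sketch illustrates that check; the string encoding of haplotypes and the `identical_by_state` helper are assumptions made for the example, and passing this check says nothing about whether the segment is also identical by descent.

```python
def identical_by_state(hap_a, hap_b, start, end):
    """True if two haplotypes carry identical alleles at every locus in [start, end).

    hap_a and hap_b are allele sequences indexed by the same ordered loci.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same loci")
    return hap_a[start:end] == hap_b[start:end]


# Two toy haplotypes that differ only at locus 4.
hap_a = "ACGTTGCA"
hap_b = "ACGTAGCA"
ibs_first_four = identical_by_state(hap_a, hap_b, 0, 4)
ibs_first_five = identical_by_state(hap_a, hap_b, 0, 5)
```

The segment over loci 0 to 3 is IBS, while extending it by one locus breaks the match.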
Fig. 1A is a flowchart illustrating a method 100 for estimating a degree of
ancestral relatedness between two individuals, according to some embodiments. The method
allows the user to input segments that are categorized as matched or discovered, i.e., having a
first centimorgan (cM) width of over 5 cM. The method in the form of the TIMBER program
then uses those matched segments to calculate a weighted sum of per-window cM widths.
The weighted sum of per-window cM widths takes into account the count of matched
segments to other individuals of the population in the segment windows, down-weighting
segment windows that display a high degree of matched segments to many individuals.
In some embodiments, the inputs to the TIMBER program are the pairwise matched
segments between all individuals in a population stored in a database. The pairwise matched
segments are translated into weights for each individual’s matchable segment windows of the
genome by the TIMBER program. The TIMBER program then uses the weights to
re-calibrate or re-score the original pairwise matched segments.
In some embodiments, the TIMBER programs are stored in memory and configured
to be executed by one or more processors of a computing device. The TIMBER programs
include instructions that, when executed by the computing device, cause the device to:
1. calculate the counts of matched segments in each segment window of the
haplotype data, for each individual of the population, where the matched
segments are between the individual and every other person in the database,
2. calculate weights for each individual and segment window, and
3. calculate a weighted sum of per-window cM widths for each matched segment
between two individuals based on the weights.
The method 100 is performed at a computing device, which may be controlled by
specially programmed code (computer programming instructions) contained, for example,
in the TIMBER program, wherein such specially programmed code is or is not natively
present in the computing device. Embodiments of the computing device include, but are not
limited to, general-purpose computers, e.g., a desktop computer, a laptop computer,
computing servers, tablets, mobile devices, or any similar computing devices. Once
programmed to execute the methods described here, such a computing device becomes a
special-purpose computer. Some embodiments of the method 100 may include fewer,
additional, or different steps than those shown in Fig. 1A, and the steps may be performed in
different orders. The steps of the method 100 are described with respect to example
haplotype data illustrated in Figures (Figs.) 2 through 9.
Fig. 1B is a block diagram of an environment for using a computer system 120 to
estimate a degree of ancestral relatedness between two individuals, according to some
embodiments. Depicted in Fig. 1B are individuals 122 (i.e. a human or other organism), a
DNA extraction service 124, and a DNA quality control (QC) and matching preparation
service 126.
Individuals 122 provide DNA samples for analysis of their genetic data. In some
embodiments, an individual uses a sample collection kit to provide a sample
from which genetic data can be reliably extracted according to conventional methods. DNA
extraction service 124 receives the sample and genotypes the genetic data, for example by
extracting the DNA from the sample and identifying values of SNPs present within the DNA.
The result is a diploid genotype. DNA QC and matching preparation service 126 assesses
data quality of the diploid genotype by checking various attributes such as genotyping call
rate, genotyping heterozygosity rate, and agreement between genetic and self-reported
gender. System 120 receives 102 the haplotype data from DNA extraction service 124 and
optionally stores the haplotype data in a database 128 containing unphased DNA diploid
genotypes, phased haplotypes, and other genomic data. Unless otherwise stated, haplotype
data refers to any genetic or genome data obtained from the individuals 122, which is
optionally stored in database 128.
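Two of the quality attributes named above, genotyping call rate and heterozygosity rate, can be computed as simple proportions. The sketch below assumes a hypothetical encoding in which each SNP genotype is 0, 1, or 2 copies of an alternate allele and `None` marks a missing call; neither the encoding nor any acceptance cutoff is specified in the text.

```python
def call_rate(genotypes):
    """Fraction of SNPs with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def heterozygosity_rate(genotypes):
    """Fraction of called SNPs that are heterozygous (encoded as 1)."""
    called = [g for g in genotypes if g is not None]
    return sum(g == 1 for g in called) / len(called)


# Toy diploid genotype: eight SNPs, one missing call, three heterozygous calls.
genotypes = [0, 1, 2, None, 1, 0, 1, 2]
rate = call_rate(genotypes)
het = heterozygosity_rate(genotypes)
```

For this toy genotype the call rate is 7/8 and the heterozygosity rate is 3/7; a QC service would compare such values against its own cutoffs.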
In some embodiments, the partitioning module 130 divides 104 the haplotype data
into segment windows based on the genetic markers. In some embodiments, the matching
module 132 matches 106 segments of the haplotype data that are identical between the
individual and any other individual in the population, where each matched segment has a first
cM width that exceeds a threshold cM width and is part of one or more of the segment
windows.
In some embodiments, the count/weight estimation module 134 counts 108 the
matched segments in each segment window and estimates 110 a weight associated with each
segment window based on the count of matched segments in the associated segment window.
The scoring module 136 then calculates 112 a weighted sum of per-window cM
widths for each matched segment based on the first cM width and the weights associated with
the segment windows of the matched segment. In some embodiments, the scoring module
136 estimates 114 a degree of ancestral relatedness between two individuals based on the
weighted sum of per-window cM widths of each matched segment between the two
individuals.
II. EMBODIMENTS OF THE ANCESTRAL RELATIONSHIP PREDICTION PROGRAMS
In some embodiments, matched segments among a population of individuals are
generated based on the individuals’ haplotype data. In some embodiments, the matched
segments are stored in a database for later retrieval by the TIMBER program. The TIMBER
program is configured to receive all the matched segments among a population of individuals
and prioritize those matched segments that are from the individuals’ recent genealogical
history (RGH).
II.A. Match Haplotype Segments
The method 100 includes receiving 102 haplotype data for a population of
individuals, the haplotype data including a plurality of genetic markers shared among the
individuals, according to some embodiments. In some embodiments, to identify (match) and
score IBD segments among the haplotype data, the method 100 uses a HADOOP®
reimplementation of a matching algorithm. The method benefits from being computationally
fast and scalable for large-sized populations. In some embodiments, the matched IBS
segments based on the haplotype data of individuals include the individuals’ RGH and non-
RGH segments. The matched IBS segments are generally referred to as matched segments. In
some embodiments, matched segments are generated using methods that are well known in
the art. Using, for example, the TIMBER program, the method 100 then prioritizes the
matched segments into segments that are more likely from the RGH of two individuals
than from their non-RGH by calculating a TIMBER score that quantifies the
likelihood of two individuals sharing a common recent ancestral relationship.
The method for determining the TIMBER score is based on the assumption that
the locations (loci) of matched segments from an individual’s RGH are evenly distributed
across an individual’s genome. For example, by dividing the SNPs of the chromosomes
across the individuals’ genome into discrete windows, the counts of the segments in a
specific window on chromosome 1 are independent of the counts of the segments in a
window on chromosome 14. Consequently, matches between segments of individuals in a
window on chromosome 1 are independent from matches to segments in the
window on chromosome 14, resulting in an even distribution across all windows.
Furthermore, matched segments that do not originate from an individual’s RGH
exhibit spikes at certain windows of the genome while being evenly distributed across the
remaining windows. In some instances, these spikes can be attributed to particular reasons for
the genetic variation in these windows, e.g., at a person level or at the database level. At the
person level, the spikes may result from the segments of a window displaying a high level of
sequence similarity across a particular ethnicity group, while at a database level, the
individuals of the population in the database may not possess any SNP variations across a
particular window based on how the population was selected. In particular, it is unclear
whether the distribution of the counts of matched segments in a window originates from an
individual’s RGH, and whether factors that confound local spikes of matching are due to
unknown non-RGH reasons, which are difficult to model. To overcome the problems
associated with these spikes, the method introduces weights for each window for rescoring all
matched segments, in particular, the ones that contribute to the spikes in windows. In some
instances, matched segments occur at short and specific locations of an individual’s
genome while matching a very large number of other individuals, e.g., more than 1,000
individuals.
The method 100 includes dividing 104 the haplotype data into segment windows
based on the genetic markers, according to some embodiments. The haplotype data includes,
but is not limited to, SNPs observed across the individual’s genome. In some embodiments,
the haplotype data includes the haplotype data of an individual’s entire or partial genome. In
some embodiments, the method 100 divides the observed SNPs into K windows of equal size
d, with each window, for example, including 96 SNPs. Other examples for window sizes
include 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150 or any number that falls within the
range of 50 to 500. In some embodiments, the size of each window, i.e., the number of SNPs
per window, varies across the windows.
For example, some windows may include only 50 SNPs depending on the
sequence length of these SNPs, while other windows include 96 SNPs. In some
embodiments, the windows include other genetic markers besides SNPs that are used to
identify matched segments. These markers include, but are not limited to, restriction fragment
length polymorphisms, simple sequence length polymorphisms, amplified fragment length
polymorphisms, random amplification of polymorphic DNA, variable number tandem repeats,
simple sequence repeat microsatellite polymorphisms, short tandem repeats, single feature
polymorphisms, restriction site associated DNA markers, and the like.
In some embodiments, the method 100 includes using phased haplotype data, i.e.,
data for which the phase has been estimated, as input to identify matched segments. For this,
the method uses the haplotype data for a population of n individuals. In some embodiments,
the input of the haplotype data is represented as a $2n \times s$ matrix $H$ with rows corresponding to
$2n$ haplotypes and columns to $s$ SNPs. By vertically slicing $H$ into non-overlapping
submatrices $H_i$ of equal width $d$ SNPs, each submatrix $H_i$ then represents a different segment
window $i$, where $i = 0 \ldots K$ and $s = d \cdot K$. In some embodiments, the haplotype data
includes n implicit non-phased haplotypes of the population, using the population’s genotypes
to determine possible haplotype matches without explicit haplotype matching that would
require the phase of the haplotypes to be known. Haplotype matching therefore refers to
implied as well as explicit haplotype matching, where the former is based on non-phased
genomic data of a population.
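The slicing of H into equal-width segment windows can be sketched as follows. This is only an illustration: the function names and the list-of-rows representation of H are assumptions made here, not part of the described method.

```python
def window_bounds(num_snps, window_size=96):
    """Return half-open (start, end) column ranges, one per segment window."""
    if window_size <= 0 or num_snps % window_size != 0:
        raise ValueError("num_snps must be a positive multiple of window_size")
    return [(start, start + window_size)
            for start in range(0, num_snps, window_size)]


def slice_haplotypes(H, window_size=96):
    """Split the 2n x s matrix H (a list of rows) into submatrices H_i of d columns."""
    bounds = window_bounds(len(H[0]), window_size)
    # Each submatrix keeps all 2n haplotype rows but only the SNP columns of one window.
    return [[row[a:b] for row in H] for a, b in bounds]


# Toy example: 2 haplotypes, s = 4 SNPs, d = 2 SNPs per window, hence 2 windows.
H = [[0, 1, 1, 0],
     [1, 1, 0, 0]]
windows = slice_haplotypes(H, window_size=2)
```

Embodiments with windows of varying size would need an explicit list of window boundaries instead of a single `window_size`.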
The method 100 includes, for each individual in the population, based on the
genetic markers, matching 106 segments of the haplotype data that are identical between the
individual and any other individual in the population, according to some embodiments. In
some embodiments, each matched segment has a first cM width that exceeds a threshold
cM width and is part of one or more of the segment windows. The matching 106 includes
identifying segment windows of exact haplotype matching between two individuals in the
population. Windows of exact haplotype matching are used to anchor the identification of the
entire matched segment, which in some instances extends beyond the initial exact window
match. In some embodiments, the method 100 includes extending the exact window match
until two homozygous mismatching SNPs are observed on either side of the original exact
window match. As a result, the method 100 determines the segment width by determining
the minimum and maximum of the start and end locations of the windows with no
homozygous mismatches and extending the exact window match.
If the determined segment width exceeds a threshold cM
width, the method 100 identifies the corresponding segment as a matched segment. In some
embodiments, the threshold cM width is 5 cM. In some embodiments, the threshold cM width
is 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 cM or any value larger than 5 cM. In some
embodiments, the method 100 matches 106 segments of individuals stored in a database of
individuals that includes the haplotype data of each individual’s genome.
II.B. Per-window Match Count of Matched Segments
The method 100 includes, for each individual in the population, counting 108 the
matched segments in each segment window, according to some embodiments. For each
individual in the population and every window, the method 100 determines a per-window
match count of matched segments. A particular per-window match count $k_i$ refers to the
number of matched segments identified within the population, with $i$ indicating the window of
the count. In some embodiments, since every matched segment between two individuals
spans a number of windows, the method 100 translates matched segments of all individuals
into the window identifiers and counts how many times each window is part of a matched
segment. Matched segments of close relatives are not included in the per-window match
count, which increases the likelihood that all matched segments share similar levels of
uncertainty of whether or not the matched segment is part of the individuals’ RGH. In some
embodiments, a close relative is defined as an individual with whom the related individual
has a raw TIMBER score that equals or exceeds a pre-defined close relative threshold. In
some embodiments, the pre-defined close relative threshold is 50 cM. In some embodiments,
the pre-defined close relative threshold is 30, 40, 50, 60, 70, 80, 90, 100 cM or any value
from the range of 30 to 200 cM. A raw TIMBER score between two individuals is defined as
the sum of the first cM widths of matched segments over all matched segments between the
two individuals. The score is referred to as a “raw” score, since the cM widths used are the
first cM widths that are not down-weighted.
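The raw TIMBER score and the close-relative test described above amount to a sum and a comparison. A minimal sketch follows, in which the function names are hypothetical and the 50 cM default is taken from one of the embodiments:

```python
def raw_timber_score(first_cm_widths):
    """Sum of the (unweighted) first cM widths of all matched segments of a pair."""
    return sum(first_cm_widths)


def is_close_relative(first_cm_widths, close_relative_threshold_cm=50.0):
    """A pair is treated as close relatives if the raw score meets the threshold."""
    return raw_timber_score(first_cm_widths) >= close_relative_threshold_cm


# Two matched segments of 30 cM and 25 cM give a raw score of 55 cM.
pair_is_close = is_close_relative([30.0, 25.0])
```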
In some embodiments, the method of determining the per-window match count for
each individual includes the following steps:
1. initializing a per-window match count vector $\{k_i\}_{i=0 \ldots K}$ to zero counts (one value $k_i$
for every matching window in the genome, e.g., with K equal to 4105 windows and
each window including 96 SNP markers);
2. for every matched segment:
(a) skipping the matched segment for close relatives and moving to the next matched
segment,
(b) translating the matched segment into a vector of window indices $\{i\}$ (that the
matched segment spans), and
(c) incrementing the per-window match count vector entries $\{k_i\}$ by one for the
respective values of the vector of window indices $\{i\}$; and
3. removing any entries having a matched segment count of zero from the per-window
match count vector, leaving $\{k_i : k_i > 0\}_{i=0 \ldots K}$.
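The three counting steps can be sketched as below. The representation of a matched segment as a tuple `(other_id, first_window, last_window)` and the function name are assumptions made for illustration only.

```python
def per_window_match_counts(matched_segments, num_windows, close_relatives=frozenset()):
    """Count, per window, how many matched segments of one individual span it.

    matched_segments: iterable of (other_id, first_window, last_window),
    with inclusive window indices (hypothetical representation).
    """
    counts = [0] * num_windows                      # step 1: all counts start at zero
    for other_id, first_window, last_window in matched_segments:
        if other_id in close_relatives:             # step 2(a): skip close relatives
            continue
        for i in range(first_window, last_window + 1):   # step 2(b): spanned windows
            counts[i] += 1                          # step 2(c): increment the count
    # step 3: keep only windows with at least one matched segment
    return {i: k for i, k in enumerate(counts) if k > 0}


segments = [("B", 0, 2), ("C", 1, 3), ("D", 2, 2)]
counts = per_window_match_counts(segments, num_windows=5, close_relatives={"D"})
```

Returning a dictionary keyed by window index keeps the link between a non-zero count and its window after the zero-count entries are dropped.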
Fig. 2 illustrates an example of per-window match counts on a portion of the
genome of a particular individual in the population. The x-axis shows the number of a
particular window $i$ along the haplotype sequence, and the y-axis represents the total number
of matched segments $k_i$ for a particular window $i$. In this example, the displayed windows
range from number 1 to number 145. Fig. 2 displays windows with zero matched segments,
although these windows are excluded from the per-window match count vector
$\{k_i : k_i > 0\}_{i=0 \ldots K}$. The largest match counts are observed for window number 1 and number 2 in this
example, reaching close to 60 matched segments.
Fig. 3 illustrates an example of the histogram of the per-window match counts
$\{c : c = 0 \ldots C_{max},\ C_{max} \le n\}$ for all the windows in the entire genome for an individual,
which only shows a portion of the individual’s genome. The histogram indicates the
frequency of observing matched segments against the entire population n in each genomic
window, wherein n therefore limits the maximal count per window $C_{max}$. Only windows
having fewer than 305 matched segments are displayed, with the count of windows rapidly
decreasing for increasing per-window match count values. In this example, windows
including zero matched segments are not shown, since these windows are not further
analyzed in the method 100 of calculating the TIMBER score.
As shown in Fig. 2, certain segment windows or regions within the individual’s
genome display a very large count of matched segments, whether due to a high level of
non-RGH matched segments for an unknown reason or due to a truly high distribution of matched
RGH segments in these windows. The method 100 attempts to differentiate between these
two possibilities by evaluating the per-window match count for every individual in the
population. Furthermore, the method 100 determines a TIMBER score for each matched
segment by down weighting matching windows of a matched segment that have a decreased
likelihood to originate from the RGH between the two matched individuals. This down
weighting, in some embodiments, includes determining a weight for each individual for a
matched segment in a window.
II.C. Estimating Weights $\{w_i^A\}$
In some embodiments, the method 100 includes estimating 110 a weight
associated with each segment window based on the count of matched segments in the
associated segment window. In some embodiments, the
estimating 110 includes determining for each individual in the population a weight for each
window that counts at least one matched segment. In some embodiments, the weight is
approximated by the probability that matching in that window for that individual provides
evidence for RGH. This probability is related to the count of matched segments in a window.
A window with an extremely high matched segment count for an individual is very unlikely
to be due to the RGH that the individual shares with other individuals in the population. Unknown
factors other than RGH may account for a very high matched segment count as described
above.
To estimate the weights, the method 100 determines the probability of
RGH, $\mathrm{Prob}(\mathrm{RGH} \mid C = c)$, given the measured count $c$ of matched segments in a window.
The random variable $C$ represents all possible counts of matched segments in a window and
is assumed to be identically distributed across all windows. By measuring the actual counts $c$ in
particular windows, the method 100 determines the probability of RGH on the condition that
$C$ equals $c$ for this window. To determine $\mathrm{Prob}(\mathrm{RGH} \mid C = c)$, the method 100 uses Bayes’
theorem, which provides:
$$\mathrm{Prob}(\mathrm{RGH} \mid C = c) = \frac{\mathrm{Prob}(C = c \mid \mathrm{RGH})\,\mathrm{Prob}(\mathrm{RGH})}{\mathrm{Prob}(C = c)} \tag{1}$$
where $\mathrm{Prob}(C = c \mid \mathrm{RGH})$ is the probability of having $c$ matched segments in the window when
all matched segments are due to the individuals’ RGH, $\mathrm{Prob}(C = c)$ is the probability of
having $c$ matched segments in the window regardless of the matched segments being from
RGH or non-RGH, and $\mathrm{Prob}(\mathrm{RGH})$ is the probability that the matched segment is from the
individuals’ RGH. Based on Bayes’ theorem, the method 100 determines estimates of
$\mathrm{Prob}(C = c \mid \mathrm{RGH})$, $\mathrm{Prob}(\mathrm{RGH})$, and $\mathrm{Prob}(C = c)$, wherein $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$,
$\widehat{\mathrm{Prob}}(\mathrm{RGH})$, and $\widehat{\mathrm{Prob}}(C = c)$ are the estimates of $\mathrm{Prob}(C = c \mid \mathrm{RGH})$, $\mathrm{Prob}(\mathrm{RGH})$ and
$\mathrm{Prob}(C = c)$, respectively. The method 100 then determines the weight $w_i^A$ for an individual
A in a specific window $i$ according to:
$$w_i^A = \frac{\widehat{\mathrm{Prob}}(C = k_i \mid \mathrm{RGH})\,\widehat{\mathrm{Prob}}(\mathrm{RGH})}{\widehat{\mathrm{Prob}}(C = k_i)} \tag{2}$$
In some embodiments, since it is difficult to estimate $\mathrm{Prob}(C = c \mid \mathrm{RGH})$, the
method 100 generates at least two slightly different estimates of $\mathrm{Prob}(C = c \mid \mathrm{RGH})$ and then
selects the estimate of the at least two estimates that results in the greatest down weighting
of $C$ for determining the weight $w_i^A$.
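Equation (2) is a direct application of Bayes' theorem to three estimated probabilities. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def bayes_weight(p_count_given_rgh, p_rgh, p_count):
    """Weight for one window per equation (2): Prob(C=k|RGH) * Prob(RGH) / Prob(C=k)."""
    if p_count <= 0.0:
        # The ratio is undefined when the estimated Prob(C = k) is zero.
        raise ValueError("p_count must be greater than zero")
    return p_count_given_rgh * p_rgh / p_count


# Made-up estimates for a single window, for illustration only.
w = bayes_weight(p_count_given_rgh=0.1, p_rgh=0.5, p_count=0.2)
```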
II.C.1. Determine $\widehat{\mathrm{Prob}}(C = c)$ Estimate
In some embodiments, the method 100 includes determining an estimate of the
probability distribution of $C$, $\widehat{\mathrm{Prob}}(C = c)$, which provides the likelihood that a window has
a given matched segment count $c$ for all the possible counts. These embodiments provide a
more accurate prediction of the variability in the number of people that an individual
matches in one window based on the probability of matching each individual in the
population. For example, the value of $\widehat{\mathrm{Prob}}(C = 20)$ is the probability of counting 20
matched segments in a given window for the entire haplotype data, including all matched
segments of the population. $\widehat{\mathrm{Prob}}(C = c)$ includes contributions from matched segments that
are from RGH and non-RGH of the population’s individuals. Fig. 4 illustrates $\widehat{\mathrm{Prob}}(C = c)$
of one individual of the population based on the individual’s entire genome. Given
$\widehat{\mathrm{Prob}}(C = c)$ and $\mathrm{Prob}(C = c \mid \mathrm{RGH})$, the method 100 is able to quantify the likelihood that a
window conveys information about RGH for a given count $c$. In some embodiments, both
distributions estimate the distribution of counts for counts that are greater than zero, i.e., windows
that have at least one matched segment within them.
Figs. 3 and 4 illustrate $\widehat{\mathrm{Prob}}(C = c)$ in the form of a histogram of per-window match
counts for all non-zero count windows, without counting any windows that include zero
segments. In some embodiments, only windows with at least one matched segment are counted for at least
the following two reasons: 1) different biological or observational processes are the likely cause
for discovering a matched segment within a window as compared to the number of
discovered matched segments within that window for a given population; and 2) the weighting
of windows should be based on matched segments that are actually present in and not missing
from a population. Thus, these embodiments avoid effectively assigning a weight of one to
windows with zero discovered matched segments. In comparison, Fig. 2 illustrates values of
the counts for a specific window (and not the frequency with which that count is observed
across all the non-zero windows).
To estimate the likelihood that a particular count is observed in
a window, i.e., $\mathrm{Prob}(C = c)$, the method 100 fits a distribution to the observed counts in all the
non-zero windows, illustrated as a histogram in Fig. 4.
In some embodiments, the method 100 uses a first beta-binomial distribution to
approximate the distribution of the per-window counts, $\widehat{\mathrm{Prob}}(C = c)$, which represents the
probability that each individual matches another individual in the population. The advantage
of using a beta-binomial distribution is that it is able to account for the underlying
heterogeneity in the probability of matching individuals without identifying the reason for the
heterogeneity. As illustrated in Fig. 4, the beta-binomial distribution provides a good fit to
per-window counts in an individual. In comparison, if all individuals in the population are
assumed to match with an equal chance to any other individuals in the population, the
binomial distribution provides a model of the observed counts in all the non-zero windows. A
binomial distribution would typically be used to model the probability of a number of events
being successful, if the known probability of success is shared across all the independent
events.
Fig. 4 further illustrates an example of fitting a binomial distribution to the
per-window counts in an individual as compared to a beta-binomial distribution. Generally, the
beta-binomial distribution is preferred for modeling $\widehat{\mathrm{Prob}}(C = c)$. In particular, Fig. 4
illustrates an example of the histogram of per-window match counts for all the non-zero count
windows where the maximum displayed per-window match count is 40. The binomial
distribution modeling $\widehat{\mathrm{Prob}}(C = c)$ is shown as a dashed line, with the solid line indicating
the fit of the beta-binomial distribution to the histogram data.
In some embodiments, the method 100 determines two parameters $\alpha$ and $\beta$ of the
beta-binomial distribution to determine the optimal fit between the beta-binomial distribution
and the per-window match count of matched segments for one individual based on the entire
population of individuals. Since the beta-binomial distribution is defined for counts from zero to
a maximal count n that equals the population size, the method 100 uses a modified
per-window match count vector that is obtained by subtracting one from each element of
$\{k_i : k_i > 0\}_{i=0 \ldots K}$. The number of observations used to determine the beta-binomial
distribution equals the number of windows K across the haplotype data. For example, the
haplotype data is divided into 4105 windows, each window including 96 markers. The joint
beta-binomial distribution $f(\{k_i\} \mid n, \alpha, \beta, k_i > 0)$ is given by:
$$f(\{k_i\} \mid n, \alpha, \beta, k_i > 0) = \prod_{i=1}^{K} \binom{n}{k_i - 1} \frac{B(k_i - 1 + \alpha,\ n - k_i + 1 + \beta)}{B(\alpha, \beta)} \tag{3}$$
where $\{k_i : k_i > 0\}_{i=0 \ldots K}$ is the vector of per-window counts (for the K windows with at least
one matched segment), n is the population size, $\alpha$ and $\beta$ are parameters of the distribution,
and $B$ is the beta function. $\mathrm{Prob}(C = c)$ for a population size n is then given by:
$$\mathrm{Prob}(C = c \mid C > 0) = \binom{n}{c} \frac{B(c + \alpha,\ n - c + \beta)}{B(\alpha, \beta)} \tag{4}$$
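The beta-binomial probability in the last equation above can be evaluated numerically with log-gamma functions, which avoids overflow for large population sizes n. The sketch below is illustrative only; the function names are assumptions, and the argument `c` is used exactly as the count appears in that equation.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_binom(n, k):
    """Logarithm of the binomial coefficient 'n choose k'."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_pmf(c, n, alpha, beta):
    """binom(n, c) * B(c + alpha, n - c + beta) / B(alpha, beta)."""
    if c < 0 or c > n:
        return 0.0
    return exp(log_binom(n, c)
               + log_beta(c + alpha, n - c + beta)
               - log_beta(alpha, beta))


# With alpha = beta = 1 the distribution is uniform over 0..n.
p = beta_binomial_pmf(3, 10, 1.0, 1.0)
```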
The method 100 uses the haplotype data of all individuals who match an
individual to determine the parameters $\alpha$ and $\beta$ of the distribution associated with this
individual. This determination is based in part on the assumption that the count of each window
is independent from the count of any other window. For typical examples, this assumption
provides a good estimation of actual data. In some instances, this assumption might not hold
true, and other distribution functions can provide better approximations to the data.
In some embodiments, the method 100 uses maximum-likelihood estimation to
determine the two parameters $\alpha$ and $\beta$ of the joint beta-binomial distribution $f(\{k_i\} \mid n, \alpha, \beta)$.
The observed per-window match count vector $\{k_i : k_i > 0\}_{i=0 \ldots K}$ represents at most K fixed
parameters of the likelihood function, $L(\alpha, \beta \mid \{k_i\}, n) = f(\{k_i\} \mid n, \alpha, \beta)$. The parameters
$\alpha$ and $\beta$ are real numbers larger than zero that maximize the average logarithm of the
likelihood function.
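The objective maximized in this step, the average logarithm of the likelihood, can be sketched as below. The function names are hypothetical; the counts are shifted by one as described for the modified per-window match count vector.

```python
from math import lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_binom(n, k):
    """Logarithm of the binomial coefficient 'n choose k'."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def mean_log_likelihood(nonzero_counts, n, alpha, beta):
    """Average log-likelihood of the shifted counts k_i - 1 under a beta-binomial."""
    if alpha <= 0.0 or beta <= 0.0:
        return float("-inf")          # parameters must be larger than zero
    total = 0.0
    for k in nonzero_counts:
        x = k - 1                     # modified count: subtract one from each k_i
        total += (log_binom(n, x)
                  + log_beta(x + alpha, n - x + beta)
                  - log_beta(alpha, beta))
    return total / len(nonzero_counts)


value = mean_log_likelihood([1, 2, 3], n=10, alpha=1.0, beta=1.0)
```

A general-purpose optimizer can then minimize the negative of this function over (alpha, beta); for instance, SciPy's `scipy.optimize.minimize` accepts `method="Nelder-Mead"`.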
In some embodiments, the method 100 uses an optimization method for
estimating the maximum-likelihood $L(\alpha, \beta \mid \{k_i\}, n)$ of the joint beta-binomial distribution for
given $\{k_i\}$ and n. Various optimization methods are well-known in the art, each of which can
be used for the maximum-likelihood estimation. In some embodiments, the method 100
uses the “Nelder-Mead” simplex optimization (Nelder J.A. and Mead R., A simplex
algorithm for Function Minimization, Computer J., 7:308-13, 1965) algorithm as the
optimization method. The parameters $\alpha$ and $\beta$ are initially set to starting values denoted
by $\tilde{\alpha}$ and $\tilde{\beta}$ provided by:
$$\tilde{\alpha} = \frac{n E_1 - E_2}{n\left(\frac{E_2}{E_1} - E_1 - 1\right) + E_1} \tag{5}$$
$$\tilde{\beta} = \frac{(n - E_1)\left(n - \frac{E_2}{E_1}\right)}{n\left(\frac{E_2}{E_1} - E_1 - 1\right) + E_1} \tag{6}$$
where $E_1 = \frac{1}{K}\sum_{i=1}^{K}(k_i - 1)$ and $E_2 = \frac{1}{K}\sum_{i=1}^{K}(k_i - 1)^2$.
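The starting values in equations (5) and (6) are moment-based: they use the first and second sample moments of the shifted counts. A small sketch, with a hypothetical function name and error handling added here as an assumption:

```python
def starting_values(nonzero_counts, n):
    """Starting values for the optimizer from the sample moments E1 and E2."""
    shifted = [k - 1 for k in nonzero_counts]      # modified counts k_i - 1
    num = len(shifted)
    e1 = sum(shifted) / num                        # first sample moment
    e2 = sum(x * x for x in shifted) / num         # second sample moment
    if e1 <= 0:
        raise ValueError("all counts equal one; moments are degenerate")
    denom = n * (e2 / e1 - e1 - 1) + e1
    if denom == 0:
        raise ValueError("degenerate moments; no starting values available")
    alpha0 = (n * e1 - e2) / denom
    beta0 = (n - e1) * (n - e2 / e1) / denom
    return alpha0, beta0


# Shifted counts 0..10 are uniform, which corresponds to alpha = beta = 1 for n = 10.
alpha0, beta0 = starting_values(range(1, 12), n=10)
```

Depending on the data, these expressions can produce non-positive values, in which case a caller would need some fallback; the source does not say how that case is handled.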
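The parameters $\alpha$ and $\beta$ obtained by the optimization method are denoted by
$\hat{\alpha}$ and $\hat{\beta}$. All individuals thus have an estimated beta-binomial distribution to describe
$$\widehat{\mathrm{Prob}}(C = c \mid C > 0) = \binom{n}{c} \frac{B(c + \hat{\alpha},\ n - c + \hat{\beta})}{B(\hat{\alpha}, \hat{\beta})}$$
where $\hat{\alpha}$ and $\hat{\beta}$ are specific to each individual in a
population of size n, as illustrated in the example of Fig. 4.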
II.C.2. Determine $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$ Estimates
In some embodiments, the method 100 determines an estimate of the probability
distribution $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$ by fitting a second beta-binomial distribution to the
per-window match counts that only include windows with a low match count while excluding
windows with a higher match count. The first beta-binomial distribution is based on the fit
to $\widehat{\mathrm{Prob}}(C = c)$, as described above.
In some embodiments, the method 100 excludes any per-window match counts
that exceed a threshold value V. Excluding windows with higher match counts is based on the
rationale that windows with a low match count are mainly due to matched segments from the
individuals’ RGHs, while high match count windows likely include matched segments that are
a mixture of RGH and non-RGH. Generally, $\mathrm{Prob}(C = c \mid \mathrm{RGH})$ can be estimated from
$\mathrm{Prob}(C = c)$, as the latter distribution represents the sum of two conditional distributions
$\mathrm{Prob}(C = c \mid \mathrm{RGH})$ and $\mathrm{Prob}(C = c \mid \text{non-RGH})$. However, neither of these conditional
distributions is known with any confidence, nor can they be easily determined from $\mathrm{Prob}(C = c)$,
in part because one cannot distinguish between matched segments that are from the
individuals’ RGHs and matched segments that are not. Attempts at estimating
$\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$ directly from $\mathrm{Prob}(C = c)$ therefore often result in poor and sometimes
very poor estimates, while adding various computational problems.
Furthermore, $\mathrm{Prob}(C = c)$ is well estimated by a beta-binomial distribution as
described above. To estimate $\mathrm{Prob}(C = c \mid \mathrm{RGH})$, the method 100 therefore fits a second
beta-binomial distribution to the per-window match counts based on windows having a match
count below or equal to the threshold value V.
As illustrated in Fig. 5, both fitted distributions of $\mathrm{Prob}(C = c)$ and
$\mathrm{Prob}(C = c \mid \mathrm{RGH})$ are similar to each other if the distribution of counts for the high value
windows that are not included in the fit of $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$ is consistent with the
distribution estimated from only the low match count windows. Fig. 5 shows the first
beta-binomial distribution based on all the non-zero count windows (shown by the solid line), the
second beta-binomial distribution based on only the low match count windows (shown by the
dashed line), and the histogram based only on the data from the low match count windows.
The maximum displayed per-window match count is 40.
In some embodiments, since the estimate $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$ is likely sensitive to
the user-specified threshold value V, the method 100 determines at least two estimates of
$\mathrm{Prob}(C = c \mid \mathrm{RGH}, V)$ using at least two different values of V. The method 100 then selects
the estimate that results in smaller weights for the windows, more effectively down weighting
the individuals’ matched segment counts as described in more detail below. In
some embodiments, the method 100 determines only one estimate $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$. In
some embodiments, V is specified as the maximum of a minimum threshold value $V_{min}$
and a quantile of all the counts in windows with at least one matched segment. In some
embodiments, the minimum threshold value $V_{min}$ is set to 11. In some embodiments, the
minimum threshold value $V_{min}$ is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 20, 25, 30, or
larger than 30. In some embodiments, at least two quantiles are specified to determine the at
least two estimates $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH}, V)$. In some embodiments, the two specified quantiles
are 75% and 90%, respectively. In some embodiments, the first specified quantile equals
50%, 55%, 60%, 65%, or 70%, and the second specified quantile equals 65%, 70%, 75%,
80%, or 85%. In some embodiments, the two quantiles are specified so that the difference
between the two quantiles is in the range of 10%-20%, and the smaller quantile equals
50%, 55%, 60%, 65%, 70%, 75% or any value in the range of 40%-80%.
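The choice of V as the larger of a minimum threshold and a quantile of the non-zero counts can be sketched as follows. The nearest-rank quantile definition and the function name are assumptions of this sketch; the source does not specify how the quantile is computed.

```python
import math


def count_threshold(nonzero_counts, quantile, v_min=11):
    """Return V = max(v_min, quantile of the non-zero per-window counts)."""
    ordered = sorted(nonzero_counts)
    # Nearest-rank quantile (an assumption; other quantile definitions exist).
    rank = max(1, math.ceil(quantile * len(ordered)))
    return max(v_min, ordered[rank - 1])


counts = list(range(1, 21))                  # toy non-zero window counts 1..20
v_first = count_threshold(counts, 0.75)      # first candidate threshold
v_second = count_threshold(counts, 0.90)     # second candidate threshold
```

Each candidate V would then be used for a separate fit of the second beta-binomial distribution.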
In some embodiments, to estimate $\mathrm{Prob}(C = c \mid \mathrm{RGH}, V)$ the method 100 uses a
joint beta-binomial distribution $f(\{k_i\} \mid n, V, \alpha, \beta, k_i > 0)$ that depends on V, $\alpha$ and $\beta$ and is
given by:
$$f(\{k_i\} \mid n, V, \alpha, \beta, k_i > 0) = \frac{\prod_{i=1}^{M} \binom{n}{k_i - 1} \frac{B(k_i - 1 + \alpha,\ n - k_i + 1 + \beta)}{B(\alpha, \beta)}}{\mathrm{Prob}(0 < C \le V \mid \alpha, \beta)} \tag{7}$$
where $\{k_i : 0 < k_i \le V\}_{i=0 \ldots M}$ is the vector of per-window counts (for the M
windows with at least one matched segment and a count less than or equal to V), n is the
population size, $\alpha$ and $\beta$ are the parameters for the beta function $B$, and
$\mathrm{Prob}(0 < C \le V \mid \alpha, \beta)$ is the probability that a per-window count of matched segments is
greater than zero and less than or equal to V conditional on $\alpha$ and $\beta$.
The probability distribution of $C$ conditional on RGH is then given by:
$$\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH}, V, c > 0) = \frac{\binom{n}{c} \frac{B(c + \alpha,\ n - c + \beta)}{B(\alpha, \beta)}}{\mathrm{Prob}(0 < C \le V \mid \alpha, \beta)} \tag{8}$$
wherein $\mathrm{Prob}(0 < C \le V \mid \alpha, \beta)$ represents a normalization factor that is a function of
$\alpha$ and $\beta$ and given by:
$$\mathrm{Prob}(0 < C \le V \mid \alpha, \beta) = \sum_{c=1}^{V} \binom{n}{c} \frac{B(c + \alpha,\ n - c + \beta)}{B(\alpha, \beta)} \tag{9}$$
In some embodiments, the method 100 uses maximum-likelihood estimation to
determine the two parameters $\alpha$ and $\beta$ of the joint beta-binomial distribution
$f(\{k_i\} \mid n, V, \alpha, \beta)$ using the same or similar optimization algorithms that are used in
estimating $\mathrm{Prob}(C = c)$ as described above. In some embodiments, the method 100 uses the
same or similar starting values for $\alpha$ and $\beta$ as are used to estimate $\mathrm{Prob}(C = c)$. Using the
same starting values for the first and the second joint beta-binomial distribution minimizes
the effect of the starting values on the differences between these two distributions.
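The normalized distribution used for the RGH estimate divides a beta-binomial term by the probability mass on counts 1 through V. The sketch below is one reading of that construction, with hypothetical function names; it evaluates the ratio for any count c > 0, including counts above V, on the assumption that the fitted curve is extended beyond the fitting range as in Fig. 5.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(c, n, alpha, beta):
    """binom(n, c) * B(c + alpha, n - c + beta) / B(alpha, beta)."""
    log_binom = lgamma(n + 1) - lgamma(c + 1) - lgamma(n - c + 1)
    return exp(log_binom + log_beta(c + alpha, n - c + beta) - log_beta(alpha, beta))


def normalization(n, v, alpha, beta):
    """Probability that a count lies in 1..V under the beta-binomial."""
    return sum(beta_binomial_pmf(c, n, alpha, beta) for c in range(1, v + 1))


def prob_count_given_rgh(c, n, v, alpha, beta):
    """Beta-binomial term for count c divided by the mass on 1..V."""
    if c <= 0 or c > n:
        return 0.0
    return beta_binomial_pmf(c, n, alpha, beta) / normalization(n, v, alpha, beta)


# Uniform case (alpha = beta = 1, n = 10) with V = 4: each of c = 1..4 gets 0.25.
p = prob_count_given_rgh(2, n=10, v=4, alpha=1.0, beta=1.0)
```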
II.C.3. Determine $\widehat{\mathrm{Prob}}(\mathrm{RGH})$ Estimate
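In some embodiments, the method 100 then determines an estimate $\widehat{\mathrm{Prob}}(\mathrm{RGH})$
based on $\widehat{\mathrm{Prob}}(C = c)$ and $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$. In some embodiments, $\widehat{\mathrm{Prob}}(\mathrm{RGH})$ is set to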
be maximum of , which is equal to , where m is the point
at which is maximized. Thus, the method 100 only implicitly estimates
$\widehat{\mathrm{Prob}}(\mathrm{RGH})$ by evaluating the ratio $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH}) / \widehat{\mathrm{Prob}}(C = c)$. In some
embodiments, the method 100 determines at least two estimates of $\mathrm{Prob}(\mathrm{RGH}, V)$ based on
the at least two estimates of $\mathrm{Prob}(C = c \mid \mathrm{RGH}, V)$. In some embodiments, the method 100 determines two
estimates $\widehat{\mathrm{Prob}}(\mathrm{RGH}, V)$ based on the two user-specified threshold values V, where each V
is determined from one of the two specified quantiles of all the counts in windows with at least one
matched segment as described above. Determining the estimate of $\mathrm{Prob}(\mathrm{RGH})$ as a ratio of
two probabilities ensures that the estimate as well as the corresponding weights fall within the
range of zero to one, since the weight is also determined as a ratio of three probabilities that
fall within the range of zero to one. The weight would be undefined if $\widehat{\mathrm{Prob}}(C = c)$ were zero.
The above-described estimation using a beta-binomial distribution ensures that all values of
$\widehat{\mathrm{Prob}}(C = c)$ are larger than zero.
II.C.4. Estimating Temporary Weights $\{w_c\}$ For Each Estimate of $\mathrm{Prob}(C = c \mid \mathrm{RGH})$
The method 100 then determines temporary weights based on the estimates
$\widehat{\mathrm{Prob}}(C = c)$, $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$, and $\widehat{\mathrm{Prob}}(\mathrm{RGH})$. These weights are temporary, since the
method 100 uses these weights to determine the final weight for each window. The temporary
weights can be represented by a vector $\{w_c\}_{c = 0 \ldots n}$ that is a series of values for different match
counts, where $w_c$ is the temporary weight for a match count of $c$. Given the at least two estimates of
$\mathrm{Prob}(C = c \mid \mathrm{RGH})$ in some embodiments, the method 100 determines temporary weights for
each estimate of $\mathrm{Prob}(C = c \mid \mathrm{RGH})$ and then selects one series of temporary weights to
determine the final weight.
If the optimization fails for one quantile, that quantile is ignored in the decision step. Hence, if
temporary weights can only be estimated for one choice of quantile, then those temporary
weights are the final weights. Considering multiple options for the quantile ensures that
a good weight vector for down weighting is not missed purely because of the choice of a fixed
quantile for all individuals. The choice of quantile for the estimation of $\mathrm{Prob}(C = c \mid \mathrm{RGH})$ is
made for each individual by observing which quantile most down weights the counts.
To determine the temporary weights, the method 100 initially determines raw
weights $\{r_c\}_{c = 0 \ldots n}$ based on the estimates $\widehat{\mathrm{Prob}}(C = c)$, $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$, and
$\widehat{\mathrm{Prob}}(\mathrm{RGH})$, wherein $r_c$ is the raw weight for a match count of $c$. In some embodiments, the
initial values of the raw weights $\{r_c\}_{c = 0 \ldots n}$ are determined by using equation 2. Subsequently, the
method 100 determines the temporary weights from the raw weights so that the temporary
weights satisfy the following three conditions:
1. the temporary weight of a window with one match in it is one,
i.e., $w_{c=1} = 1$;
2. the values of the temporary weights decrease monotonically (never increase) with increasing
match count $c$ in a specific window, i.e., $w_c \ge w_{c+1}$ for all match counts $c$; and
3. the temporary weights default to one for all windows, if the estimates of
$\mathrm{Prob}(C = c)$, $\mathrm{Prob}(C = c \mid \mathrm{RGH})$ or $\mathrm{Prob}(\mathrm{RGH})$ are poor.
In some embodiments, estimates of these probabilities are considered poor if the
optimization algorithm used to determine the beta-binomial distributions fails, the number of
per-window match counts falls below a threshold value, e.g., 20, or the number of points used
for fitting the beta-binomial distribution is below a minimum number.
The first two conditions can be met by enforcing that the weight monotonically
decreases as the match count in a specific window increases. The rationale for a default
weight of one is to avoid introducing more rather than removing noise in the TIMBER score
calculation, since estimating probabilities of the weights based on underrepresented matched
segment counts likely introduces noise in the calculation. Furthermore, since only low
per-window match counts are measured in this case, the method 100 would take all matched
segment counts, which are mainly low per-window match counts, into consideration without
weighting down any particular matched segment counts. More specifically, the estimation of the
probabilities using the beta-binomial distributions is limited when interpreting low match
count windows.
In some embodiments, the temporary weights $\{w_c\}_{c = 0 \ldots n}$ are determined by the
following two steps that are consistent with the first two above conditions, i.e., the first
temporary weight equaling one and the temporary weights monotonically decreasing with
increasing match count. In the first step, the method 100 sets the temporary weights to one
for all windows with a match count less than or equal to M, wherein M is the count for which
the ratio between the two beta-binomial estimated distributions is highest. In the second step,
after the count for which $\mathrm{Prob}(\mathrm{RGH})$ was “estimated,” any increase in the weight with
respect to increased match count is changed to be a zero increase. In some embodiments, the
method 100 performs the two steps for all $c$ by applying the following algorithm (except
when using the default weight because of poor probability estimates):
1. $$w_c = 1, \quad \text{if } c \le M \tag{10}$$
2. $$d_c = \begin{cases} 0, & \text{if } r_c > r_{c-1} \\ r_c - r_{c-1}, & \text{else} \end{cases} \tag{11}$$
$$w_c = 1 + \sum_{i=M+1}^{c} d_i, \quad \text{if } c > M \tag{12}$$
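The two-step construction of the temporary weights from the raw weights can be sketched as follows. The function name and the list layout (index c holds the value for match count c) are assumptions made for this illustration, and the raw weights in the example are made up.

```python
def temporary_weights(raw, m):
    """Temporary weights from raw weights raw[c], c = 0..n, and peak count m.

    Weights are one up to count m; beyond m, decreases of the raw weights
    are accumulated and increases are replaced by a zero change.
    """
    weights = [1.0] * len(raw)            # step 1: w_c = 1 for c <= m
    running = 1.0
    for c in range(m + 1, len(raw)):      # step 2: counts above m
        if raw[c] > raw[c - 1]:
            step = 0.0                    # an increase becomes a zero increase
        else:
            step = raw[c] - raw[c - 1]    # keep the (non-positive) decrease
        running += step
        weights[c] = running
    return weights


raw = [0.0, 0.8, 1.0, 0.7, 0.75, 0.4]    # made-up raw weights, peak at c = 2
w = temporary_weights(raw, m=2)
```

Because increases are dropped while decreases are kept, the resulting series never rises with the match count.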
Figs. 6 and 7 illustrate an example of the per-window temporary weights $\{w_c\}$
(y-axis) as a function of the possible per-window match count $c$ (x-axis). In particular, Fig. 7
shows more detail than Fig. 6, since the maximum value of the per-window match count $c$
displayed along the x-axis is set to 40. In this example, M is 4.
II.C.5. Estimating the Final Weights $\{w_i\}$
The method 100 includes determining final weights $\{w_i\}_{i=1 \ldots K}$ for weighting all
matched segment windows based on the temporary weights $\{w_{c,V}\}_{c = 0 \ldots C_{max},\,V}$. In some
embodiments, the final weights $w_i$ are the temporary weights for a given estimate of
$\mathrm{Prob}(C = c \mid \mathrm{RGH}, V)$, i.e., a given V, that minimize the sum of the weighted per-window
segment counts:
$$\{w_c\}_{c = 0 \ldots C_{max}} = \underset{\{w_{c,V}\}_V}{\arg\min} \sum_{i=1}^{K} k_i \cdot w_{c=k_i,V} \tag{13}$$
$$\{w_i\}_{i=1 \ldots K} = \{w_i : w_i = w_{c=k_i},\ i = 1 \ldots K\} \tag{14}$$
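The selection among candidate temporary weight series can be sketched as below: for each candidate threshold V, the weighted sum of the per-window counts is computed, and the series with the smallest sum supplies the final per-window weights. The dictionary layout and function name are assumptions of this sketch.

```python
def select_final_weights(window_counts, candidates):
    """Pick the temporary weight series that most down weights the counts.

    window_counts: per-window match counts k_i (all greater than zero).
    candidates: dict mapping a threshold V to a list w where w[c] is the
                temporary weight for match count c (hypothetical layout).
    """
    def weighted_total(weights):
        return sum(k * weights[k] for k in window_counts)

    best_v = min(candidates, key=lambda v: weighted_total(candidates[v]))
    best = candidates[best_v]
    # Final weight of window i is the selected temporary weight at count k_i.
    return best_v, [best[k] for k in window_counts]


counts = [1, 3, 2]
candidates = {
    11: [1.0, 1.0, 1.0, 0.9],
    14: [1.0, 1.0, 0.8, 0.5],
}
chosen_v, final_weights = select_final_weights(counts, candidates)
```

If the fit failed for one candidate, that entry would simply be left out of `candidates`, so the remaining series becomes the final one.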
In summary, given the estimates of the specific parameters for $\widehat{\mathrm{Prob}}(C = c)$,
multiple estimates of $\mathrm{Prob}(C = c \mid \mathrm{RGH})$, $\mathrm{Prob}(\mathrm{RGH})$, and some post-processing, the method 100 can determine
the weight for a given window $i$ in a given individual. Fig. 8 illustrates an example of the
weights (solid line) given the per-window counts from the original example in Fig. 2 (dashed
line). For comparative purposes only, the weight is re-scaled so that a weight of 1 has a value
of 59 according to the y-axis. The weight calculation generates for every individual a weight
value (in the interval between 0 and 1) for each of the K segment windows.
II.D. Calculate Weighted Sum of Per-window cM Widths $cM_2^{A,B}$
The method 100 includes calculating 112 a weighted sum of per-window cM
widths for each matched segment based on the first cM width and the weights associated with
the segment windows of the matched segment, according to some embodiments. In particular,
the method calculates the weighted sum of per-window cM widths for a matched segment
between person A and person B, given the individual-specific estimated window weights for
person A and person B. The weighted sum of per-window cM widths $cM_2^{A,B}$ is the sum of the
first cM widths $cM_{1,i}$ for each window $i$ that the matched segment spans, weighted by the
product of the weights for both individuals, A and B, in those segment windows:
$$cM_2^{A,B} = \sum_{i = WIN_{start}}^{WIN_{end}} w_i^A \cdot w_i^B \cdot cM_{1,i} \tag{15}$$
for a segment between individuals A and B, starting at window $WIN_{start}$ and ending at
window $WIN_{end}$, with $w_i^A$ and $w_i^B$ being the weights for both individuals, respectively, in
window $i$. If a window is either the start or the end window of the matched segment, the
window width is updated to be from the first genetic marker of the segment in the window or to the
last genetic marker of the segment in the window, respectively.
The weights are in the interval between 0 and 1 and can intuitively be thought of
as a probability that this window should contribute to the new “width”. Taking the product of
the weights for individuals A and B ensures that the window must be valid in both individuals to
be able to contribute to the weighted sum of per-window cM widths. The first cM width and the
weighted sum of per-window cM widths would be identical if all weights were equal to one. The new
“width”, or weighted sum of per-window cM widths, generally is smaller than the raw or first
cM width and, therefore, down-weights those matches in regions where there are a high
number of matches either in individual A or individual B for a large population. Thus, the
down-weighting penalizes matches that are less likely to be from the recent
genealogical history of the individuals. Fig. 9 illustrates an example of per-window match
counts on a chunk of the genome for one individual both pre-TIMBER (dashed line) and
post-TIMBER (solid line).
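The weighted sum of equation (15) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation used by method 100: the function name `weighted_cm_width`, the zero-based window indexing and the sample values are hypothetical, and the per-window first cM widths are assumed to have already been trimmed to the first and last genetic marker in the start and end windows.

```python
from typing import Sequence


def weighted_cm_width(
    first_cm_widths: Sequence[float],
    weights_a: Sequence[float],
    weights_b: Sequence[float],
    win_start: int,
    win_end: int,
) -> float:
    """Weighted sum of per-window cM widths for one matched segment.

    first_cm_widths[i] is the first cM width of the segment inside window i.
    weights_a[i] and weights_b[i] are the window weights (between 0 and 1)
    of individuals A and B. Windows are indexed from 0 here (an assumption).
    """
    if not (0 <= win_start <= win_end < len(first_cm_widths)):
        raise ValueError("window range lies outside the available windows")
    total = 0.0
    for i in range(win_start, win_end + 1):
        # A window contributes fully only if it is valid in both individuals.
        total += weights_a[i] * weights_b[i] * first_cm_widths[i]
    return total


# Illustrative numbers only; they are not taken from the patent.
widths = [2.0, 3.0, 4.0, 1.0]
w_a = [1.0, 0.5, 1.0, 0.0]
w_b = [1.0, 1.0, 0.5, 1.0]
print(weighted_cm_width(widths, w_a, w_b, 1, 2))  # 3.5
```

With all weights set to one, the same call returns 7.0, the plain sum of the first cM widths over the two windows, which matches the statement that the two widths coincide in that case.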
III. ANCESTRAL RELATIONSHIP PREDICTION
The method 100 includes estimating 114 a degree of ancestral relatedness between
two individuals based on the weighted sum of per-window cM widths of each matched
segment between the two individuals, according to some embodiments. If a pair of
individuals have a total sum of first cM widths of less than 60 cM, a cutoff that effectively
removes close relatives from the relatedness evaluation, the relationship distance is predicted
based on the sum of all the weighted sums of per-window cM widths of the shared segments.
Otherwise, the summation of the first cM widths is used. This results in a relationship
prediction that is more accurate for more distant relationships, while being as accurate for
close relationships. This is especially true for certain ethnic groups (for example, Jewish
people) and for whether they are assigned to be distantly related or not.
The method 100, by using, for example, the degree of ancestral relatedness, allows
people to find their recent relatives and provides them with new information about their
genealogy within a network of relatives (with known genealogy). In some embodiments, the
degree of relatedness between two individuals represents a probability that the two
individuals are ancestrally related and is equal to the weighted sum of per-window cM widths
of the individuals’ matched segments. In some embodiments, the degree of relatedness
between two individuals is a binary yes or no answer whether the two individuals are
ancestrally related based on the weighted sum of per-window cM widths. For example, if the
weighted sum of per-window cM widths exceeds a relatedness threshold value, the two
individuals are said to be ancestrally related. In some embodiments, the relatedness threshold
value is 20, 25, 30, 35, 40, 45, 50 cM, or any value larger than 20 cM. In some embodiments,
the relatedness threshold value is 30, 40 or 50 cM.
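The scoring rule and the binary threshold test described in this section can be sketched together. The Python below is a hedged illustration only: the function names `relatedness_score` and `is_related`, the representation of a segment as a `(first_cm_width, weighted_cm_width)` pair and the sample values are assumptions, while the 60 cM close-relative cutoff and the 30 cM default relatedness threshold are taken from the values listed in the text.

```python
from typing import Sequence, Tuple

# Each matched segment is a pair: (first cM width, weighted sum of
# per-window cM widths). This layout is an assumption for the sketch.
Segment = Tuple[float, float]


def relatedness_score(
    segments: Sequence[Segment],
    close_relative_cutoff_cm: float = 60.0,
) -> float:
    """Total cM score used to predict the relationship distance.

    Below the cutoff, the weighted widths are summed; otherwise the pair is
    treated as close relatives and the first cM widths are summed.
    """
    total_first = sum(first for first, _ in segments)
    if total_first < close_relative_cutoff_cm:
        return sum(weighted for _, weighted in segments)
    return total_first


def is_related(score_cm: float, relatedness_threshold_cm: float = 30.0) -> bool:
    """Binary answer: related if the score exceeds the threshold."""
    return score_cm > relatedness_threshold_cm


# Illustrative numbers only; they are not taken from the patent.
distant_pair = [(10.0, 4.0), (12.0, 6.0)]   # 22 cM of first widths in total
close_pair = [(50.0, 20.0), (40.0, 30.0)]   # 90 cM of first widths in total
print(relatedness_score(distant_pair), is_related(relatedness_score(distant_pair)))
print(relatedness_score(close_pair), is_related(relatedness_score(close_pair)))
```

The first pair falls under the 60 cM cutoff, so its score is the weighted total of 10.0 cM and it does not pass the 30 cM threshold; the second pair is scored on its unweighted total of 90.0 cM.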
Previously, the prediction of the distance of relationship between two individuals
(e.g., cousins, third cousins, etc.) was based purely on the total width of all IBD segments,
where the width of an IBD segment was determined by its width in recombination distance
(in cM). In this method 100, a total score is likewise calculated based on all IBD segments
between two individuals to provide the relationship distance. However, method 100 uses the sum
of the weighted sums of per-window cM widths in this calculation to predict relatedness
if the sum of first cM widths is less than 60 cM.
In one example, weight profiles are generated from matches to a static reference
set of just over 300K samples. Those weights are then used to re-score, i.e., re-weight, all
matches between any pair of individuals in the database. The weights used by method 100 are
stored in their own database. In one example embodiment, based on the analysis of test sets
and real data, a threshold cM width of 5 cM was the minimum width at which a matched
segment is included in the weight estimation 110 and the weighted sum of per-window cM
widths calculation 112.
IV. EXAMPLE
In an example, TIMBER behavior was analyzed with a simulated test set. The test
set consisted of exactly 3703 pairs of genotypes (7406 individual samples) representing
relationships from parent/child (1 meiosis) to 5th cousins (12 meioses) and all in-between.
Each relationship was created independently of all others by simulating meioses to create
genotypes that represent individuals in the relevant part of the pedigree. The “founders” from
which non-simulated genotypes come were a set of approximately 24,000 genotypes from the
database that have no close estimated relationships among them based on the analysis of the
non-weighted first cM widths. “Founders” used to create a given relationship were discarded
and not used at all for simulating any of the other relationships. The fact that the initial
founders had very few genuine relationships and were not reused helped to minimize the
likelihood of relationships among synthetic genotypes that we were unable to document, but
that were still possible. The test set might not be ideal for a real world scenario, since
individuals were randomly paired to be parents. However, since in the simulation the real
RGH segments were known, the simulation provided a way to verify how well TIMBER
helped refine the analysis of matched segments. In the test, just over 300 pairs of each
meiosis level were represented, from parent-child relationships (1 meiosis) to 5th
cousins (12 meioses).
Table 1: TIMBER results for different minimum cutoffs (= threshold cM width)
Percentage of matched segments kept
Segment    Min. of 5 cM    Min. of 6 cM    Min. of 7 cM
Table 1 illustrates that TIMBER kept the vast majority (around 90%) of initially
discovered IBD segments that overlap with real IBD segments. The results did not vary much
across the different threshold first cM widths of 5, 6 and 7 cM that were used to determine
whether a discovered IBD segment was retained. TIMBER kept at most 3% of the initially
discovered IBD segments that are false positives. Thus, TIMBER presents a very useful filter
for keeping the real signals while removing the false positive signals of IBD segments
between pairs of individuals due to the individuals’ recent genealogical history. No “ibs”
filtering was used to filter the raw and non-weighted output of matched IBD segments.
Table 2 shows that TIMBER was slightly more accurate for closer relatives, but
generally worked well across the spectrum of recent genealogical history from parent-child
relationships (1 meiosis) to 5th cousins (12 meioses). Fig. 10 illustrates results from TIMBER
using different raw cM width, i.e., first cM width, filters, including the matched percentages
of segments kept for the known and unknown meioses.
Table 2: TIMBER results for true segments by meioses and different minimum cutoffs
Percentage of matched segments kept
Min. of 5 cM    Min. of 6 cM    Min. of 7 cM
V. ADDITIONAL CONSIDERATIONS
Computing system 120 is implemented using one or more computers having one
or more processors executing application code to perform the steps described herein, and data
may be stored on any conventional non-transitory storage medium and, where appropriate,
include a conventional database server implementation. For purposes of clarity and because
they are well known to those of skill in the art, various components of a computer system, for
example, processors, memory, input devices, network devices and the like are not shown in
Fig. 1B. In some embodiments, a distributed computing architecture is used to implement the
described features. One example of such a distributed computing platform is the Apache
HADOOP® project available from the Apache Software Foundation.
In addition to the embodiments specifically described above, those of skill in the
art will appreciate that the invention may additionally be practiced in other embodiments.
Within this written description, the particular naming of the components, capitalization of
terms, the attributes, data structures, or any other programming or structural aspect is not
mandatory or significant unless otherwise noted, and the mechanisms that implement the
described invention or its features may have different names, formats, or protocols. Further,
the system may be implemented via a combination of hardware and software, as described, or
entirely in hardware elements. Also, the particular division of functionality between the
various system components described here is not mandatory; functions performed by a single
module or system component may instead be performed by multiple components, and
functions performed by multiple components may instead be performed by a single
component. Likewise, the order in which method steps are performed is not mandatory unless
otherwise noted or logically required. It should be noted that the process steps and
instructions of the present invention could be embodied in software, firmware or hardware,
and when embodied in software, could be downloaded to reside on and be operated from
different platforms used by real time network operating systems.
Algorithmic descriptions and representations included in this description are
understood to be implemented by computer programs. Furthermore, it has also proven
convenient at times, to refer to these arrangements of operations as modules or code devices,
without loss of generality.
Unless otherwise indicated, discussions utilizing terms such as “selecting” or
“computing” or “determining” or the like refer to the action and processes of a computer
system, or similar electronic computing device, that manipulates and transforms data
represented as physical (electronic) quantities within the computer system memories or
registers or other such information storage, transmission or display devices.
The present invention also relates to an apparatus for performing the operations
herein. This apparatus may be specially constructed for the required purposes, or it may
comprise a general-purpose computer selectively activated or reconfigured by a computer
program stored in the computer. Such a computer program may be stored in a non-transitory
computer readable storage medium, such as, but not limited to, any type of disk including
floppy disks, optical disks, DVDs, CD-ROMs, magnetic-optical disks, read-only memories
(ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards,
application specific integrated circuits (ASICs), or any type of media suitable for storing
electronic instructions, and each coupled to a computer system bus. Furthermore, the
computers referred to in the specification may include a single processor or may be
architectures employing multiple processor designs for increased computing capability.
The algorithms and displays presented are not inherently related to any particular
computer or other apparatus. Various general-purpose systems may also be used with
programs in accordance with the teachings above, or it may prove convenient to construct
more specialized apparatus to perform the required method steps. The required structure for
a variety of these systems will appear from the description above. In addition, a variety of
programming languages may be used to implement the teachings above.
Finally, it should be noted that the language used in the specification has been
principally selected for readability and instructional purposes, and may not have been
selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure
of the present invention is intended to be illustrative, but not limiting, of the scope of the
invention.
Claims (22)
1. A method for estimating a degree of ancestral relatedness corresponding to biological samples of two target individuals, the method comprising: genotyping the biological samples of the two target individuals; extracting haplotype data of a population of individuals, the haplotype data including a plurality of genetic markers shared among the individuals; dividing the haplotype data into segment windows based on the genetic markers; for each individual in the population: based on the genetic markers, matching segments of the haplotype data that are identical between the individual and any other individual in the population, each matched segment having a first cM width exceeding a threshold cM width and being part of one or more of the segment windows; counting the matched segments in each segment window; estimating a weight associated with each segment window based on the count of matched segments in the associated segment window; calculating a weighted sum of per-window cM widths for each matched segment based on the first cM width and the weights associated with the segment windows of the matched segment; and estimating a degree of ancestral relatedness between the two target individuals based on the weighted sum of per-window cM widths of each matched segment between the two target individuals; wherein the degree of relatedness between the two target individuals comprises a probability that the two target individuals are ancestrally related; and outputting the probability indicative of the degree of relatedness of the two target individuals that is determined based on analysis of the biological samples.
2. The method of claim 1, wherein threshold cM width is 5 cM, 6 cM, 7 cM, 8 cM, 9 cM, 10 cM, or any real number within the range of 5 cM to 10 cM.
3. The method of claim 1, wherein the weight associated with a segment window for individual A is approximated as: $w_{c}^{A} = \frac{\widehat{\mathrm{Prob}}(C = c^{A} \mid \mathrm{RGH})\, \widehat{\mathrm{Prob}}(\mathrm{RGH})}{\widehat{\mathrm{Prob}}(C = c^{A})}$, wherein $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$, $\widehat{\mathrm{Prob}}(\mathrm{RGH})$, $\widehat{\mathrm{Prob}}(C = c)$ are the estimates of a probability of an RGH segment given a measured count c of matched segments in a window, a probability of an RGH segment in a window, and the probability of measuring a count c of matched segments in a window, respectively.
4. The method of claim 3, wherein Prob(RGH)̂ is approximated by the m of: Prob (�� = ��̂ |RGH) Prob(��̂ = �� )
5. The method of claim 1, wherein weighted sum of per-window cM widths for a segment between two target individuals A and B is calculated as: $cM_2^{A,B} = \sum_{i=WIN_{start}}^{WIN_{end} \le K} w_i^A \cdot w_i^B \cdot cM_{1,i}$, wherein $cM_{1,i}$ is the first cM width for each window i that the matched segment spans and the segment between individuals A and B starts at window $WIN_{start}$ and ends at window $WIN_{end}$ with $w_i^A$ and $w_i^B$ being the weights associated with segment window i for individual A and B, respectively.
6. The method of claim 1, wherein estimating the weight comprises calculating temporary weights $\tilde{w}_c$ for the count c of matched segments in the associated segment window is approximated as: $\tilde{w}_c = 1$, if $c \le M$; $\delta_c = \begin{cases} 0, & \text{if } w_c > w_{c-1} \\ w_c - w_{c-1}, & \text{else} \end{cases}$; $\tilde{w}_c = 1 + \sum_{j=M+1}^{c} \delta_j$, if $c > M$, wherein $w_c$ is the weight based on the count of matched segments c and approximated as: $w_c = \frac{\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})\, \widehat{\mathrm{Prob}}(\mathrm{RGH})}{\widehat{\mathrm{Prob}}(C = c)}$.
7. The method of claim 1, wherein the weight associated with each segment window decreases if the count of matched segments in the associated segment window increases.
8. The method of claim 3 or 6, wherein $\widehat{\mathrm{Prob}}(C = c)$ is approximated as: $\widehat{\mathrm{Prob}}(C = c \mid c > 0) = \binom{n}{c} \frac{B(c + \alpha,\ n - c + \beta)}{B(\alpha, \beta)}$, wherein n is a size of the population and $\alpha$, $\beta$ are parameters of the Beta function B.
9. The method of claim 3, wherein the $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH})$ is based on a user specified threshold value V, and $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH}, V)$ is approximated as: $\widehat{\mathrm{Prob}}(C = c \mid \mathrm{RGH}, V, c > 0) = \binom{n}{c} \frac{B(c + \alpha,\ n - c + \beta)}{B(\alpha, \beta)} \Big/ \widehat{\mathrm{Prob}}(0 < C \le V \mid \alpha, \beta)$, wherein n is a size of the population, $\alpha$, $\beta$ are parameters of the Beta function B.
10. The method of claim 8, wherein $\alpha$ and $\beta$ are estimated by using a maximum likelihood estimation of a joint distribution approximated as: $L(\{k_i\} \mid \alpha, \beta, n, k_i > 0) = \prod_{i=1}^{K} \binom{n}{k_i - 1} \frac{B(k_i - 1 + \alpha,\ n - k_i + 1 + \beta)}{B(\alpha, \beta)}$.
11. The method of claim 9, wherein $\alpha$ and $\beta$ are estimated by using a maximum likelihood estimation of a joint distribution approximated as: $L(\{k_i\} \mid \alpha, \beta, n, V, k_i > 0) = \frac{\prod_{i=1}^{K} \binom{n}{k_i - 1} \frac{B(k_i - 1 + \alpha,\ n - k_i + 1 + \beta)}{B(\alpha, \beta)}}{\widehat{\mathrm{Prob}}(0 < C \le V \mid \alpha, \beta)}$.
12. The method of claim 1, wherein two target individuals are closely related if the sum of the first cM widths of matched segments over all matched segments between the two target individuals exceeds a pre-defined close relative threshold of 30, 40, 50, 60, 70, 80, 90, 100 cM or any value from the range of 30 to 200 cM.
13. The method of claim 1, wherein a size of the segment windows comprises 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150 genetic markers or any number that falls within the range of 50 to 500 genetic markers.
14. The method of claim 1, wherein the user specified threshold value V may include two specified quantiles, a first specified quantile equal to 50%, 55%, 60%, 65%, 70%, or 75%, and a second specified quantile equal to 75%, 80%, 85%, or 90%, to determine at least two estimates of $\mathrm{Prob}(C = c \mid \mathrm{RGH}, V)$.
15. The method of claim 1, wherein a size of the population is larger than 300,000.
16. The method of claim 1, wherein for a population size n and a number of genetic markers per segment window d, a number of segment windows K is approximated as: $K = n / d$.
17. The method of claim 1, wherein close relatives are removed from the population.
18. The method of claim 16, wherein close relatives comprise two target individuals having a total first cM width larger than 60 cM.
19. The method of claim 1, wherein the degree of relatedness between two target individuals comprises a binary yes or no answer whether the two target individuals are ancestrally related.
20. A system for estimating a degree of ancestral relatedness corresponding to biological samples of two target individuals, the system comprising one or more processors configured to execute a set of steps and at least one memory configured to store the set of steps, the set of steps comprising: genotyping the biological samples of the two target individuals; extracting haplotype data of a population of individuals, the haplotype data including a plurality of genetic markers shared among the individuals; dividing the haplotype data into segment windows based on the genetic markers; for each individual in the population: based on the genetic markers, matching segments of the haplotype data that are identical between the individual and any other individual in the population, each matched segment having a first cM width exceeding a threshold cM width and being part of one or more of the segment windows; counting the matched segments in each segment window; estimating a weight associated with each segment window based on the count of matched segments in the associated segment window; calculating a weighted sum of per-window cM widths for each matched segment based on the first cM width and the weights associated with the segment windows of the matched segment; and estimating a degree of ancestral relatedness between two target individuals based on the weighted sum of per-window cM widths of each matched segment between the two target individuals wherein the degree of relatedness between two individuals comprises a probability that the two individuals are ancestrally related; and outputting the probability indicative of the degree of relatedness of the two target individuals that is determined based on analysis of the biological samples.
21. A non-transitory computer readable medium for storing computer code comprising instructions, the instructions, when executed by one or more processors, cause the one or more processors to: genotype biological samples of two target individuals; extract haplotype data of a population of individuals, the haplotype data including a plurality of genetic markers shared among the individuals; divide the haplotype data into segment windows based on the genetic markers; for each individual in the population: based on the genetic markers, match segments of the haplotype data that are identical between the individual and any other individual in the population, each matched segment having a first cM width exceeding a threshold cM width and being part of one or more of the segment windows; count the matched segments in each segment window; estimate a weight associated with each segment window based on the count of matched segments in the associated segment window; calculate a weighted sum of per-window cM widths for each matched segment based on the first cM width and the weights associated with the segment windows of the matched segment; and estimate a degree of ancestral relatedness between two target individuals based on the weighted sum of per-window cM widths of each matched segment between the two target individuals wherein the degree of relatedness between two individuals comprises a probability that the two individuals are ancestrally related; and outputting the probability indicative of the degree of relatedness of the two target individuals that is determined based on analysis of the biological samples.
22. The method of claim 1, substantially as herein described with reference to any one of the Examples and/or
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201462063849P | 2014-10-14 | 2014-10-14 | |
US62/063,849 | 2014-10-14 | ||
PCT/US2015/055579 WO2016061260A1 (en) | 2014-10-14 | 2015-10-14 | Reducing error in predicted genetic relationships |
Publications (2)
Publication Number | Publication Date |
---|---|
NZ731808A NZ731808A (en) | 2021-10-29 |
NZ731808B2 true NZ731808B2 (en) | 2022-02-01 |