US20210183474A1 - Genetic and genealogical analysis for identification of birth location and surname information

Info

Publication number: US20210183474A1
Application number: US17/184,451
Authority: US
Inventors: Amir R. Kermany; Julie M. Granka; Keith D. Noto
Original assignee: Ancestry com DNA LLC
Current assignee: Ancestry com DNA LLC
Priority date: 2015-07-07
Filing date: 2021-02-24
Publication date: 2021-06-17
Also published as: NZ739413A; EP3320469A1; AU2021269307A1; WO2017006284A1; US20170011042A1; US10957422B2; AU2016290989A1; EP3320469A4; MX391111B; MX2018000293A; CA2991230C; CA2991230A1

Abstract

A system identifies ancestral birth locations or surnames estimated to be associated with an individual's ancestors using an individual's genetic sample. The system identifies users who are genetic matches to the individual and determines whether and how often a birth location or surname appears in the pedigrees of those users. Birth locations or surnames that appear frequently throughout the pedigrees of genetically matching users may represent birth locations or surnames that are affiliated with the individual's ancestors. The system determines whether the frequency of appearance of a birth location or surname is statistically significant to eliminate biases for certain birth locations or surnames that appear more frequently than others. The birth location or surname may be provided to the individual based on an also-determined enrichment score.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 15/203,776, filed on Jul. 6, 2016, which claims the benefit of U.S. Provisional Application No. 62/189,422, filed Jul. 7, 2015, both of which are incorporated by reference in their entirety for all purposes.

BACKGROUND

This description generally relates to population analyses on human genetic and genealogical information, and particularly to using that information to identify ancestral birth locations or ancestral surnames for an individual.
An individual may often be interested in learning more about his/her ancestral history including ancestral birth locations and/or ancestral surnames. Families may have genealogical pedigrees or family trees that may be verbally passed down from generation to generation. However, these genealogical family trees become inaccurate as they are passed along or may be missing the birth location or surnames of past ancestors altogether. Therefore, an individual often cannot rely on genealogical data provided by a family member to identify ancestral birth locations or surnames.

SUMMARY

Described embodiments identify likely birth locations and surnames of an individual's ancestors based on the individual's genotype, genotypes of a population of users who are genetic matches to the individual, and genealogical data (e.g. pedigree or family tree) of those matches. Note that no genealogical data for the individual is necessary for this identification to be performed.
In one embodiment, to generate identifications of possible ancestral birth locations and/or surnames for an individual, a birth location and surname identification system receives a genetic sample from the individual. The individual's genetic sample is sequenced and is analyzed to identify users in the system who are genetic matches to the individual. At least some of those genetically matched users will have an associated pedigree that identifies birth locations and/or surnames of their ancestors. A computer system determines the frequency of appearance of a birth location or surname amongst the pedigrees of the genetically matched users and further determines whether that frequency of appearance is of statistical significance. In various embodiments, the system performs a statistical test to prevent recommending birth locations or surnames that may be disproportionally represented. If the frequency of appearance of the birth location or surname is deemed statistically significant, the system may present it to the individual as a recommended ancestral birth location or surname.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, and accompanying drawings, where:

FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment.

FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic and genealogical data associated with users of the system in accordance with an embodiment.

FIG. 3 is a flow diagram for the operation of the birth location and surname identification module, in accordance with an embodiment.

FIG. 4 is a flow diagram for identifying ancestral birthplace or surname identifications for an individual, in accordance with an embodiment.

Note that for purposes of clarity, only one of each item corresponding to a reference numeral is included in most figures, but when implemented, multiple instances of any or all of the depicted modules may be employed, as will be appreciated by those of skill in the art.

DETAILED DESCRIPTION

Environment Overview

FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment. Depicted in FIG. 1 are an individual 101 (i.e. a human or other organism), a DNA extraction service 102, a birth location and surname identification system 100, a network 120, and a client device 160.
Individual 101 provides a DNA sample for analysis. In one embodiment, an individual 101 uses a sample collection kit to provide a DNA sample, e.g., saliva, from which genetic data can be reliably extracted according to conventional DNA processing techniques. DNA extraction service 102 receives the sample and estimates genotypes from the genetic data, for example by extracting the DNA from the sample and identifying genotype values of single nucleotide polymorphisms (SNPs) present within the DNA. The result in this example is a diploid genotype for each SNP. The birth location and surname identification system 100 receives the genetic data from DNA extraction service 102 and stores the genetic data in a DNA sample store 140 containing DNA diploid genotypes. The genetic data stored in the DNA sample store 140 may be associated with the individual 101 in the user data store 145 via one or more pointers.
Identifying ancestral birth locations or surnames that may be associated with a given individual involves analyzing genealogical information of other individuals that are genetic matches with the individual. To determine the genetic matches, analysis of identity-by-descent (IBD) is used. IBD analysis can be used to identify the familial relationship between any two people (e.g., second cousins) in a population as long as the relationship is due to shared common ancestors from the recent past (e.g., on the order of several hundred years). To date, IBD analysis has not been successfully used to accurately identify ancestral birth locations or surnames from an individual's genetic data.
To perform IBD analysis, the birth location and surname identification system 100 includes an input data processing module 110 that processes the DNA to identify shared segments of DNA data between the individual 101 and a number of other users whose DNA is already stored by the system. An IBD estimation module 115 uses the shared segments of DNA to identify those other users, known in the user data store 145 and having genetic data stored in the DNA sample store 140, who are matches to the individual. The birth location and surname identification module 300 uses the match information to access genealogical data of the genetic matches of the individual 101 in order to identify possible surnames and birth locations for the ancestors of the individual 101.
Each of these modules is described in further detail below. The breakdown of the logical functions of the system 100 into the above-introduced modules is for clarity of description only. In other embodiments, the computer system 100 may comprise more or fewer modules, and the logical structure may be differently organized. The data stores may be represented in different ways in different embodiments, such as comma-separated text files, or as databases such as relational databases (SQL) or non-relational databases (NoSQL).
The network 120 facilitates communications amongst one or more client devices 160 and the system 100. The network 120 may be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet. In various embodiments, the network 120 uses standard communication technologies and/or protocols. Examples of technologies used by the network 120 include Ethernet, 802.11, 3G, 4G, 802.16, or any other suitable communication technology. The network 120 may use wireless, wired, or a combination of wireless and wired communication technologies. Examples of protocols used by the network 120 include transmission control protocol/Internet protocol (TCP/IP), hypertext transfer protocol (HTTP), simple mail transfer protocol (SMTP), file transfer protocol (FTP), or any other suitable communication protocol.
A client device 160 is a computing device capable of transmitting and/or receiving data via the network 120. In various embodiments, the client device 160 belongs to the individual 101 that provided the genetic sample. Examples of client devices 160 include desktop computers, laptop computers, tablet computers (pads), mobile phones, personal digital assistants (PDAs), gaming devices, or any other electronic device including computing functionality and data communication capabilities. The client device 160 may use a web browser 180, such as Microsoft Internet Explorer, Mozilla Firefox, Google Chrome, Apple Safari and/or Opera, as an interface to connect with the network 120. Additionally or alternatively, specialized application software 180 that runs native on a mobile device is used as an interface to connect to the network 120.
The birth location and surname identification system 100 sends, through the network 120, a list of surnames or birthplaces to the client device 160 identified by system 100 for presentation to the individual 101. The list of surnames or birthplaces may be presented by the client device 160 on a user interface or a display screen.
Although not necessarily a part of any particular illustrated module, a new user to the system 100 who is submitting their DNA among other data will activate a new account, often through a graphical user interface (GUI) provided through a mobile software application or a web-based interface. As part of the account activation process, the system 100 receives one or more types of basic personal information about the individual 101 such as age, date of birth, geographical location of birth (e.g., city, state, county, country, hospital, etc.), complete name including first, middle, and last names as well as any suffixes, and gender. This received user information is stored in the user data store 145, in association with the corresponding DNA samples stored in the DNA sample store 140.

Genetic Data Processing

To process the data stored in the DNA sample store 140 and estimate IBD from the DNA samples, the computing system 100 comprises an input data processing module 110, and an IBD estimation module 115. These modules are described in relation with FIG. 2 which is a flow diagram for the operation of the computer system 100 for estimating and storing estimated IBD in accordance with an embodiment.
II.a. DNA Sample Receipt and Account Creation
FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic, genealogical and survey input data. The input data processing module 110 is responsible for receiving, storing and processing data received from an individual 101 via the DNA extraction service 102. The input data processing module 110 includes a DNA collection module 210, a genealogical collection module 220, a genotype identification module 240, and a genotype phasing module 250.
The DNA collection module 210 is responsible for receiving sample data from external sources (e.g., extraction service 102), processing and storing the samples in the DNA sample store 140. The DNA sample store 140 may store one or more received DNA samples linked to a user as a <key, value> pair associated with the individual 101. In one instance, the <key, value> pair is <sampleID, “GA TC TC AA”>. The data stored in the DNA sample store 140 may be identified by one or more keys used to index one or more values associated with an individual 101. In one example, keys are a userID and sampleID, or alternatively another <key, value> pair is <userID, sampleID>. In various embodiments, the DNA sample store 140 stores a pointer to a location associated with the user data store 145 associated with the individual 101. The user data store 145 will be further described below.
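The two kinds of <key, value> pairs described above can be illustrated with a minimal in-memory sketch. The class name `DnaSampleStore`, its field names, and the method names are hypothetical and are not taken from the described system; an actual embodiment may use files or a database instead.

```python
from dataclasses import dataclass, field


@dataclass
class DnaSampleStore:
    """Toy in-memory stand-in for a DNA sample store (hypothetical class)."""

    # <sampleID, genotype string> pairs, e.g. <"sample-1", "GA TC TC AA">
    genotypes: dict[str, str] = field(default_factory=dict)
    # <userID, sampleIDs> pairs linking each user to their samples
    samples_by_user: dict[str, list[str]] = field(default_factory=dict)

    def add_sample(self, user_id: str, sample_id: str, genotype: str) -> None:
        """Store one sample and link it to the user who provided it."""
        self.genotypes[sample_id] = genotype
        self.samples_by_user.setdefault(user_id, []).append(sample_id)

    def genotypes_for(self, user_id: str) -> list[list[str]]:
        """Return each sample of a user as a list of diploid SNP genotypes."""
        return [
            self.genotypes[sample_id].split()
            for sample_id in self.samples_by_user.get(user_id, [])
        ]


store = DnaSampleStore()
store.add_sample("user-101", "sample-1", "GA TC TC AA")
print(store.genotypes_for("user-101"))
```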
II.b. Genealogical Data
The genealogical collection module 220 both receives and processes genealogical data and stores the data in the user data store 145. This data may be received for the individual 101, and may have been received in the past for other users of the system, some of whom may be determined to be genetic matches to the individual 101.
The genealogical data may include a variety of different types of information. The genealogical data can take the form of a pedigree of a user (e.g., the recorded relationships in a family). To collect the data, the genealogical collection module 220 may be configured to provide an interactive GUI that asks the user questions or provides a menu of options, and receives user input that can be processed to obtain the genealogical data. Examples of genealogical data that may be collected include, but are not limited to, names (first, last, middle, suffixes), birth locations (e.g., county, city, state, country, hospital, global map coordinates), date of birth, date of death, marriage information, family relations (manually provided rather than genetically identified), etc. These data may be manually provided or automatically extracted via, for example, optical character recognition (OCR) performed on census records, town or government records, or any other item of printed or online material.
The pedigree information associated with a user may include a genealogical graph. For example, the genealogical graph may include one or more specified nodes. Each specified node in the genealogical graph represents either the user or an ancestor of the user that could have passed down genetic material to the user.
The pedigree information provided by users may or may not be accurate or complete. The genealogical collection module 220 is responsible for filtering the received pedigree data based on one or more quality criteria in an effort to discard lower quality genealogical data. For example, the genealogical collection module 220 may filter the received pedigree data by excluding all pedigree nodes associated with a stored DNA sample that do not satisfy all of the following criteria: (1) the recorded death date for the linked pedigree node corresponds to official records (when available); (2) the gender is the same as the gender provided by the user; and (3) the birth date is within 3 years of the birth date provided by the user. The user may be prompted via GUI to resolve any discrepancies identified by module 220. In some embodiments, all received genealogical data marked as “private” are excluded from any subsequent analysis to ensure that any privacy requirements imposed on the data are met.
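The three example criteria and the privacy exclusion can be sketched as a single predicate. This is an illustration under stated assumptions, not the module's actual logic: the `LinkedNode` class and its field names are hypothetical, and "within 3 years" is approximated as 3 × 365.25 days.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LinkedNode:
    """Pedigree node linked to a stored DNA sample (hypothetical fields)."""

    gender: str
    birth_date: date
    death_date: Optional[date] = None
    official_death_date: Optional[date] = None  # None: no official record available
    private: bool = False


def passes_quality_filter(node: LinkedNode, user_gender: str, user_birth_date: date) -> bool:
    """Return True only if the node meets all example criteria."""
    # Data marked private is excluded from any subsequent analysis.
    if node.private:
        return False
    # (1) Death date must agree with official records, when a record exists.
    if node.official_death_date is not None and node.death_date != node.official_death_date:
        return False
    # (2) Gender must equal the gender provided by the user.
    if node.gender != user_gender:
        return False
    # (3) Birth date within 3 years of the user-provided birth date
    #     (assumption: 3 years is taken as 3 * 365.25 days).
    max_gap_days = 3 * 365.25
    return abs((node.birth_date - user_birth_date).days) <= max_gap_days


node = LinkedNode("F", date(1950, 5, 1), date(2010, 1, 1), date(2010, 1, 1))
print(passes_quality_filter(node, "F", date(1951, 1, 1)))
```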
II.c. Processing and Phasing DNA Samples
The genotype identification module 240 accesses the collected DNA data from the DNA collection module 210 or the sample store 140 and identifies autosomal SNPs so that the individual's diploid genotype on autosomal chromosomes can be computationally phased. The genotype identification module 240 provides the identified SNPs to the genotype phasing module 250 which phases the individual's diploid genotype based on the set of identified SNPs. The genotype phasing module 250 generates a pair of estimated haplotypes for each diploid genotype. The estimated haplotypes are then stored in the user data store 145 in association with the individual 101, and may also be stored in association with or verified against the genotypes of the individual's parents, who may also have their own separate accounts in the system 100. A variety of different computational phasing techniques may be used including, for example, the techniques described in U.S. Patent Application No. 2016/061,568, filed on Jan. 17, 2014, which is hereby incorporated by reference in its entirety. The phasing module 250 stores phased genotypes in the user data store 145.
II.d. IBD Estimation
The IBD estimation module 115 is responsible for identifying IBD segments (also referred to as IBD estimates) from phased genotype data (haplotypes) between the individual 101 and a user stored in the user data store 145. IBD segments are chromosome segments identified in the individual 101 and a user that are putatively inherited from a recent common ancestor. Typically, an individual 101 and a user who are closely related share a relatively large number of IBD segments, and the IBD segments tend to have greater length (individually or in aggregate across one or more chromosomes). Alternatively, an individual 101 and a user who are more distantly related share relatively few IBD segments, and these segments tend to be shorter (individually or in aggregate across one or more chromosomes).
In one embodiment, the IBD estimation algorithm used by the IBD estimation module 115 to estimate (or infer) IBD segments between an individual 101 and a user is as described in U.S. patent application Ser. No. 14/029,765, filed on Sep. 17, 2013, which is hereby incorporated by reference in its entirety. A further processing step may be performed on these inferred IBD segments by applying the technique described in PCT Patent Application No. PCT/US2015/055579, filed on Oct. 14, 2015, which is hereby incorporated by reference in its entirety. The identified IBD segments are stored in the user data store 145 in association with the individual 101.
The IBD estimation module 115 is configured to estimate IBD segments between the individual 101 and large numbers of users stored in the user data store 145. In some embodiments of this module, the computing system has been optimized to efficiently handle large amounts of IBD data. Said another way, IBD is estimated across a large number of individuals based on their DNA. For example, in one implementation, the IBD estimation module 115 (and computing system 100 generally) distributes IBD computations over a Hadoop computing cluster, internal to or external from computing system 100, and stores the phased genotypes used in the IBD computations in a database so that IBD estimates for new accounts/individuals can be quickly compared to previously processed individuals.

Birth Location and Surname Identification System

III.a. Identifying Genetic Matches
FIG. 3 depicts the birth location and surname identification module 300, in accordance with an embodiment. The birth location and surname identification module 300 includes a genetic match module 305, a location frequency calculator 310, a surname frequency calculator 315, a statistical analysis module 320, an enrichment score module 325, a list generation module 330, a location frequency store 350, a surname frequency store 360, a location score store 380, and a surname score store 390.
The genetic match module 305 retrieves the IBD estimates between the individual 101 and the users in the user data store 145 and determines whether the individual 101 and any given other user are a genetic match. In one embodiment, the individual 101 and a user are a match if they have higher than a threshold amount of IBD segment sharing, as determined by the IBD estimation module 115. A match may indicate that the individual 101 and the user are related (e.g. parent/child, sibling, aunt/uncle, first cousin, first cousin once removed, second cousin, second cousin once removed). The genetic match module 305 identifies all users in the user data store 145 that are considered matches to the individual 101. The set of user matches is referred to herein by M_u. In an example, the number of matches is limited to the top 3000 matches (i.e. |M_u|≤3000) sorted based on amount of IBD segment sharing.
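As a rough illustration of this selection step, the sketch below keeps users whose shared IBD amount exceeds a threshold and caps the result at the top matches. The function name, the dictionary input, and the threshold value are assumptions made for illustration; the unit of the IBD amount is likewise not specified here.

```python
def select_matches(shared_ibd, threshold, max_matches=3000):
    """Return user IDs forming the match set M_u.

    shared_ibd: dict mapping user ID -> amount of IBD shared with the individual
                (hypothetical input shape).
    threshold:  minimum amount of sharing; a user matches only above it.
    """
    above = [(user_id, amount) for user_id, amount in shared_ibd.items() if amount > threshold]
    # Sort by amount of IBD sharing, largest first, then cap the set size.
    above.sort(key=lambda pair: pair[1], reverse=True)
    return [user_id for user_id, _ in above[:max_matches]]


shared = {"a": 50.0, "b": 5.0, "c": 120.0, "d": 20.0}
print(select_matches(shared, threshold=10.0, max_matches=2))
```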
III.b. Match and Background Frequency for a Birth Location
The genetic match module 305 provides the set of user genetic matches, M_u, to the location frequency calculator 310 to determine an identification of possible birth locations associated with the user. The location frequency calculator 310 determines how frequently a particular birth location appears amongst the pedigrees of the users within the set of user matches, M_u. To do this, the location frequency calculator 310 retrieves, for each matching user in the set of user matches M_u, the matching user's pedigree. The pedigree includes a genealogical graph from the user data store 145. For example, a genealogical graph in the pedigree may be the matching user's family tree that describes the relationship between the matching user and each of the matching user's relatives. Each relative in the matching user's pedigree has associated genealogical data such as the relative's birth location.
For a given matching user v ∈ M_u, T_v denotes a set of birth locations indicated in matching user v's pedigree. For each matching user, the location frequency calculator 310 identifies the set of birth locations, T_v, in the matching user's pedigree. For example, the location frequency calculator 310 may identify a matching user v as having 10 relatives born in New York City, 2 relatives born in Boston, and 2 relatives born in Los Angeles. Therefore, the elements in T_v include New York City, Boston, and Los Angeles. A presence indicator, a_{v,i}, may be represented by the indicator function representing whether a birth location i is indicated in matching user v's pedigree:
$\begin{matrix} a_{v, i} = \begin{cases} 1 & \text{if } i \in T_{v} \\ 0 & \text{otherwise} \end{cases} & (1) \end{matrix}$
Thus, for this example, matching user v has an indicator function score of 1 for birth locations of New York City, Boston, and Los Angeles. All other birth locations (e.g. Washington D.C., elsewhere) would have a presence indicator and corresponding indicator function score of 0.
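The example above can be reproduced with a short sketch. The pedigree is modeled here as a list of dictionaries with a `birth_location` key, which is an assumed representation chosen for illustration only.

```python
def birth_locations(pedigree):
    """Collect T_v: the set of distinct birth locations recorded in a pedigree."""
    return {person["birth_location"] for person in pedigree if person.get("birth_location")}


def presence_indicator(location, locations_in_pedigree):
    """Equation (1): 1 if the location is in T_v, otherwise 0."""
    return 1 if location in locations_in_pedigree else 0


# 10 relatives born in New York City, 2 in Boston, 2 in Los Angeles,
# plus one relative with no recorded birth location.
pedigree_v = (
    [{"birth_location": "New York City"}] * 10
    + [{"birth_location": "Boston"}] * 2
    + [{"birth_location": "Los Angeles"}] * 2
    + [{"birth_location": None}]
)
t_v = birth_locations(pedigree_v)
print(presence_indicator("New York City", t_v), presence_indicator("Washington D.C.", t_v))
```

Note that the indicator records only presence: ten relatives born in New York City contribute the same score of 1 as a single relative would.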
The location frequency calculator 310 repeats this process for the set of matches M_u. For a given birth location, i, the total number of pedigrees (m_i) of users in the set of matches that have this birth location is determined according to:
$\begin{matrix} m_{i} = \sum_{v \in M_{u}} a_{v, i} & (2) \end{matrix}$
For example, the location frequency calculator 310 sums the indicator function scores for each birth location. Thus, if the number of matching users is 1000, the maximum number of pedigrees (max(m_i)) that have the birth location is 1000, which would occur if every user in the set of matches M_u has the birth location in their pedigree.
The location frequency calculator 310 uses the total number of pedigrees, m_i, to determine p_i, the match frequency of a birth location i, where p_i is determined according to:
$\begin{matrix} p_{i} = \frac{m_{i}}{|M_{u}|} & (3) \end{matrix}$
where |M_u| is the total number of matching users in the set M_u. Returning to the previous example, if all 1000 users in the set of matches M_u (e.g. |M_u|=1000) had the birth location of New York City in their pedigree (e.g. m_i=1000), then the match frequency of New York City is p_{New York City}=1. Therefore, the match frequency of a birth location represents how often matching users in the set of matches M_u are associated with the birth location, which can be used as a way of determining an association between the ancestors of the individual 101 and the birth location. This match frequency is stored in the location frequency store 350.
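Equations (2) and (3) can be sketched together as follows. The input is assumed to be one set of birth locations per matching user; the function name and input shape are illustrative choices rather than part of the described system.

```python
from collections import Counter


def match_frequencies(location_sets):
    """Compute p_i = m_i / |M_u| for every birth location seen among the matches.

    location_sets: list with one set T_v per matching user v in M_u.
    """
    n_matches = len(location_sets)
    if n_matches == 0:
        return {}
    counts = Counter()  # m_i: number of pedigrees containing location i
    for t_v in location_sets:
        counts.update(set(t_v))  # each pedigree counts at most once per location
    return {location: m_i / n_matches for location, m_i in counts.items()}


sets_for_matches = [
    {"New York City", "Boston"},
    {"New York City"},
    {"Los Angeles"},
    {"New York City", "Los Angeles"},
]
print(match_frequencies(sets_for_matches))
```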
The location frequency calculator 310 also calculates a background frequency for each birth location i. The background frequency of a birth location provides an indication as to how often the birth location appears amongst the greater population of users stored in the system, including those who are not matches to the individual. For example, high population cities such as New York City or Boston may have higher background frequencies than smaller cities such as Cheyenne, Wyo. Here, D represents the total set of users in the system. Generally the number of users in set D is significantly larger (e.g. multiple orders of magnitude larger) than the number of matching users in the set of matches M_u. Each user in D may have a corresponding pedigree. Altogether, this forms the set of all pedigrees stored in the user data store 145.
To determine the background frequency, the location frequency calculator 310 may use a similar indicator function as was previously shown in equation (1) to calculate whether a birth location i, exists in the pedigree corresponding to user w in the set D:
$\begin{matrix} a_{w, i} = \begin{cases} 1 & \text{if birth location } i \text{ exists in user } w\text{'s pedigree} \\ 0 & \text{otherwise} \end{cases} & (4) \end{matrix}$
The location frequency calculator 310 summates the total number of pedigrees that each have the birth location, each pedigree corresponding to a user w in the set of D. To calculate the background frequency, the location frequency calculator 310 divides the summated total number of users in the set of D that have the birth location by the total number of users in the set of D. Therefore, the background frequency of a birth location, i, is expressed in equation 5 as
$\begin{matrix} q_{i} = \frac{1}{|D|} \sum_{w \in D} a_{w, i} & (5) \end{matrix}$
This background frequency is stored in the location frequency store 350.
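Because the match frequency and the background frequency differ only in the set of users they average over, one helper can serve both. The sketch below uses hypothetical user IDs and a small invented population purely to show the arithmetic; in practice the set D is described as far larger than M_u.

```python
def location_frequency(location, user_ids, locations_by_user):
    """Fraction of the given users whose pedigree contains the birth location.

    With user_ids = D this is the background frequency q_i;
    with user_ids = M_u it is the match frequency p_i.
    """
    ids = list(user_ids)
    if not ids:
        raise ValueError("user set must not be empty")
    hits = sum(1 for uid in ids if location in locations_by_user.get(uid, set()))
    return hits / len(ids)


# Toy population D of 10 users (illustrative data only).
locations_by_user = {f"u{k}": {"Boston"} for k in range(10)}
locations_by_user["u0"] = {"Boston", "Cheyenne"}
locations_by_user["u1"] = {"Cheyenne"}
all_users = list(locations_by_user)       # D
matches = ["u0", "u1", "u2", "u3"]        # M_u, a subset of D

q_cheyenne = location_frequency("Cheyenne", all_users, locations_by_user)
p_cheyenne = location_frequency("Cheyenne", matches, locations_by_user)
print(q_cheyenne, p_cheyenne)
```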
III.c. Match and Background Frequency for a Surname
The surname frequency calculator 315 calculates a match frequency and background frequency for each surname in the pedigrees of matching users in a similar fashion as was discussed for birth locations in section III.b. The surname frequency calculator 315 receives the set of user matches, M_u, from the genetic match module 305 for an individual 101 and determines a match frequency, p_j, that represents how often a given surname, j (e.g. Bradley, O'Malley, Johnson), appears amongst the pedigrees of users, v, in the set of user matches, M_u. For example, for a surname “Bradley” (e.g. j=“Bradley”),
$\begin{matrix} p_{j} = \frac{\sum_{v \in M_{u}} a_{v, j}}{|M_{u}|} & (6) \end{matrix}$
where a_{v,j} is the indicator function previously described in equation (1). The surname frequency calculator also calculates the background frequency for the surname “Bradley” in the total set of users in the system, D.
$\begin{matrix} q_{j} = \frac{1}{|D|} \sum_{w \in D} a_{w, j} & (7) \end{matrix}$
The match frequency, p_j, and background frequency, q_j, may each be stored in the surname frequency store 360.
Often, there will be many more surnames in the user store 145 than birth locations. Many of those surnames will be similar to others, with only minor variations in spelling. Unmodified, these variant spellings may reduce the efficacy of the surname frequency calculation. To address this, the surname frequency calculator 315 may first normalize the surnames, meaning that the surname frequency calculator 315 may consider many alternate spellings as being the same surname for purposes of frequency calculations. Examples of such alternate spellings may include use of characters not used in English (e.g., “o” versus “ø”); capitalization, punctuation, and spacing (“O'Malley” versus “Omalley”); suffixes (“Jr.”); and commentary (“Johnson (WWII Veteran)”). A simple normalization is performed that ignores capitalization and punctuation, and removes commentary, thereby reducing the set of surnames under consideration. Under this simple normalization, the remaining alternate spellings and misspellings may still be interpreted by the surname frequency calculator 315 as different surnames.
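A minimal sketch of such a simple normalization is shown below. It assumes that commentary appears in parentheses and that "punctuation" means ASCII punctuation; both are assumptions made for illustration. Collapsing extra whitespace is an additional convenience step not stated above.

```python
import re
import string

# Assumption: commentary is any parenthesized text, e.g. "(WWII Veteran)".
_COMMENTARY = re.compile(r"\([^)]*\)")
# Assumption: punctuation means the ASCII punctuation characters.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize_surname(raw):
    """Ignore capitalization and punctuation, and remove commentary."""
    without_commentary = _COMMENTARY.sub(" ", raw)
    without_punctuation = without_commentary.translate(_PUNCTUATION_TABLE)
    # Lowercase and collapse runs of whitespace left behind by the removals.
    return " ".join(without_punctuation.lower().split())


for name in ["O'Malley", "Omalley", "Johnson (WWII Veteran)", "Bjørn"]:
    print(repr(name), "->", repr(normalize_surname(name)))
```

This sketch deliberately leaves non-English characters and suffixes such as "Jr." untouched, so those variants are still counted as different surnames.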
III.d. Calculating the Statistical Likelihood
The statistical analysis module 320 identifies which birth locations and surnames are sufficiently notable for the individual 101 under consideration so as to merit possibly providing to the individual 101 as likely being associated with their own ancestors.
For example, there may be a total of 1000 users in the set of matches, M_u. Assume that 10% of the users that are in the set of matches, M_u, have a particular birth location in their pedigree (i.e. m_i=100, therefore p_i=0.1). A match frequency, p_i, of 10% may appear to be a very high number of appearances for a birth location. However, if the background frequency, q_i, is also close to 10%, meaning that the birth location appears approximately equally frequently in the pedigrees of all users in the system, then a match frequency, p_i, of 10% may not be sufficiently notable to be worth identifying as associated with the individual.
The statistical analysis module 320 receives the match frequency, p_i, and background frequency, q_i, for all different birth locations, i, from the location frequency calculator 310. The statistical analysis module 320 conducts a statistical analysis test to determine whether the match frequency of a given birth location is sufficiently notable. For each birth location, i, the statistical analysis module 320 determines the likelihood of observing the received match frequency, p_i, and background frequency, q_i, under a null hypothesis H₀ scenario. An example of a null hypothesis H₀ is the assumed scenario where the match frequency and background frequency are the same (i.e. p_i=q_i). Conversely, the alternative hypothesis H₁ is the assumed scenario where the match frequency and background frequency are non-equal (i.e. p_i≠q_i), with the assumption that if, particularly, p_i>q_i, then p_i, and thus i, may be statistically significant and therefore worth possibly providing to the individual.
Therefore, the statistical analysis module 320 determines the likelihood of observing the received match frequency and background frequency under the assumption that the match frequency, p_i, and background frequency, q_i, are equal. However, if the received match frequency is sufficiently larger than the received background frequency, then the null hypothesis H₀ is rejected in favor of the alternative hypothesis H₁. What constitutes a sufficient difference between the received match frequency, p_i, and background frequency, q_i, will be discussed further below in regards to the summary statistic S_i.
A similar calculation may be performed for surnames by receiving the match frequency, p_j, and background frequency, q_j, for all surname identifications, j, from the surname frequency calculator 315. The subsequent discussion focuses on conducting a statistical test for a birth location, i. This discussion may also refer to conducting a statistical test for a surname.
The statistical test is performed under a null hypothesis H₀, the assumed scenario where the match frequency, p_i, and the background frequency, q_i, are equal. In various embodiments, the statistical analysis module 320 conducts a maximum likelihood ratio test. In other examples, the statistical analysis test may be a Pearson's chi-squared test, a Z-test, or an F-test. The test statistic, Λ, for the maximum likelihood ratio test is determined according to:
$\begin{matrix} Λ = \frac{L (m_{i} | H_{0})}{\max_{p \in (0, 1)} L (m_{i} | H_{1})} & (8) \end{matrix}$
where L(m_i|H₀) denotes the likelihood of observing m_i under the null hypothesis that the match frequency and background frequency are equal (i.e. p_i=q_i) for a birth location i, and max_{p∈(0,1)} L(m_i|H₁) denotes the likelihood of observing m_i under the alternative hypothesis when varying p between 0 and 1. Thus, the test statistic is a ratio between a first likelihood of observing the match frequency and background frequency under the null hypothesis and a second likelihood of observing the match frequency and background frequency under the alternative hypothesis.
A summary statistic, S_i, is determined using Λ according to:
$\begin{matrix} S_{i} = - \log (Λ) = m_{i} \log \frac{p_{i}}{q_{i}} + (\langle M_{u} \rangle - m_{i}) \log \frac{1 - p_{i}}{1 - q_{i}}, & (9) \end{matrix}$
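By way of a non-limiting sketch, the right-hand side of equation 9 may be recovered under the assumption (made here for illustration only) that m_i is modeled as a binomial count over ⟨M_u⟩ matching pedigrees with per-pedigree probability p, so that:
$\begin{matrix} L (m_{i} | p) = \binom{\langle M_{u} \rangle}{m_{i}} p^{m_{i}} (1 - p)^{\langle M_{u} \rangle - m_{i}} \end{matrix}$
Under the null hypothesis H₀, p is fixed at q_i. Under the alternative hypothesis H₁, the likelihood is maximized at p = m_i/⟨M_u⟩, which this sketch takes to be the match frequency p_i. The binomial coefficient cancels in the ratio, giving:
$\begin{matrix} - \log (Λ) = \log \frac{p_{i}^{m_{i}} (1 - p_{i})^{\langle M_{u} \rangle - m_{i}}}{q_{i}^{m_{i}} (1 - q_{i})^{\langle M_{u} \rangle - m_{i}}} = m_{i} \log \frac{p_{i}}{q_{i}} + (\langle M_{u} \rangle - m_{i}) \log \frac{1 - p_{i}}{1 - q_{i}} \end{matrix}$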
The statistical analysis module 320 calculates a summary statistic for each birth location i. Note that if the match frequency, p_i, and background frequency, q_i, for a birth location received from the location frequency calculator 310 are equal, then the value of the summary statistic is zero. Additionally, the summary statistic S_i increases in magnitude as the difference between the match frequency, p_i, and background frequency, q_i, increases in magnitude.
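As one illustrative, non-limiting sketch, the summary statistic of equation 9 could be computed as shown below. The function name `summary_statistic` and its parameter names are hypothetical, and the sketch assumes that the match frequency is p_i = m_i/⟨M_u⟩, where `n_matches` stands in for ⟨M_u⟩.

```python
import math


def summary_statistic(m_i: int, n_matches: int, q_i: float) -> float:
    """Compute S_i = -log(Lambda) per equation 9.

    m_i       -- number of matching pedigrees containing birth location i
    n_matches -- number of matching pedigrees (stands in for <M_u>)
    q_i       -- background frequency of birth location i
    Assumption: the match frequency is p_i = m_i / n_matches.
    """
    if n_matches <= 0 or not 0 <= m_i <= n_matches:
        raise ValueError("require 0 <= m_i <= n_matches and n_matches > 0")
    if not 0.0 < q_i < 1.0:
        raise ValueError("q_i must lie strictly between 0 and 1")

    p_i = m_i / n_matches
    s_i = 0.0
    # A term whose count is zero contributes nothing (limit of x*log(x) as x -> 0).
    if m_i > 0:
        s_i += m_i * math.log(p_i / q_i)
    if m_i < n_matches:
        s_i += (n_matches - m_i) * math.log((1.0 - p_i) / (1.0 - q_i))
    return s_i


# Hypothetical example: 20 of 100 matching pedigrees contain a location
# whose background frequency is 0.1.
example_s = summary_statistic(20, 100, 0.1)
```

Because the statistic computed in this manner is non-negative whether p_i is above or below q_i, an implementation interested only in enrichment may additionally check that p_i exceeds q_i.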
According to Wilks' theorem, as the sample size increases, twice the summary statistic, 2S_i, will asymptotically follow a first order chi-squared distribution (i.e., a chi-squared distribution with one degree of freedom). Therefore, the statistical analysis module 320 may calculate the p-value for rejecting the null hypothesis H₀ based on the first order chi-squared distribution of 2S_i.
For example, at a significance level of 0.995 (p-value=5×10⁻³), the null hypothesis is rejected if S_i>4 (or 2S_i>8) based on the first order chi-squared distribution. In other words, if the match frequency is sufficiently larger than the background frequency for a particular birth location or surname such that the summary statistic S_i>4, then the alternative hypothesis (i.e., where the match frequency does not equal the background frequency) is accepted. This indicates that the particular birth location or surname is sufficiently notable to be associated with the ancestors of the individual 101.
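As a minimal sketch of the p-value calculation, the upper-tail probability of a chi-squared distribution with one degree of freedom can be written in closed form using the complementary error function, P(X>x) = erfc(√(x/2)). With x = 2S_i, this reduces to erfc(√S_i). The function name `p_value_from_summary_statistic` is hypothetical.

```python
import math


def p_value_from_summary_statistic(s_i: float) -> float:
    """Return P(X > 2*S_i) for X chi-squared distributed with one degree of freedom.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)),
    so substituting x = 2 * S_i gives erfc(sqrt(S_i)).
    """
    if s_i < 0:
        raise ValueError("the summary statistic must be non-negative")
    return math.erfc(math.sqrt(s_i))


# Threshold used in the example above: S_i = 4 (i.e., 2*S_i = 8).
p_at_threshold = p_value_from_summary_statistic(4.0)
```

For S_i = 4 this evaluates to roughly 0.0047, consistent with the p-value of approximately 5×10⁻³ given in the example above.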
The exact value of the significance level may vary by implementation, or according to more specific factors. Also, although the above embodiment describes the significance level in terms of a p-value, in practice it may be any threshold which determines whether or not a particular birth location i or surname j is sufficiently statistically significant to merit consideration for being provided to the individual.
For birth locations specifically, the statistical analysis module 320 may adjust the significance level (e.g., p-value) for a birth location i, based on the country of origin of the birth location. In various embodiments, the country of origin of the birth location i is determined based on the latitude and longitude coordinates associated with the birth location. More specifically, the particular significance level for a country of origin is chosen based on the number of users in the database having that country in their respective pedigrees and the number of matches that a given individual has in the database that have a pedigree attached to them. For example, a birth location that derives from a country having a large number of users associated with that country (e.g., United States or Nordic countries) may utilize a relatively high significance level (e.g., 0.995), whereas a birth location that derives from a country having relatively few users associated with that country (e.g., Mexico, Russia, Eastern Europe) may utilize a relatively lower significance level (e.g., 0.9). Thus, depending upon the country, the threshold needed to determine whether the difference between the birth location match frequency and the corresponding background frequency is statistically significant may vary. This allows the module 300 to better take into account the relative availability of data regarding a particular country in determining whether or not particular birth locations are statistically significant.
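One possible, non-limiting way to turn a country-specific significance level into a cutoff on S_i is sketched below. It relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable, so the cutoff on 2S_i is z², where z is the standard normal quantile at (1 + level)/2. The mapping `COUNTRY_SIGNIFICANCE`, the default level, and the function names are hypothetical and chosen only to mirror the example values above.

```python
from statistics import NormalDist

# Hypothetical per-country significance levels, mirroring the example values above.
COUNTRY_SIGNIFICANCE = {"United States": 0.995, "Mexico": 0.9}
DEFAULT_SIGNIFICANCE = 0.995  # assumed fallback, not specified in the description


def summary_statistic_threshold(significance_level: float) -> float:
    """Return the cutoff on S_i for a given significance level.

    The chi-squared(1) quantile at `significance_level` equals z**2, where z is
    the standard normal quantile at (1 + significance_level) / 2. Because the
    test is applied to 2 * S_i, the cutoff on S_i itself is z**2 / 2.
    """
    if not 0.0 < significance_level < 1.0:
        raise ValueError("significance_level must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + significance_level) / 2.0)
    return z * z / 2.0


def is_significant(s_i: float, country: str) -> bool:
    """Compare S_i against the cutoff for the birth location's country of origin."""
    level = COUNTRY_SIGNIFICANCE.get(country, DEFAULT_SIGNIFICANCE)
    return s_i > summary_statistic_threshold(level)
```

Under this sketch, a level of 0.995 gives a cutoff on S_i of approximately 3.94 (close to the value of 4 used in the example above), and a level of 0.9 gives a cutoff of approximately 1.35.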
III.e Calculating the Enrichment Score
The statistical analysis module 320 determines which birth locations and surnames are statistically significant given the information known about them from the underlying pedigree data from users genetically matched to an individual 101. The enrichment score module 325 uses this binary determination of statistical significance to determine an enrichment score representing a strength of association between the birth location or surname and the ancestors of the individual 101.
To do this, the enrichment score module 325 determines an enrichment score, x_i, for each birth location, i, or enrichment score, x_j, for each surname, j. The enrichment score module 325 receives the summary statistic, S_i, for each birth location or the summary statistic, S_j, for each surname. Additionally, the enrichment score module 325 receives the match frequency, p_i or p_j, and the background frequency, q_i or q_j, for birth locations and surnames.
To calculate an enrichment score of a birth location, i, at the previously selected significance level of 0.995, the enrichment score module 325 calculates the enrichment score to be:
$\begin{matrix} x_{i} = \begin{cases} p_{i} * \log \frac{p_{i}}{q_{i}} & \text{if } S_{i} > 4 \\ 0 & \text{otherwise} \end{cases} & (10) \end{matrix}$
The exact form of the calculation may vary in practice; in particular, the significance level may vary by country as described above. Note that if the match frequency, p_i, and the background frequency, q_i, are not significantly different, then the enrichment score is close to zero, indicating that the particular birth location may not be very relevant to ancestors of the individual 101. Scaling the match frequency by a factor of log(p_i/q_i) reduces bias towards highly popular birth locations and surnames because they are likely to have a high background frequency (high q) as well, thereby reducing the enrichment score.
In one embodiment, the enrichment score module 325 calculates the enrichment score for a surname, j, in the same manner according to equation 10. In another embodiment, the enrichment score module 325 calculates the enrichment score for a surname, j, as
$\begin{matrix} x_{j} = p_{j} * \log \frac{p_{j}}{q_{j}} & (11) \end{matrix}$
for all summary statistic values.
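The two enrichment score calculations described above may be sketched as follows, by way of example only. The function name `enrichment_score`, the `gated` flag, and the default threshold of 4.0 (corresponding to the 0.995 significance level used above) are illustrative assumptions.

```python
import math


def enrichment_score(p: float, q: float, s: float,
                     threshold: float = 4.0, gated: bool = True) -> float:
    """Compute an enrichment score p * log(p / q).

    gated=True  -- the score is zero unless the summary statistic s exceeds
                   the threshold (the form of equation 10).
    gated=False -- the score is computed for all summary statistic values
                   (the alternative surname embodiment).
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if gated and s <= threshold:
        return 0.0
    if p <= 0.0:
        # p * log(p / q) tends to 0 as p tends to 0.
        return 0.0
    return p * math.log(p / q)


# Hypothetical values: match frequency 0.2, background frequency 0.1.
score_significant = enrichment_score(0.2, 0.1, s=4.44)
score_not_significant = enrichment_score(0.2, 0.1, s=1.0)
score_ungated = enrichment_score(0.2, 0.1, s=1.0, gated=False)
```

In the ungated form, an item whose match frequency is below its background frequency receives a negative score, so a downstream ranking step may wish to apply its own minimum score.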
The respective enrichment scores for a birth location and surname are stored in the location score store 380 and surname score store 390, respectively.
III.f Generating a List of Identified Birth Locations or Surnames
The enrichment score module 325 provides the enrichment score associated with each birth location or surname to the list generation module 330. The list generation module 330 may rank and/or provide one or more birth locations or surnames to the individual through client device 160 based on their associated enrichment scores. Exactly how the list generation module 330 provides birth locations and surnames, how many, and in what form (e.g., lists, etc.) may vary by implementation. In one embodiment, the list generation module 330 may set a minimum threshold in order for a particular birth location or surname to be recommended. For example, a birth location or surname may have to meet a certain minimum threshold enrichment score, and/or must appear in at least some number of pedigrees (e.g., m_i≥3) for it to be recommended.
In one embodiment, the list generation module 330 generates a list 370 including only the top N birth locations or surnames by enrichment score. The list 370 is sent through the network 120 to the client device 160 for consumption by the individual 101.
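A minimal sketch of one such list generation step is given below. The `Candidate` structure, the function name `top_candidates`, and the default parameter values are hypothetical; the minimum pedigree count of 3 mirrors the example m_i≥3 given above.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str                # birth location or surname
    enrichment_score: float  # x_i or x_j
    pedigree_count: int      # m_i: matching pedigrees containing the item


def top_candidates(candidates: List[Candidate], n: int = 10,
                   min_score: float = 0.0,
                   min_pedigrees: int = 3) -> List[Candidate]:
    """Filter by minimum thresholds, then return the top n by enrichment score."""
    eligible = [
        c for c in candidates
        if c.enrichment_score > min_score and c.pedigree_count >= min_pedigrees
    ]
    # Highest enrichment score first.
    eligible.sort(key=lambda c: c.enrichment_score, reverse=True)
    return eligible[:n]


# Hypothetical input data.
sample = [
    Candidate("Location A", 0.30, 5),
    Candidate("Location B", 0.50, 2),   # too few pedigrees
    Candidate("Location C", 0.10, 4),
    Candidate("Location D", 0.00, 10),  # score does not exceed the minimum
]
ranked = top_candidates(sample, n=2)
```

In this example, the ranked list contains Location A followed by Location C. Location B is excluded despite having the highest score because it appears in fewer than three pedigrees.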

Identifying a Birth Location or Surname

FIG. 4 illustrates a process of providing an identified birth location or surname to an individual, according to one embodiment. The birth location and surname identification system 100 receives 405 a sequence of genetic data from an individual 101. The system 100 identifies users in the system 100 that are genetic matches with the individual 101. This may be accomplished by identifying DNA segment matches on the pair of haplotypes for the individual and the pair of haplotypes for users retrieved from the user data store 145.
Each user in the system 100 has a corresponding pedigree stored in the user data store 145. From the set of all pedigrees in the user data store 145, a subset of matching pedigrees is identified. Each pedigree in the subset of matching pedigrees is associated with a user that is a genetic match of the individual 101. The system 100 determines 420 a match frequency, p, of the birth location or surname amongst the subset of matching pedigrees. Additionally, the system 100 determines 425 a background frequency, q, of the birth location or surname amongst the set of all pedigrees.
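For illustration only, the determination of the match frequency and the background frequency may be sketched as below. This sketch assumes that each pedigree is reduced to the set of surnames (or birth locations) it contains and that a frequency is simply the fraction of pedigrees in which an item appears at least once; other definitions may be used. All names and the toy data are hypothetical.

```python
from collections import Counter


def pedigree_frequencies(pedigrees):
    """Return, for each item, the fraction of pedigrees containing it at least once."""
    pedigrees = list(pedigrees)
    if not pedigrees:
        return {}
    counts = Counter()
    for items in pedigrees:
        counts.update(set(items))  # count each pedigree at most once per item
    return {item: n / len(pedigrees) for item, n in counts.items()}


def match_and_background_frequencies(all_pedigrees, matched_user_ids):
    """all_pedigrees maps a user id to the items found in that user's pedigree."""
    matching = [items for uid, items in all_pedigrees.items()
                if uid in matched_user_ids]
    p = pedigree_frequencies(matching)                # subset of matching pedigrees
    q = pedigree_frequencies(all_pedigrees.values())  # set of all pedigrees
    return p, q


# Hypothetical toy data: four users, two of whom are genetic matches.
all_pedigrees = {
    "u1": {"Smith", "Kowalski"},
    "u2": {"Smith"},
    "u3": {"Kowalski", "Nowak"},
    "u4": {"Smith", "Jones"},
}
p, q = match_and_background_frequencies(all_pedigrees, {"u1", "u3"})
```

In this toy data, the surname appearing in both matching pedigrees has a match frequency of 1.0 against a background frequency of 0.5, which is the kind of difference the statistical test is intended to detect.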
The system 100 identifies 430 the likelihood of observing the match frequency and background frequency for a birth location or surname under the assumed scenario that the match frequency and background frequency are equal. The statistical analysis module 320 of the system 100 conducts a statistical test and determines 435 an enrichment score for each birth location or surname. Based on the statistical test and the enrichment score, the system may provide 440 the birth location or surname to the individual 101.

Additional Considerations

The birth location and surname identification system 100 is implemented using one or more computers having one or more processors executing application code to perform the steps described herein, and data may be stored on any conventional non-transitory storage medium and, where appropriate, include a conventional database server implementation. For purposes of clarity and because they are well known to those of skill in the art, various components of a computer system, for example, processors, memory, input devices, network devices and the like are not shown in FIG. 1. In some embodiments, a distributed computing architecture is used to implement the described features.
In addition to the embodiments specifically described above, those of skill in the art will appreciate that the invention may additionally be practiced in other embodiments. Within this written description, the particular naming of the components, capitalization of terms, the attributes, data structures, or any other programming or structural aspect is not mandatory or significant unless otherwise noted, and the mechanisms that implement the described invention or its features may have different names, formats, or protocols. Further, the system may be implemented via a combination of hardware and software, as described, or entirely in hardware elements. Also, the particular division of functionality between the various system components described here is not mandatory; functions performed by a single module or system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component. Likewise, the order in which method steps are performed is not mandatory unless otherwise noted or logically required. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
Algorithmic descriptions and representations included in this description are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules or code devices, without loss of generality.
Unless otherwise indicated, discussions utilizing terms such as “selecting” or “determining” or “estimating” or the like refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.

Claims

1. A method comprising:

receiving a genetic dataset of an individual;

identifying a set of related individuals who are related to the individual based on the genetic dataset, wherein the set of related individuals are identified through an Identity-By-Descent (IBD) estimation based on shared segments of DNA data between the genetic dataset of the individual and genetic datasets of the set of related individuals wherein the shared segments are from phased genetic data of the individual and the set of related individuals;

identifying a genetic group associated with the individual, the genetic group containing the set of related individuals;

identifying a surname associated with the genetic group, wherein the surname is determined to be a significant surname to the genetic group based on a frequency of the surname in the genetic group; and

outputting the surname.

2. The method of claim 1, wherein the genetic group corresponds to a geographic region.

3. The method of claim 1, wherein the surname is associated with a significance score that indicates a significance level of the surname to the genetic group.

4. The method of claim 3, wherein the significance score is determined based on a match frequency and a background frequency associated with the surname.

5. The method of claim 1, further comprising:

identifying one or more additional surnames that are significant to the genetic group;

outputting the one or more additional surnames.

6. The method of claim 5, wherein the one or more additional surnames are each associated with a significance score to the genetic group and the one or more additional surnames are outputted in an order based on the significance scores.

7. The method of claim 1, wherein the phased genetic data comprises a pair of haplotypes for the individual.

8. The method of claim 7, wherein the set of related individuals are IBD matches based on the pair of haplotypes for the individual and haplotypes of the set of related individuals.

9. The method of claim 1, wherein identifying a surname associated with the genetic group comprises:

accessing one or more pedigrees of the set of related individuals, each pedigree comprising a genealogical graph of relatives for a member of the set of related individuals;

identifying a frequency of the surname in the genetic group and in the one or more pedigrees; and

determining the surname is significant based on the frequency.

10. A non-transitory computer-readable medium comprising computer program code, the computer program code when executed by a processor causing the processor to perform steps comprising:

receiving a genetic dataset of an individual;

outputting the surname.

11. The non-transitory computer-readable medium of claim 10, wherein the genetic group corresponds to a geographic region.

12. The non-transitory computer-readable medium of claim 10, wherein the surname is associated with a significance score that indicates a significance level of the surname to the genetic group.

13. The non-transitory computer-readable medium of claim 12, wherein the significance score is determined based on a match frequency and a background frequency associated with the surname.

14. The non-transitory computer-readable medium of claim 10, wherein the steps further comprise:

outputting the one or more additional surnames.

15. The non-transitory computer-readable medium of claim 14, wherein the one or more additional surnames are each associated with a significance score to the genetic group and the one or more additional surnames are outputted in an order based on the significance scores.

16. The non-transitory computer-readable medium of claim 10, wherein the phased genetic data comprises a pair of haplotypes for the individual.

17. The non-transitory computer-readable medium of claim 16, wherein the set of related individuals are IBD matches based on the pair of haplotypes for the individual and haplotypes of the set of related individuals.

18. The non-transitory computer-readable medium of claim 10, wherein identifying a surname associated with the genetic group comprises:

determining the surname is significant based on the frequency.

19. A computer system comprising:

one or more processors; and

a non-transitory computer readable storage medium storing instructions that, when executed by the one or more processors, cause the one or more processors to perform steps comprising:

receiving a genetic dataset of an individual;

outputting the surname.

20. The system of claim 19, wherein identifying a surname associated with the genetic group comprises steps:

determining the surname is significant based on the frequency.