US20210183474A1 - Genetic and genealogical analysis for identification of birth location and surname information - Google Patents

Info

Publication number
US20210183474A1
Authority
US
United States
Prior art keywords
surname
genetic
individual
frequency
related individuals
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/184,451
Inventor
Amir R. Kermany
Julie M. Granka
Keith D. Noto
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ancestry com DNA LLC
Original Assignee
Ancestry com DNA LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ancestry com DNA LLC
Priority to US17/184,451
Assigned to ANCESTRY.COM DNA, LLC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NOTO, Keith D., GRANKA, Julie M., KERMANY, Amir R.
Publication of US20210183474A1
Assigned to WILMINGTON TRUST, NATIONAL ASSOCIATION, AS NOTES COLLATERAL AGENT. PATENT SECURITY AGREEMENT. Assignors: ANCESTRY.COM DNA, LLC, ANCESTRY.COM OPERATIONS INC.
Assigned to CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH, AS COLLATERAL AGENT. PATENT SECURITY AGREEMENT. Assignors: ANCESTRY.COM DNA, LLC, ANCESTRY.COM OPERATIONS INC.

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/123 DNA computing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis

Definitions

  • This description generally relates to population analyses on human genetic and genealogical information, and particularly to using that information to identify ancestral birth locations or ancestral surnames for an individual.
  • Families may have genealogical pedigrees or family trees that may be verbally passed down from generation to generation. However, these genealogical family trees become inaccurate as they are passed along or may be missing the birth location or surnames of past ancestors altogether. Therefore, an individual often cannot rely on genealogical data provided by a family member to identify ancestral birth locations or surnames.
  • Described embodiments identify likely birth locations and surnames of an individual's ancestors based on the individual's genotype, genotypes of a population of users who are genetic matches to the individual, and genealogical data (e.g. pedigree or family tree) of those matches. Note that no genealogical data for the individual is necessary for this identification to be performed.
  • a birth location and surname identification system receives a genetic sample from the individual.
  • the individual's genetic sample is sequenced and is analyzed to identify users in the system who are genetic matches to the individual. At least some of those genetically matched users will have an associated pedigree that identifies birth locations and/or surnames of their ancestors.
  • a computer system determines the frequency of appearance of a birth location or surname amongst the pedigrees of the genetically matched users and further determines whether that frequency of appearance is of statistical significance.
  • the system performs a statistical test to prevent recommending birth locations or surnames that may be disproportionally represented. If the frequency of appearance of the birth location or surname is deemed statistically significant, the system may present it to the individual as a recommended ancestral birth location or surname.
  • FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment.
  • FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic and genealogical data associated with users of the system in accordance with an embodiment.
  • FIG. 3 is a flow diagram for the operation of the birth location and surname identification module, in accordance with an embodiment.
  • FIG. 4 is a flow diagram for identifying ancestral birthplace or surname identifications for an individual, in accordance with an embodiment.
  • FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment. Depicted in FIG. 1 are an individual 101 (i.e. a human or other organism), a DNA extraction service 102 , a birth location and surname identification system 100 , a network 120 , and a client device 160 .
  • an individual 101 uses a sample collection kit to provide a DNA sample, e.g., saliva, from which genetic data can be reliably extracted according to conventional DNA processing techniques.
  • DNA extraction service 102 receives the sample and estimates genotypes from the genetic data, for example by extracting the DNA from the sample and identifying genotype values of single nucleotide polymorphisms (SNPs) present within the DNA. The result in this example is a diploid genotype for each SNP.
  • the birth location and surname identification system 100 receives the genetic data from DNA extraction service 102 and stores the genetic data in a DNA sample store 140 containing DNA diploid genotypes.
  • the genetic data stored in the DNA sample store 140 may be associated with the individual 101 in the user data store 145 via one or more pointers.
  • Identifying ancestral birth locations or surnames that may be associated with a given individual involves analyzing genealogical information of other individuals that are genetic matches with the individual. To determine the genetic matches, analysis of identity-by-descent (IBD) is used. IBD analysis can be used to identify the familial relationship between any two people (e.g., second cousins) in a population as long as the relationship is due to shared common ancestors from the recent past (e.g., on the order of several hundred years). To date, IBD analysis has not been successfully used to accurately identify ancestral birth locations or surnames from an individual's genetic data.
  • the birth location and surname identification system 100 includes an input data processing module 110 that processes the DNA to identify shared segments of DNA data between the individual 101 and a number of other users whose DNA is already stored by the system.
  • An IBD estimation module 115 uses the shared segments of DNA to identify those other users, known in the user data store 145 and whose genetic data is stored in the DNA sample store 140, who are matches to the individual.
  • the birth location and surname identification module 300 uses the match information to access genealogical data of the individual's 101 genetic matches in order to identify possible surnames and birth locations for the ancestors of the individual 101 .
  • the breakdown of the logical functions of the system 100 into the above-introduced modules is for clarity of description only.
  • the computer system 100 may comprise more or fewer modules, and the logical structure may be differently organized.
  • the data stores may be represented in different ways in different embodiments, such as comma-separated text files, or as databases such as relational databases (SQL) or non-relational databases (NoSQL).
  • the network 120 facilitates communications amongst one or more client devices 160 and the system 100 .
  • the network 120 may be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet.
  • the network 120 uses standard communication technologies and/or protocols. Examples of technologies used by the network 120 include Ethernet, 802.11, 3G, 4G, 802.16, or any other suitable communication technology.
  • the network 120 may use wireless, wired, or a combination of wireless and wired communication technologies. Examples of protocols used by the network 120 include transmission control protocol/Internet protocol (TCP/IP), hypertext transport protocol (HTTP), simple mail transfer protocol (SMTP), file transfer protocol (FTP), or any other suitable communication protocol.
  • a client device 160 is a computing device capable of transmitting and/or receiving data via the network 120 .
  • the client device 160 belongs to the individual 101 that provided the genetic sample.
  • client devices 160 include desktop computers, laptop computers, tablet computers (pads), mobile phones, personal digital assistants (PDAs), gaming devices, or any other electronic device including computing functionality and data communication capabilities.
  • the client device 160 may use a web browser 180, such as Microsoft Internet Explorer, Mozilla Firefox, Google Chrome, Apple Safari and/or Opera, as an interface to connect with the network 120. Additionally or alternatively, specialized application software 180 that runs natively on a mobile device is used as an interface to connect to the network 120.
  • the birth location and surname identification system 100 sends, through the network 120 , a list of surnames or birthplaces to the client device 160 identified by system 100 for presentation to the individual 101 .
  • the list of surnames or birthplaces may be presented by the client device 160 on a user interface or a display screen.
  • a new user to the system 100 who is submitting their DNA among other data will activate a new account, often through a graphical user interface (GUI) provided through a mobile software application or a web-based interface.
  • the system 100 receives one or more types of basic personal information about the individual 101 such as age, date of birth, geographical location of birth (e.g., city, state, county, country, hospital, etc.), complete name including first, middle, and last names as well as any suffixes, and gender.
  • This received user information is stored in the user data store 145 , in association with the corresponding DNA samples stored in the DNA sample store 140 .
  • the computing system 100 comprises an input data processing module 110 , and an IBD estimation module 115 . These modules are described in relation with FIG. 2 which is a flow diagram for the operation of the computer system 100 for estimating and storing estimated IBD in accordance with an embodiment.
  • FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic, genealogical and survey input data.
  • the input data processing module 110 is responsible for receiving, storing and processing data received from an individual 101 via the DNA extraction service 102 .
  • the input data processing module 110 includes a DNA collection module 210 , a genealogical collection module 220 , genotype identification module 240 , and a genotype phasing module 250 .
  • the DNA collection module 210 is responsible for receiving sample data from external sources (e.g., extraction service 102 ), processing and storing the samples in the DNA sample store 140 .
  • the DNA sample store 140 may store one or more received DNA samples linked to a user as a <key, value> pair associated with the individual 101.
  • the <key, value> pair is <sampleID, “GA TC TC AA”>.
  • the data stored in the DNA sample store 140 may be identified by one or more keys used to index one or more values associated with an individual 101.
  • keys are a userID and sampleID, or alternatively another <key, value> pair is <userID, sampleID>.
  • the DNA sample store 140 stores a pointer to a location associated with the user data store 145 associated with the individual 101 .
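The key/value layout described above can be illustrated with a minimal in-memory sketch. The class name `DnaSampleStore`, its field names, and the identifiers `user-1` and `sample-42` are hypothetical and chosen only for illustration; the description does not specify a storage engine or schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DnaSampleStore:
    """Toy stand-in for a DNA sample store (hypothetical layout, not the patent's schema)."""

    # <sampleID, genotype string> pairs, e.g. <"sample-42", "GA TC TC AA">
    genotypes_by_sample: Dict[str, str] = field(default_factory=dict)
    # <userID, [sampleID, ...]> pairs linking a user to their samples
    samples_by_user: Dict[str, List[str]] = field(default_factory=dict)

    def add_sample(self, user_id: str, sample_id: str, genotypes: str) -> None:
        """Store the genotypes under the sample key and link the sample to the user."""
        self.genotypes_by_sample[sample_id] = genotypes
        self.samples_by_user.setdefault(user_id, []).append(sample_id)

    def genotypes_for_user(self, user_id: str) -> List[str]:
        """Follow userID -> sampleID -> genotypes; unknown users yield an empty list."""
        sample_ids = self.samples_by_user.get(user_id, [])
        return [self.genotypes_by_sample[s] for s in sample_ids]


store = DnaSampleStore()
store.add_sample("user-1", "sample-42", "GA TC TC AA")
```

Keeping the two mappings separate mirrors the two kinds of pairs mentioned in the text: one keyed by sampleID and one keyed by userID.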
  • the user data store 145 will be further described below.
  • the genealogical collection module 220 both receives and processes genealogical data and stores the data in the user data store 145 . This data may be received for the individual 101 , and may have been received in the past for other users of the system, some of whom may be determined to be genetic matches to the individual 101 .
  • the genealogical data may include a variety of different types of information.
  • the genealogical data can take the form of a pedigree of a user (e.g., the recorded relationships in a family).
  • the genealogical collection module 220 may be configured to provide an interactive GUI that asks the user questions or provides a menu of options, and receives user input that can be processed to obtain the genealogical data.
  • Examples of genealogical data that may be collected include, but are not limited to, names (first, last, middle, suffixes), birth locations (e.g., county, city, state, country, hospital, global map coordinates), date of birth, date of death, marriage information, family relations (manually provided rather than genetically identified), etc.
  • the pedigree information associated with a user may include a genealogical graph.
  • the genealogical graph may include one or more specified nodes. Each specified node in the genealogical graph represents either the user or an ancestor of the user that could have passed down genetic material to the user.
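As a rough illustration of such a genealogical graph, the sketch below stores each node's recorded parents in a dictionary and walks upward to collect every ancestor who could have passed down genetic material. The dictionary representation, the function name `ancestors`, and the node labels are assumptions made for this example only.

```python
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], person: str) -> Set[str]:
    """Return every node reachable from `person` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(person, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # a shared ancestor may be reachable along several paths
        seen.add(node)
        stack.extend(parents.get(node, []))
    return seen


# Hypothetical pedigree: each key maps to the parents recorded for that node.
pedigree = {
    "user": ["mother", "father"],
    "mother": ["maternal_grandmother", "maternal_grandfather"],
    "father": ["paternal_grandmother"],
}
```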
  • the pedigree information provided by users may or may not be accurate or complete.
  • the genealogical collection module 220 is responsible for filtering the received pedigree data based on one or more quality criteria in an effort to discard lower quality genealogical data. For example, the genealogical collection module 220 may filter the received pedigree data by excluding all pedigree nodes associated with a stored DNA sample that do not satisfy all of the following criteria: (1) the recorded death date for the linked pedigree node corresponds to official records (when available); (2) the gender is the same as the gender provided by the user; and (3) the birth date is within 3 years of the birth date provided by the user. The user may be prompted via GUI to resolve any discrepancies identified by module 220. In some embodiments, all received genealogical data marked as “private” are excluded from any subsequent analysis to ensure that any privacy requirements imposed on the data are met.
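The three quality criteria and the privacy exclusion can be sketched as a single predicate. The types `LinkedNode` and `UserProfile`, the function `passes_quality_filter`, and the approximation of "3 years" as 3 × 365 days are all assumptions for illustration; the description does not state how dates are compared or how official records are looked up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LinkedNode:
    """Pedigree node linked to a stored DNA sample (hypothetical structure)."""
    gender: str
    birth_date: date
    death_date: Optional[date] = None
    is_private: bool = False


@dataclass
class UserProfile:
    """Personal information the user supplied at sign-up (hypothetical structure)."""
    gender: str
    birth_date: date


def passes_quality_filter(
    node: LinkedNode,
    profile: UserProfile,
    official_death_date: Optional[date] = None,
    max_birth_gap_days: int = 3 * 365,  # assumption: "within 3 years" measured in days
) -> bool:
    """Return True only if the node is public and satisfies all three criteria."""
    if node.is_private:
        return False
    # (1) death date must agree with the official record, when one is available
    if official_death_date is not None and node.death_date != official_death_date:
        return False
    # (2) gender must match what the user provided
    if node.gender != profile.gender:
        return False
    # (3) birth date must be close to the user-provided birth date
    return abs((node.birth_date - profile.birth_date).days) <= max_birth_gap_days


profile = UserProfile(gender="F", birth_date=date(1980, 5, 1))
```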
  • the genotype identification module 240 accesses the collected DNA data from the DNA collection module 210 or the sample store 140 and identifies autosomal SNPs so that the individual's diploid genotype on autosomal chromosomes can be computationally phased.
  • the genotype identification module 240 provides the identified SNPs to the genotype phasing module 250 which phases the individual's diploid genotype based on the set of identified SNPs.
  • the genotype phasing module 250 generates a pair of estimated haplotypes for each diploid genotype.
  • the estimated haplotypes are then stored in the user data store 145 in association with the individual 101 , and may also be stored in association with or verified against the genotypes of the individual's parents, who may also have their own separate accounts in the system 100 .
  • a variety of different computational phasing techniques may be used including, for example, the techniques described in U.S. Patent Application No. 2016/061,568, filed on Jan. 17, 2014, which is hereby incorporated by reference in its entirety.
  • the phasing module 250 stores phased genotypes in the user data store 145 .
  • the IBD estimation module 115 is responsible for identifying IBD segments (also referred to as IBD estimates) from phased genotype data (haplotypes) between the individual 101 and a user stored in the user data store 145 .
  • IBD segments are chromosome segments identified in the individual 101 and a user that are putatively inherited from a recent common ancestor.
  • an individual 101 and a user who are closely related share a relatively large number of IBD segments, and the IBD segments tend to have greater length (individually or in aggregate across one or more chromosomes).
  • an individual 101 and a user who are more distantly related share relatively few IBD segments, and these segments tend to be shorter (individually or in aggregate across one or more chromosomes).
  • the IBD estimation algorithm used by the IBD estimation module 115 to estimate (or infer) IBD segments between an individual 101 and a user is as described in U.S. patent application Ser. No. 14/029,765, filed on Sep. 17, 2013, which is hereby incorporated by reference in its entirety. A further processing step may be performed on these inferred IBD segments by applying the technique described in PCT Patent Application No. PCT/US2015/055579, filed on Oct. 14, 2015, which is hereby incorporated by reference in its entirety.
  • the identified IBD segments are stored in the user data store 145 in association with the individual 101 .
  • the IBD estimation module 115 is configured to estimate IBD segments between the individual 101 and large numbers of users stored in the user data store 145 .
  • the computing system has been optimized to efficiently handle large amounts of IBD data. Said another way, IBD is estimated across a large number of individuals based on their DNA.
  • the IBD estimation module 115 (and computing system 100 generally) distributes IBD computations over a Hadoop computing cluster, internal to or external from computing system 100 , and stores the phased genotypes used in the IBD computations in a database so that IBD estimates for new accounts/individuals can be quickly compared to previously processed individuals.
  • FIG. 3 depicts the birth location and surname identification module 300 , in accordance with an embodiment.
  • the birth location and surname identification module 300 includes a genetic match module 305 , a location frequency calculator 310 , a surname frequency calculator 315 , a statistical analysis module 320 , an enrichment score module 325 , a list generation module 330 , a location frequency store 350 , a surname frequency store 360 , a location score store 380 , and a surname score store 390 .
  • the genetic match module 305 retrieves the IBD estimates between the individual 101 and the users in the user data store 145 and determines whether the individual 101 and any given other user are a genetic match.
  • the individual 101 and a user are a match if they have higher than a threshold amount of IBD segment sharing, as determined by the IBD estimation module 115 .
  • a match may indicate that the individual 101 and the user are related (e.g. parent/child, sibling, aunt/uncle, first cousin, first cousin once removed, second cousin, second cousin once removed).
  • the genetic match module 305 identifies all users in the user data store 145 that are considered matches to the individual 101 .
  • the set of user matches is referred to herein by M u . In an example, the number of matches is limited to the top 3000 matches (i.e.
  • the genetic match module 305 provides the set of user genetic matches, M u , to the location frequency calculator 310 to determine an identification of possible birth locations associated with the user.
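A minimal sketch of the matching step follows, assuming that IBD sharing between the individual and each user has already been summarized as one number per user. The function name `genetic_matches`, the threshold value of 20.0, and the optional `top_n` cap are hypothetical; the description only says that a match requires more than a threshold amount of IBD sharing and that the number of matches may be limited.

```python
from typing import Dict, List, Optional


def genetic_matches(
    shared_ibd: Dict[str, float],
    min_shared: float = 20.0,      # hypothetical threshold on total shared IBD
    top_n: Optional[int] = None,   # optional cap on the size of the match set
) -> List[str]:
    """Return user IDs whose IBD sharing exceeds the threshold, strongest first."""
    ranked = sorted(
        ((user, amount) for user, amount in shared_ibd.items() if amount > min_shared),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if top_n is not None:
        ranked = ranked[:top_n]
    return [user for user, _ in ranked]


# Hypothetical per-user totals of IBD sharing with the individual.
shared = {"a": 3400.0, "b": 15.0, "c": 220.0, "d": 45.0}
```

Sorting before truncating means that, when a cap is applied, the most closely related users are the ones retained in the match set.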
  • the location frequency calculator 310 determines how frequently a particular birth location appears amongst the pedigrees of the users within the set of user matches, M u . To do this, the location frequency calculator 310 retrieves, for each matching user in the set of user matches M u , the matching user's pedigree.
  • the pedigree includes a genealogical graph from the user data store 145 .
  • a genealogical graph in the pedigree may be the matching user's family tree that describes the relationship between the matching user and each of the matching user's relatives.
  • Each relative in the matching user's pedigree has associated genealogical data such as the relative's birth location.
  • T v denotes a set of birth locations indicated in matching user v's pedigree.
  • the location frequency calculator 310 identifies the set of birth locations, T v , in the matching user's pedigree.
  • For example, the location frequency calculator 310 may identify a matching user v as having 10 relatives born in New York City, 2 relatives born in Boston, and 2 relatives born in Los Angeles. Therefore, the elements in T v include New York City, Boston, and Los Angeles.
  • a presence indicator a v,i may be represented by the indicator function representing whether a birth location i is indicated in matching user v's pedigree:
  • $$a_{v,i} = \begin{cases} 1 & \text{if } i \in T_v \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
  • matching user v has an indicator function score of 1 for birth locations of New York City, Boston, and Los Angeles. All other birth locations (e.g. Washington D.C., elsewhere) would have a presence indicator and corresponding indicator function score of 0.
  • the location frequency calculator 310 repeats this process for the set of matches M u .
  • the total number of pedigrees (m i ) of users in the set of matches that have this birth location is determined according to: $$m_i = \sum_{v \in M_u} a_{v,i} \qquad (2)$$
  • the location frequency calculator 310 summates the indicator function score for each birth location.
  • the maximum number of pedigrees (max(m i )) that have the birth location equals the number of users in the set of matches M u (e.g., 1000 when M u contains 1000 users), which would occur if every user in the set of matches M u has the birth location in their pedigree.
  • the location frequency calculator 310 uses the total number of pedigrees, m i , to determine p i , the match frequency of a birth location i, where p i is determined according to: $$p_i = \frac{m_i}{|M_u|} \qquad (3)$$
  • the match frequency of a birth location represents how often matching users in the set of matches M u are associated with the birth location, which can be used as a way of determining an association between the ancestors of the individual 101 and the birth location. This match frequency is stored in the location frequency store 350 .
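The match-frequency computation can be sketched as follows: each matching user contributes at most one count per birth location (the presence indicator), and the counts are divided by the number of matching users. The function name `match_frequencies` and the sample data are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def match_frequencies(birthplaces_by_match: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Map each birth location i to p_i = m_i / |M_u|.

    `birthplaces_by_match` maps a matching user v to the birth locations of the
    relatives in v's pedigree; repeated locations count once per pedigree.
    """
    n_matches = len(birthplaces_by_match)
    pedigree_counts: Counter = Counter()
    for locations in birthplaces_by_match.values():
        pedigree_counts.update(set(locations))  # T_v: presence, not multiplicity
    return {loc: m / n_matches for loc, m in pedigree_counts.items()}


# Hypothetical match set of four users.
matches = {
    "v1": ["New York City"] * 10 + ["Boston"] * 2 + ["Los Angeles"] * 2,
    "v2": ["New York City"],
    "v3": ["Boston", "Cheyenne"],
    "v4": ["New York City"],
}
p = match_frequencies(matches)
```

In this example New York City appears in three of four pedigrees, so its match frequency is 0.75 even though user v1 alone lists ten relatives born there.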
  • the location frequency calculator 310 also calculates a background frequency for each birth location i.
  • the background frequency of a birth location provides an indication as to how often the birth location appears amongst the greater population of users stored in the system, including those who are not matches to the individual. For example, high population cities such as New York City or Boston may have higher background frequencies than smaller cities such as Cheyenne, Wyo.
  • D represents the total set of users in the system. Generally the number of users in set D is significantly larger (e.g. multiple orders of magnitude larger) than the number of matching users in the set of matches M u . Each user in D may have a corresponding pedigree. Altogether, this forms the set of all pedigrees stored in the user data store 145 .
  • the location frequency calculator 310 may use a similar indicator function as was previously shown in equation (1) to calculate whether a birth location i, exists in the pedigree corresponding to user w in the set D:
  • $$a_{w,i} = \begin{cases} 1 & \text{if birth location } i \text{ exists in user } w\text{'s pedigree} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
  • the location frequency calculator 310 summates the total number of pedigrees that each have the birth location, each pedigree corresponding to a user w in the set D. To calculate the background frequency, the location frequency calculator 310 divides the summated total number of users in the set D that have the birth location by the total number of users in the set D. Therefore, the background frequency of a birth location, i, is expressed in equation (5) as $$q_i = \frac{\sum_{w \in D} a_{w,i}}{|D|} \qquad (5)$$
  • This background frequency is stored in the location frequency store 350 .
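The background frequency follows the same pattern as the match frequency, but it is taken over every user in the system rather than over the match set. The sketch below is an assumption-laden illustration: `background_frequency` and the five-user population are hypothetical, and a real system with a very large set D would presumably precompute these values.

```python
from typing import Dict, Set


def background_frequency(pedigree_locations: Dict[str, Set[str]], location: str) -> float:
    """Fraction of all users in D whose pedigree contains `location`."""
    if not pedigree_locations:
        return 0.0
    hits = sum(1 for locations in pedigree_locations.values() if location in locations)
    return hits / len(pedigree_locations)


# Hypothetical full user set D, keyed by user ID.
all_users = {
    "w1": {"New York City", "Boston"},
    "w2": {"Cheyenne"},
    "w3": {"New York City"},
    "w4": {"Los Angeles"},
    "w5": {"Boston"},
}
q_nyc = background_frequency(all_users, "New York City")
```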
  • the surname frequency calculator 315 calculates a match frequency and background frequency for each surname in the pedigrees of matching users in a similar fashion as was discussed for birth locations in section III.b.
  • the surname frequency calculator 315 receives the set of user matches, M u , from the genetic match module 305 for an individual 101 and determines a match frequency, p j , that represents how often a given surname, j (e.g. Bradley, O'Malley, Johnson), appears amongst the pedigrees of users, v, in the set of user matches, M u .
  • the surname frequency calculator also calculates the background frequency for the surname “Bradley” in the total set of users in the system, D.
  • the match frequency, p j , and background frequency, q j may each be stored in the surname frequency store 360 .
  • the surname frequency calculator 315 may first normalize surnames, meaning that it may treat many alternate spellings as the same surname for purposes of frequency calculations. Examples of such alternate spellings include characters not used in English (e.g., “o” versus a variant of that letter carrying a diacritic); capitalization, punctuation, and spacing (“O'Malley” versus “Omalley”); suffixes (“Jr.”); and commentary (“Johnson (WWII Veteran)”). In one case, a simple normalization is performed that ignores capitalization and punctuation and removes commentary, thereby reducing the set of surnames under consideration. Under such a simple normalization, remaining alternate spellings and misspellings may be interpreted by the surname frequency calculator 315 as different surnames.
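The simple normalization (ignore capitalization and punctuation, remove commentary) might look like the sketch below. The function name `normalize_surname` and the choice to treat parenthesized text as commentary are assumptions; suffixes and diacritics are deliberately left untouched, since the simple normalization described above does not cover them.

```python
import re


def normalize_surname(raw: str) -> str:
    """Lowercase a surname after dropping parenthesized commentary and punctuation."""
    without_commentary = re.sub(r"\([^)]*\)", "", raw)           # "(WWII Veteran)" -> ""
    without_punct = re.sub(r"[^\w\s]", "", without_commentary)   # "O'Malley" -> "OMalley"
    return " ".join(without_punct.split()).casefold()
```

With this sketch, "O'Malley" and "Omalley" collapse to the same key, while a misspelling such as "Omally" would still be counted as a different surname.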
  • the statistical analysis module 320 identifies which birth locations and surnames are sufficiently notable for the individual 101 under consideration so as to merit possibly providing to the individual 101 as likely being associated with their own ancestors.
  • For example, there may be a total of 1000 users in the set of matches, M u .
  • A match frequency, p i , of 10% may appear to be a very high number of appearances for a birth location.
  • However, if the background frequency, q i , is also close to 10%, meaning that the birth location appears approximately equally frequently in the pedigrees of all users in the system, then a match frequency, p i , of 10% may not be sufficiently notable to be worth identifying as associated with the individual.
  • the statistical analysis module 320 receives the match frequency, p i , and background frequency, q i , for all different birth locations, i, from the location frequency calculator 310 .
  • the statistical analysis module 320 conducts a statistical analysis test to determine whether the match frequency of a given birth location is sufficiently notable. For each birth location, i, the statistical analysis module 320 determines the likelihood of observing the received match frequency, p i , and background frequency, q i under a null hypothesis H 0 scenario.
  • the alternative hypothesis H 1 is the assumed scenario where the match frequency and background frequency are non-equal (i.e. p i ≠ q i ), with the assumption that if, particularly, p i >q i , then p i , and thus i, may be statistically significant and therefore worth possibly providing to the individual.
  • the statistical analysis module 320 determines the likelihood of observing the received match frequency and background frequency under the assumption that the match frequency, p i , and background frequency, q i , are equal. However, if the received match frequency is sufficiently larger than the received background frequency, then the null hypothesis H 0 is rejected in favor of the alternative hypothesis H 1 . What constitutes a sufficient difference between the received match frequency, p i , and background frequency, q i , will be discussed further below in regards to the summary statistic S i .
  • a similar calculation may be performed for surnames by receiving the match frequency, p j , and background frequency, q j , for all surname identifications, j, from the surname frequency calculator 315 .
  • the subsequent discussion focuses on conducting a statistical test for a birth location, i. This discussion may also refer to conducting a statistical test for a surname.
  • the statistical test is performed under a null hypothesis H 0 , the assumed scenario where the match frequency, p i , and the background frequency, q i , are equal.
  • the statistical analysis module 320 conducts a maximum likelihood ratio test.
  • the statistical analysis test may be a Pearson's chi-squared test, a Z-test, or a F-test.
  • the test statistic, Λ, for the maximum likelihood ratio test is determined according to:
  • H 1 ) denotes the likelihood of observing m i under the alternative hypothesis when varying p between 0 and 1.
  • the test statistic is a ratio between a first likelihood of observing the match frequency and background frequency under the null hypothesis and a second likelihood of observing the match frequency and background frequency under the alternative hypothesis.
  • a summary statistic, S i , is determined using Λ according to:
  • the statistical analysis module 320 calculates a summary statistic for each birth location i. Note that if the match frequency, p i , and background frequency, q i , for a birth location received from the location frequency calculator 310 are equal, then the value of the summary statistic is zero. Additionally, the summary statistic S i increases in magnitude as the difference between the match frequency, p i , and background frequency, q i , increases in magnitude.
  • the statistical analysis module 320 may calculate the p-value for rejecting the null hypothesis H 0 based on the first order chi-squared distribution of the summary statistic, S i .
  • the null hypothesis is rejected if S i >4 (or 2S i >8) based on the first order chi-squared distribution.
  • if the match frequency is sufficiently larger than the background frequency for a particular birthplace or surname such that the summary statistic S i >4, then the alternative hypothesis (i.e. where the match frequency does not equal the background frequency) is accepted. This indicates that the particular birthplace or surname is sufficiently notable to be associated with the ancestors of the individual 101 .
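  • The test described above can be sketched in code. The following is a minimal illustration only, assuming a binomial model in which m i of the |M u | matching pedigrees contain the birth location, the null hypothesis fixes the frequency at the background value q i , and the alternative hypothesis uses the maximum-likelihood estimate m i /|M u |. The function names, the binomial model, and the exact form of the statistic are assumptions, since the equations for the test statistic and summary statistic are not reproduced in this text. The one-sided check (match frequency above background) follows the description above.

```python
import math


def summary_statistic(m: int, n: int, q: float) -> float:
    """Log-likelihood ratio for m of n matching pedigrees containing a location.

    Assumption: binomial model. H0 fixes the frequency at the background q;
    H1 uses the maximum-likelihood estimate p = m / n.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("need n > 0 and 0 <= m <= n")
    p = m / n

    def term(count: int, num: float, den: float) -> float:
        # 0 * log(0) is taken as 0 so that m = 0 and m = n are handled.
        return 0.0 if count == 0 else count * math.log(num / den)

    return term(m, p, q) + term(n - m, 1.0 - p, 1.0 - q)


def is_enriched(m: int, n: int, q: float, threshold: float = 4.0) -> bool:
    """Reject H0 only when the match frequency exceeds the background."""
    return m / n > q and summary_statistic(m, n, q) > threshold
```

  • Under these assumptions the statistic is zero when the match frequency equals the background frequency and grows as the two diverge, which agrees with the behavior described for S i .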
  • the exact value of the significance level may vary by implementation, or according to more specific factors. Also, although the above embodiment describes the significance level as being a p-value, in practice it may be any threshold which determines whether or not a particular birth location i or surname j is sufficiently statistically significant to merit consideration for providing to the individuals.
  • the statistical analysis module 320 may adjust the significance level (e.g., p-value) for a birth location i, based on the country of origin of the birth location.
  • the country of origin of a birth location i is determined based on the latitude and longitude coordinates associated with the birth location. More specifically, the particular significance level for a country of origin is chosen based on the number of users in the database associated with that country in their respective pedigrees and the number of matches that a given individual has in the database that are annotated with a pedigree attached to them.
  • a birth location that derives from a country having a large number of users associated with that country may utilize a relatively high significance level (e.g., 0.995), whereas a birth location that derives from a country having relatively few users associated with that country (e.g., Mexico, Russia, Eastern Europe) may utilize a relatively lower significance level (e.g., 0.9).
  • the threshold needed to determine whether the difference between the birth location match frequency versus the corresponding background frequency is statistically significant may vary. This allows the module 300 to better take into account the relative availability of data regarding a particular country in determining whether or not particular birth locations are statistically significant.
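  • One way to picture a country-dependent threshold is sketched below. It is an illustration under stated assumptions, not the described implementation: the quoted levels (e.g., 0.9) are read here as quantile levels of the first order chi-squared distribution, the resulting cutoff is assumed to be compared against a chi-squared-scaled statistic such as 2S i , and the function names, the mapping `levels`, and the default level are all hypothetical.

```python
from statistics import NormalDist
from typing import Dict


def chi2_threshold(level: float) -> float:
    """Quantile of the chi-squared distribution with one degree of freedom.

    Uses the identity that a chi-squared variable with one degree of freedom
    is the square of a standard normal variable.
    """
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + level) / 2.0)
    return z * z


def threshold_for_country(country: str,
                          levels: Dict[str, float],
                          default_level: float) -> float:
    """Look up a (hypothetical) per-country level and convert it to a cutoff."""
    return chi2_threshold(levels.get(country, default_level))
```

  • A lower level yields a smaller cutoff, so a birth location in a sparsely represented country needs less evidence to be treated as significant.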
  • the statistical analysis module 320 determines which birth locations and surnames are statistically significant given the information known about them from the underlying pedigree data from users genetically matched to an individual 101 .
  • the enrichment score module 325 uses this binary determination of statistical significance to determine an enrichment score representing a strength of association between the birth location or surname and the ancestors of the individual 101 .
  • the enrichment score module 325 determines an enrichment score, x i , for each birth location, i, or enrichment score, x j , for each surname, j.
  • the enrichment score module 325 receives the summary statistic, S i , for each birth location or the summary statistic, S j , for each surname. Additionally, the enrichment score module 325 receives the match frequency, p i or p j , and the background frequency, q i or q j , for birth locations and surnames.
  • the enrichment score module 325 calculates the enrichment score to be:
  • the exact form of the calculation may vary in practice, particularly the significance level may vary by country as described above. Note that if the match frequency, p i , and the background frequency, q i , are not significantly different, then the enrichment score is close to zero, indicating that the particular birth location may not be very relevant to ancestors of the individual 101 . Scaling the match frequency by a factor of log p/q eliminates biases towards highly popular birth locations and surnames because they are likely to have a high background frequency (high q) as well, thereby reducing the enrichment score.
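  • The scaling described above can be illustrated with a short sketch. The form used here, p·log(p/q) gated by the significance decision, is an assumption pieced together from the surrounding description (a binary significance determination, and scaling the match frequency by log p/q); the enrichment score equation itself is not reproduced in this text, and the function name and default threshold are hypothetical.

```python
import math


def enrichment_score(p: float, q: float, s: float, threshold: float = 4.0) -> float:
    """Assumed enrichment score: p * log(p / q) when significant, else 0.

    p: match frequency, q: background frequency, s: summary statistic.
    """
    if q <= 0.0 or p <= q or s <= threshold:
        return 0.0
    return p * math.log(p / q)


# A very popular location (high p, but also high q) scores lower than a
# rarer location that is strongly over-represented among the matches.
popular = enrichment_score(p=0.50, q=0.45, s=10.0)
rare = enrichment_score(p=0.10, q=0.01, s=10.0)
```

  • In this sketch `rare` exceeds `popular` even though the popular location appears in five times as many matching pedigrees, which is the bias correction the text describes.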
  • the enrichment score module 325 calculates the enrichment score for a surname, j, in the same manner according to equation 10. In another embodiment, the enrichment score module 325 calculates the enrichment score for a surname, j, as
  • the respective enrichment scores for a birth location and surname are stored in the location score store 380 and surname score store 390 , respectively.
  • the enrichment score module 325 provides the enrichment score associated with each birth location or surname to the list generation module 330 .
  • the list generation module 330 may rank and/or provide one or more birth locations or surnames to the individual through client device 160 based on their associated enrichment scores. Exactly how the list generation module 330 provides birth locations and surnames, how many, and in what form (e.g., lists, etc.) may vary by implementation.
  • the list generation module 330 may set a minimum threshold in order for a particular birth location or surname to be recommended. For example, a birth location or surname may have to meet a certain minimum threshold enrichment score, and/or must appear in at least some number of pedigrees (e.g., m i ≥ 3) for it to be recommended.
  • the list generation module 330 generates a list 370 including only the top N birth locations or surnames by enrichment score.
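  • The filtering and ranking just described might look like the following sketch. The `Candidate` type, the function name, and the default values for N, the minimum score, and the minimum pedigree count are illustrative assumptions; the text leaves these choices to the implementation.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    name: str            # birth location or surname
    score: float         # enrichment score
    pedigree_count: int  # number of matching pedigrees containing it (m)


def top_candidates(candidates: Iterable[Candidate],
                   n: int = 10,
                   min_score: float = 0.0,
                   min_pedigrees: int = 3) -> List[str]:
    """Keep candidates that clear both minimums, then return the top n by score."""
    eligible = [c for c in candidates
                if c.score > min_score and c.pedigree_count >= min_pedigrees]
    eligible.sort(key=lambda c: c.score, reverse=True)
    return [c.name for c in eligible[:n]]
```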
  • the list 370 is sent through the network 120 to the client device 160 for consumption by the individual 101 .
  • FIG. 4 illustrates a process of providing an identified birth location or surname to an individual, according to one embodiment.
  • the birth location and surname identification system 100 receives 405 a sequence of genetic data from an individual 101 .
  • the system 100 identifies users in the system 100 that are genetic matches with the individual 101 . This may be accomplished by identifying DNA segment matches on the pair of haplotypes for the individual and the pair of haplotypes for users retrieved from the user data store 145 .
  • Each user in the system 100 has a corresponding pedigree stored in the user data store 145 . From the set of all pedigrees in the user data store 145 , a subset of matching pedigrees is identified. Each pedigree in the subset of matching pedigrees is associated with a user that is a genetic match of the individual 101 . The system 100 determines 420 a match frequency, p, of the birth location or surname amongst the subset of matching pedigrees. Additionally, the system 100 determines 425 a background frequency, q, of the birth location or surname amongst the set of all pedigrees.
  • the system 100 identifies 430 the likelihood of observing the match frequency and background frequency for a birth location or surname under the assumed scenario that the match frequency and background frequency are equal.
  • the statistical analysis module 320 of the system 100 conducts a statistical test and determines 435 an enrichment score for each birth location or surname. Based on the statistical test and the enrichment score, the system may provide 440 the birth location or surname to the individual 101 .
  • the birth location and surname identification system 100 is implemented using one or more computers having one or more processors executing application code to perform the steps described herein, and data may be stored on any conventional non-transitory storage medium and, where appropriate, include a conventional database server implementation.
  • various components of a computer system, for example, processors, memory, input devices, network devices, and the like, are not shown in FIG. 1 .
  • a distributed computing architecture is used to implement the described features.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A system identifies ancestral birth locations or surnames estimated to be associated with an individual's ancestors using an individual's genetic sample. The system identifies users who are genetic matches to the individual and determines whether and how often a birth location or surname appears in the pedigrees of those users. Birth locations or surnames that appear frequently throughout the pedigrees of genetically matching users may represent birth locations or surnames that are affiliated with the individual's ancestors. The system determines whether the frequency of appearance of a birth location or surname is statistically significant to eliminate biases for certain birth locations or surnames that appear more frequently than others. The birth location or surname may be provided to the individual based on an also-determined enrichment score.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. application Ser. No. 15/203,776, filed on Jul. 6, 2016, which claims the benefit of U.S. Provisional Application No. 62/189,422, filed Jul. 7, 2015, both of which are incorporated by reference in their entirety for all purposes.
  • BACKGROUND
  • This description generally relates to population analyses on human genetic and genealogical information, and particularly to using that information to identify ancestral birth locations or ancestral surnames for an individual.
  • An individual may often be interested in learning more about his/her ancestral history including ancestral birth locations and/or ancestral surnames. Families may have genealogical pedigrees or family trees that may be verbally passed down from generation to generation. However, these genealogical family trees become inaccurate as they are passed along or may be missing the birth location or surnames of past ancestors altogether. Therefore, an individual often cannot rely on genealogical data provided by a family member to identify ancestral birth locations or surnames.
  • SUMMARY
  • Described embodiments identify likely birth locations and surnames of an individual's ancestors based on the individual's genotype, genotypes of a population of users who are genetic matches to the individual, and genealogical data (e.g. pedigree or family tree) of those matches. Note that no genealogical data for the individual is necessary for this identification to be performed.
  • In one embodiment, to generate identifications of possible ancestral birth locations and/or surnames for an individual, a birth location and surname identification system receives a genetic sample from the individual. The individual's genetic sample is sequenced and is analyzed to identify users in the system who are genetic matches to the individual. At least some of those genetically matched users will have an associated pedigree that identifies birth locations and/or surnames of their ancestors. A computer system determines the frequency of appearance of a birth location or surname amongst the pedigrees of the genetically matched users and further determines whether that frequency of appearance is of statistical significance. In various embodiments, the system performs a statistical test to prevent recommending birth locations or surnames that may be disproportionally represented. If the frequency of appearance of the birth location or surname is deemed statistically significant, the system may present it to the individual as a recommended ancestral birth location or surname.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, and accompanying drawings, where:
  • FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment.
  • FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic and genealogical data associated with users of the system in accordance with an embodiment.
  • FIG. 3 is a flow diagram for the operation of the birth location and surname identification module, in accordance with an embodiment.
  • FIG. 4 is a flow diagram for identifying ancestral birthplace or surname identifications for an individual, in accordance with an embodiment.
  • Note that for purposes of clarity, only one of each item corresponding to a reference numeral is included in most figures, but when implemented, multiple instances of any or all of the depicted modules may be employed, as will be appreciated by those of skill in the art.
  • DETAILED DESCRIPTION
  • I. Environment Overview
  • FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment. Depicted in FIG. 1 are an individual 101 (i.e. a human or other organism), a DNA extraction service 102, a birth location and surname identification system 100, a network 120, and a client device 160.
  • Individual 101 provides a DNA sample for analysis. In one embodiment, an individual 101 uses a sample collection kit to provide a DNA sample, e.g., saliva, from which genetic data can be reliably extracted according to conventional DNA processing techniques. DNA extraction service 102 receives the sample and estimates genotypes from the genetic data, for example by extracting the DNA from the sample and identifying genotype values of single nucleotide polymorphisms (SNPs) present within the DNA. The result in this example is a diploid genotype for each SNP. The birth location and surname identification system 100 receives the genetic data from DNA extraction service 102 and stores the genetic data in a DNA sample store 140 containing DNA diploid genotypes. The genetic data stored in the DNA sample store 140 may be associated with the individual 101 in the user data store 145 via one or more pointers.
  • Identifying ancestral birth locations or surnames that may be associated with a given individual involves analyzing genealogical information of other individuals that are genetic matches with the individual. To determine the genetic matches, analysis of identity-by-descent (IBD) is used. IBD analysis can be used to identify the familial relationship between any two people (e.g., second cousins) in a population as long as the relationship is due to shared common ancestors from the recent past (e.g., on the order of several hundred years). To date, IBD analysis has not been successfully used to accurately identify ancestral birth locations or surnames from an individual's genetic data.
  • To perform IBD analysis, the birth location and surname identification system 100 includes an input data processing module 110 that processes the DNA to identify shared segments of DNA data between the individual 101 and a number of other users whose DNA is already stored by the system. An IBD estimation module 115 uses the shared segments of DNA to identify those other users known in the user data store 145, whose genetic data is stored in the DNA sample store 140, who are matches to the individual. The birth location and surname identification module 300 uses the match information to access genealogical data of the genetic matches of the individual 101 in order to identify possible surnames and birth locations for the ancestors of the individual 101.
  • Each of these modules is described in further detail below. The breakdown of the logical functions of the system 100 into the above-introduced modules is for clarity of description only. In other embodiments, the computer system 100 may comprise more or fewer modules, and the logical structure may be differently organized. The data stores may be represented in different ways in different embodiments, such as comma-separated text files, or as databases such as relational databases (SQL) or non-relational databases (NoSQL).
  • The network 120 facilitates communications amongst one or more client devices 160 and the system 100. The network 120 may be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet. In various embodiments, the network 120 uses standard communication technologies and/or protocols. Examples of technologies used by the network 120 include Ethernet, 802.11, 3G, 4G, 802.16, or any other suitable communication technology. The network 120 may use wireless, wired, or a combination of wireless and wired communication technologies. Examples of protocols used by the network 120 include transmission control protocol/Internet protocol (TCP/IP), hypertext transfer protocol (HTTP), simple mail transfer protocol (SMTP), file transfer protocol (FTP), or any other suitable communication protocol.
  • A client device 160 is a computing device capable of transmitting and/or receiving data via the network 120. In various embodiments, the client device 160 belongs to the individual 101 that provided the genetic sample. Examples of client devices 160 include desktop computers, laptop computers, tablet computers (pads), mobile phones, personal digital assistants (PDAs), gaming devices, or any other electronic device including computing functionality and data communication capabilities. The client device 160 may use a web browser 180, such as Microsoft Internet Explorer, Mozilla Firefox, Google Chrome, Apple Safari and/or Opera, as an interface to connect with the network 120. Additionally or alternatively, specialized application software 180 that runs natively on a mobile device is used as an interface to connect to the network 120.
  • The birth location and surname identification system 100 sends, through the network 120, a list of surnames or birthplaces identified by the system 100 to the client device 160 for presentation to the individual 101. The list of surnames or birthplaces may be presented by the client device 160 on a user interface or a display screen.
  • Although not necessarily a part of any particular illustrated module, a new user to the system 100 who is submitting their DNA among other data will activate a new account, often through a graphical user interface (GUI) provided through a mobile software application or a web-based interface. As part of the account activation process, the system 100 receives one or more types of basic personal information about the individual 101 such as age, date of birth, geographical location of birth (e.g., city, state, county, country, hospital, etc.), complete name including first, middle, and last names as well as any suffixes, and gender. This received user information is stored in the user data store 145, in association with the corresponding DNA samples stored in the DNA sample store 140.
  • II. Genetic Data Processing
  • To process the data stored in the DNA sample store 140 and estimate IBD from the DNA samples, the computing system 100 comprises an input data processing module 110 and an IBD estimation module 115. These modules are described in relation to FIG. 2, which is a flow diagram for the operation of the computer system 100 for estimating and storing estimated IBD in accordance with an embodiment.
  • II.a. DNA Sample Receipt and Account Creation
  • FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic, genealogical and survey input data. The input data processing module 110 is responsible for receiving, storing and processing data received from an individual 101 via the DNA extraction service 102. The input data processing module 110 includes a DNA collection module 210, a genealogical collection module 220, a genotype identification module 240, and a genotype phasing module 250.
  • The DNA collection module 210 is responsible for receiving sample data from external sources (e.g., extraction service 102), and for processing and storing the samples in the DNA sample store 140. The DNA sample store 140 may store one or more received DNA samples linked to a user as a <key, value> pair associated with the individual 101. In one instance, the <key, value> pair is <sampleID, “GA TC TC AA”>. The data stored in the DNA sample store 140 may be identified by one or more keys used to index one or more values associated with an individual 101. In one example, keys are a userID and sampleID, or alternatively another <key, value> pair is <userID, sampleID>. In various embodiments, the DNA sample store 140 stores a pointer to a location associated with the user data store 145 associated with the individual 101. The user data store 145 will be further described below.
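  • The <key, value> layout described above can be pictured with two in-memory mappings. This is only a sketch: the dictionary names and helper functions are hypothetical, and a real store could equally be a text file or a relational or non-relational database.

```python
from typing import Dict, List

# <sampleID, genotype string> pairs, e.g. <"sample-1", "GA TC TC AA">.
sample_store: Dict[str, str] = {}
# <userID, sampleID> pairs; a user may have more than one sample.
user_samples: Dict[str, List[str]] = {}


def store_sample(user_id: str, sample_id: str, genotypes: str) -> None:
    """Record a sample and link it to the user who provided it."""
    sample_store[sample_id] = genotypes
    user_samples.setdefault(user_id, []).append(sample_id)


def samples_for_user(user_id: str) -> List[str]:
    """Return the genotype strings of every sample linked to a user."""
    return [sample_store[s] for s in user_samples.get(user_id, [])]
```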
  • II.b. Genealogical Data
  • The genealogical collection module 220 both receives and processes genealogical data and stores the data in the user data store 145. This data may be received for the individual 101, and may have been received in the past for other users of the system, some of whom may be determined to be genetic matches to the individual 101.
  • The genealogical data may include a variety of different types of information. The genealogical data can take the form of a pedigree of a user (e.g., the recorded relationships in a family). To collect the data, the genealogical collection module 220 may be configured to provide an interactive GUI that asks the user questions or provides a menu of options, and receives user input that can be processed to obtain the genealogical data. Examples of genealogical data that may be collected include, but are not limited to, names (first, last, middle, suffixes), birth locations (e.g., county, city, state, country, hospital, global map coordinates), date of birth, date of death, marriage information, family relations (manually provided rather than genetically identified), etc. These data may be manually provided or automatically extracted via, for example, optical character recognition (OCR) performed on census records, town or government records, or any other item of printed or online material.
  • The pedigree information associated with a user may include a genealogical graph. For example, the genealogical graph may include one or more specified nodes. Each specified node in the genealogical graph represents either the user or an ancestor of the user that could have passed down genetic material to the user.
  • The pedigree information provided by users may or may not be accurate or complete. The genealogical collection module 220 is responsible for filtering the received pedigree data based on one or more quality criteria in an effort to discard lower quality genealogical data. For example, the genealogical collection module 220 may filter the received pedigree data by excluding all pedigree nodes associated with a stored DNA sample that do not satisfy all of the following criteria: (1) the recorded death date for the linked pedigree node corresponds to official records (when available); (2) the gender is the same as the gender provided by the user; and (3) the birth date is within 3 years of the birth date provided by the user. The user may be prompted via GUI to resolve any discrepancies identified by module 220. In some embodiments, all received genealogical data marked as “private” are excluded from any subsequent analysis to ensure that any privacy requirements imposed on the data are met.
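  • The three quality criteria and the privacy exclusion can be expressed as a single predicate, sketched below. The record types, field names, and the conversion of "within 3 years" to a day count are assumptions made for illustration; the text does not specify how the data is represented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LinkedNode:
    """Pedigree node linked to a stored DNA sample (hypothetical fields)."""
    gender: str
    birth_date: date
    death_date: Optional[date] = None
    official_death_date: Optional[date] = None  # None when no record exists
    private: bool = False


@dataclass
class UserProfile:
    """Information the user supplied at account activation."""
    gender: str
    birth_date: date


def passes_quality_checks(node: LinkedNode, user: UserProfile,
                          max_birth_gap_years: int = 3) -> bool:
    if node.private:
        return False  # private data is excluded from later analysis
    # (1) death date must agree with official records when they are available
    if node.official_death_date is not None and node.death_date != node.official_death_date:
        return False
    # (2) gender must match what the user provided
    if node.gender != user.gender:
        return False
    # (3) birth dates must be within the allowed gap (approximated in days)
    gap_days = abs((node.birth_date - user.birth_date).days)
    return gap_days <= max_birth_gap_years * 365.25
```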
  • II.c. Processing and Phasing DNA Samples
  • The genotype identification module 240 accesses the collected DNA data from the DNA collection module 210 or the sample store 140 and identifies autosomal SNPs so that the individual's diploid genotype on autosomal chromosomes can be computationally phased. The genotype identification module 240 provides the identified SNPs to the genotype phasing module 250 which phases the individual's diploid genotype based on the set of identified SNPs. The genotype phasing module 250 generates a pair of estimated haplotypes for each diploid genotype. The estimated haplotypes are then stored in the user data store 145 in association with the individual 101, and may also be stored in association with or verified against the genotypes of the individual's parents, who may also have their own separate accounts in the system 100. A variety of different computational phasing techniques may be used including, for example, the techniques described in U.S. Patent Application No. 2016/061,568, filed on Jan. 17, 2014, which is hereby incorporated by reference in its entirety. The phasing module 250 stores phased genotypes in the user data store 145.
  • II.d. IBD Estimation
  • The IBD estimation module 115 is responsible for identifying IBD segments (also referred to as IBD estimates) from phased genotype data (haplotypes) between the individual 101 and a user stored in the user data store 145. IBD segments are chromosome segments identified in the individual 101 and a user that are putatively inherited from a recent common ancestor. Typically, an individual 101 and a user who are closely related share a relatively large number of IBD segments, and the IBD segments tend to have greater length (individually or in aggregate across one or more chromosomes). Conversely, an individual 101 and a user who are more distantly related share relatively few IBD segments, and these segments tend to be shorter (individually or in aggregate across one or more chromosomes).
  • In one embodiment, the IBD estimation algorithm used by the IBD estimation module 115 to estimate (or infer) IBD segments between an individual 101 and a user is as described in U.S. patent application Ser. No. 14/029,765, filed on Sep. 17, 2013, which is hereby incorporated by reference in its entirety. A further processing step may be performed on these inferred IBD segments by applying the technique described in PCT Patent Application No. PCT/US2015/055579, filed on Oct. 14, 2015, which is hereby incorporated by reference in its entirety. The identified IBD segments are stored in the user data store 145 in association with the individual 101.
  • The IBD estimation module 115 is configured to estimate IBD segments between the individual 101 and large numbers of users stored in the user data store 145. In some embodiments of this module, the computing system has been optimized to efficiently handle large amounts of IBD data. Said another way, IBD is estimated across a large number of individuals based on their DNA. For example, in one implementation, the IBD estimation module 115 (and computing system 100 generally) distributes IBD computations over a Hadoop computing cluster, internal to or external from computing system 100, and stores the phased genotypes used in the IBD computations in a database so that IBD estimates for new accounts/individuals can be quickly compared to previously processed individuals.
  • III. Birth Location and Surname Identification System
  • III.a. Identifying Genetic Matches
  • FIG. 3 depicts the birth location and surname identification module 300, in accordance with an embodiment. The birth location and surname identification module 300 includes a genetic match module 305, a location frequency calculator 310, a surname frequency calculator 315, a statistical analysis module 320, an enrichment score module 325, a list generation module 330, a location frequency store 350, a surname frequency store 360, a location score store 380, and a surname score store 390.
  • The genetic match module 305 retrieves the IBD estimates between the individual 101 and the users in the user data store 145 and determines whether the individual 101 and any given other user are a genetic match. In one embodiment, the individual 101 and a user are a match if they have higher than a threshold amount of IBD segment sharing, as determined by the IBD estimation module 115. A match may indicate that the individual 101 and the user are related (e.g. parent/child, sibling, aunt/uncle, first cousin, first cousin once removed, second cousin, second cousin once removed). The genetic match module 305 identifies all users in the user data store 145 that are considered matches to the individual 101. The set of user matches is referred to herein by Mu. In an example, the number of matches is limited to the top 3000 matches (i.e. |Mu|≤3000) sorted based on amount of IBD segment sharing.
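  • The selection of the match set Mu can be sketched as a threshold followed by a cap on the number of matches. The function name, the representation of IBD sharing as a single number per user, and the threshold value are hypothetical; only the cap of 3000 matches comes from the example above.

```python
from typing import Dict, List


def select_matches(ibd_sharing: Dict[str, float],
                   min_sharing: float,
                   max_matches: int = 3000) -> List[str]:
    """Return user IDs whose IBD sharing exceeds the threshold.

    ibd_sharing maps a user ID to an assumed scalar measure of IBD segment
    sharing with the individual. Results are sorted from most to least
    sharing and truncated to max_matches.
    """
    above = [(user, amount) for user, amount in ibd_sharing.items()
             if amount > min_sharing]
    above.sort(key=lambda pair: pair[1], reverse=True)
    return [user for user, _ in above[:max_matches]]
```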
  • III.b. Match and Background Frequency for a Birth Location
  • The genetic match module 305 provides the set of user genetic matches, Mu, to the location frequency calculator 310 to determine an identification of possible birth locations associated with the user. The location frequency calculator 310 determines how frequently a particular birth location appears amongst the pedigrees of the users within the set of user matches, Mu. To do this, the location frequency calculator 310 retrieves, for each matching user in the set of user matches Mu, the matching user's pedigree. The pedigree includes a genealogical graph from the user data store 145. For example, a genealogical graph in the pedigree may be the matching user's family tree that describes the relationship between the matching user and each of the matching user's relatives. Each relative in the matching user's pedigree has associated genealogical data such as the relative's birth location.
  • For a given matching user v ∈ Mu, Tv denotes a set of birth locations indicated in matching user v's pedigree. For each matching user, the location frequency calculator 310 identifies the set of birth locations, Tv, in the matching user's pedigree. For example, the location frequency calculator 310 may identify a matching user v as having 10 relatives born in New York City, 2 relatives born in Boston, and 2 relatives born in Los Angeles. Therefore, the elements in Tv include New York City, Boston, and Los Angeles. A presence indicator av,i, may be represented by the indicator function representing whether a birth location i is indicated in matching user v's pedigree:
  • $$a_{v,i} = \begin{cases} 1 & \text{if } i \in T_v \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
  • Thus, for this example, matching user v has an indicator function score of 1 for birth locations of New York City, Boston, and Los Angeles. All other birth locations (e.g. Washington D.C., elsewhere) would have a presence indicator and corresponding indicator function score of 0.
  • The location frequency calculator 310 repeats this process for the set of matches Mu. For a given birth location, i, the total number of pedigrees (mi) of users in the set of matches that have this birth location is determined according to:
  • $$m_i = \sum_{v \in M_u} a_{v,i} \tag{2}$$
  • That is, the location frequency calculator 310 sums the indicator function scores over all matching users for each birth location. Thus, if the number of matching users is 1000, the maximum number of pedigrees (max(mi)) that can have the birth location is 1000, which would occur if every user in the set of matches Mu has the birth location in their pedigree.
  • The location frequency calculator 310 uses the total number of pedigrees mi, to determine pi, the match frequency of a birth location i, where pi is determined according to:

  • $$p_i = \frac{m_i}{|M_u|} \tag{3}$$
  • where |Mu| is the total number of matching users in the set Mu. Returning to the previous example, if all 1000 users in the set of matches Mu (i.e. |Mu|=1000) had the birth location of New York City in their pedigree (i.e. mi=1000), then the match frequency of New York City is 1. Therefore, the match frequency of a birth location represents how often matching users in the set of matches Mu are associated with the birth location, which can be used as a way of determining an association between the ancestors of the individual 101 and the birth location. This match frequency is stored in the location frequency store 350.
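  • Equations (1) through (3) can be sketched in a few lines. This is an illustrative sketch only: the function name `match_frequencies` and the input layout (a dictionary from a matching user's ID to the birth locations listed in that user's pedigree) are assumptions, not structures named in the text.

```python
from collections import Counter
from typing import Dict, Iterable


def match_frequencies(
        birth_locations_by_match: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Compute p_i = m_i / |M_u| for every birth location i seen in the
    pedigrees of the matching users."""
    n_matches = len(birth_locations_by_match)
    if n_matches == 0:
        return {}
    pedigree_counts: Counter = Counter()
    for locations in birth_locations_by_match.values():
        # set(...) implements the presence indicator a_{v,i}: a location
        # counts once per pedigree, however many relatives were born there.
        pedigree_counts.update(set(locations))
    # m_i divided by |M_u| gives the match frequency of equation (3).
    return {loc: m_i / n_matches for loc, m_i in pedigree_counts.items()}
```

  • Because each pedigree contributes at most one count per location, a matching user with ten relatives born in New York City raises the match frequency of New York City by the same amount as a matching user with one such relative.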
  • The location frequency calculator 310 also calculates a background frequency for each birth location i. The background frequency of a birth location provides an indication as to how often the birth location appears amongst the greater population of users stored in the system, including those who are not matches to the individual. For example, high population cities such as New York City or Boston may have higher background frequencies than smaller cities such as Cheyenne, Wyo. Here, D represents the total set of users in the system. Generally the number of users in set D is significantly larger (e.g. multiple orders of magnitude larger) than the number of matching users in the set of matches Mu. Each user in D may have a corresponding pedigree. Altogether, this forms the set of all pedigrees stored in the user data store 145.
  • To determine the background frequency, the location frequency calculator 310 may use a similar indicator function as was previously shown in equation (1) to calculate whether a birth location i, exists in the pedigree corresponding to user w in the set D:
  • $$a_{w,i} = \begin{cases} 1 & \text{if birth location } i \text{ exists in user } w\text{'s pedigree} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$
  • The location frequency calculator 310 counts the pedigrees that include the birth location, each pedigree corresponding to a user w in the set D, and divides that count by the total number of users in D. The background frequency of a birth location, i, is therefore expressed in equation (5) as
  • $$q_i = \frac{1}{|D|} \sum_{w \in D} a_{w,i} \tag{5}$$
  • This background frequency is stored in the location frequency store 350.
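  • Because D may be several orders of magnitude larger than Mu, one way to compute the background frequency of equation (5) is a single streaming pass over all pedigrees. The sketch below assumes each pedigree has already been reduced to its set of birth locations; the function name `background_frequencies` and the `locations` argument (the birth locations of interest, for example those seen among the matches) are hypothetical.

```python
from typing import Dict, Iterable, Set


def background_frequencies(all_pedigrees: Iterable[Set[str]],
                           locations: Set[str]) -> Dict[str, float]:
    """Compute q_i = (1/|D|) * sum_w a_{w,i} for each location of interest,
    reading the pedigrees of D one at a time."""
    counts = {loc: 0 for loc in locations}
    total_users = 0
    for pedigree_locations in all_pedigrees:
        total_users += 1
        # a_{w,i} = 1 for every requested location present in this pedigree.
        for loc in locations & set(pedigree_locations):
            counts[loc] += 1
    if total_users == 0:
        return {loc: 0.0 for loc in locations}
    return {loc: count / total_users for loc, count in counts.items()}
```

  • Since the background frequency does not depend on the individual 101, it could be computed once and reused across individuals; whether the described system does so is not stated in the text.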
  • III.c. Match and Background Frequency for a Surname
  • The surname frequency calculator 315 calculates a match frequency and background frequency for each surname in the pedigrees of matching users in a similar fashion as was discussed for birth locations in section III.b. The surname frequency calculator 315 receives the set of user matches, Mu, from the genetic match module 305 for an individual 101 and determines a match frequency, pj, that represents how often a given surname, j (e.g. Bradley, O'Malley, Johnson), appears amongst the pedigrees of users, v, in the set of user matches, Mu. For example, for a surname “Bradley” (i.e. j=“Bradley”),
  • $$p_j = \frac{\sum_{v \in M_u} a_{v,j}}{|M_u|} \tag{6}$$
  • where av,j is the indicator function of equation (1) applied to surname j rather than to a birth location. The surname frequency calculator 315 also calculates the background frequency for the surname “Bradley” in the total set of users in the system, D:
  • $$q_j = \frac{1}{|D|} \sum_{w \in D} a_{w,j} \tag{7}$$
  • The match frequency, pj, and background frequency, qj, may each be stored in the surname frequency store 360.
  • Often, there will be many more surnames in the user data store 145 than birth locations. Many of those surnames will be similar to others, with only minor variations in spelling. Unmodified, these variant spellings may reduce the efficacy of the surname frequency calculation. To address this, the surname frequency calculator 315 may first normalize the surnames, meaning that the surname frequency calculator 315 may treat many alternate spellings as the same surname for purposes of frequency calculations. Examples of such alternate spellings include use of characters not used in English (e.g., “o” versus “ø”); capitalization, punctuation, and spacing (“O'Malley” versus “Omalley”); suffixes (“Jr.”); and commentary (“Johnson (WWII Veteran)”). In one example, a simple normalization is performed that ignores capitalization and punctuation and removes commentary, thereby reducing the set of surnames under consideration. Alternate spellings and misspellings that this simple normalization does not cover may still be interpreted by the surname frequency calculator 315 as different surnames.
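  • A normalization along these lines might look like the sketch below. It is an assumption-laden illustration rather than the calculator's actual rules: the function name `normalize_surname`, the suffix list, and the explicit “ø” to “o” mapping are all hypothetical choices made to cover the examples given in the text.

```python
import re
import unicodedata

# Hypothetical list of generational suffixes to drop; not from the source.
_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_surname(raw: str) -> str:
    """Map spelling variants of a surname to one canonical key."""
    # Remove parenthetical commentary such as "(WWII Veteran)".
    s = re.sub(r"\([^)]*\)", " ", raw)
    # "ø" has no Unicode decomposition, so map it explicitly.
    s = s.replace("ø", "o").replace("Ø", "O")
    # Decompose accented letters and drop the combining marks.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Ignore capitalization, then drop punctuation and digits.
    s = s.lower()
    s = "".join(ch for ch in s if ch.isalpha() or ch.isspace())
    tokens = s.split()
    # Drop suffix tokens, but never the first token of the name.
    kept = tokens[:1] + [t for t in tokens[1:] if t not in _SUFFIXES]
    # Ignore spacing by joining the remaining tokens.
    return "".join(kept)
```

  • With this sketch, “O'Malley” and “Omalley” map to the same key, while a genuine misspelling such as “Omaley” would still be counted as a separate surname.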
  • III.d Calculating the Statistical Likelihood
  • The statistical analysis module 320 identifies which birth locations and surnames are sufficiently notable for the individual 101 under consideration so as to merit possibly providing to the individual 101 as likely being associated with their own ancestors.
  • For example, there may be a total of 1000 users in the set of matches, Mu. Assume that 10% of the users that are in the set of matches, Mu, have a particular birth location in their pedigree (i.e. mi=100, therefore pi=0.1). A match frequency, pi, of 10% may appear to be a very high number of appearances for a birth location. However, if the background frequency, qi, is also close to 10%, meaning that the birth location appears approximately equally frequently in the pedigrees of all users in the system, then a match frequency, pi, of 10% may not be sufficiently notable to be worth identifying as associated with the individual.
  • The statistical analysis module 320 receives the match frequency, pi, and background frequency, qi, for all different birth locations, i, from the location frequency calculator 310. The statistical analysis module 320 conducts a statistical analysis test to determine whether the match frequency of a given birth location is sufficiently notable. For each birth location, i, the statistical analysis module 320 determines the likelihood of observing the received match frequency, pi, and background frequency, qi under a null hypothesis H0 scenario. An example of a null hypothesis H0 is the assumed scenario where the match frequency and background frequency are the same (i.e. pi=qi). Conversely, the alternative hypothesis H1 is the assumed scenario where the match frequency and background frequency are non-equal (i.e. pi≠qi), with the assumption that if, particularly, pi>qi, then pi, and thus i, may be statistically significant and therefore worth possibly providing to the individual.
  • Therefore, the statistical analysis module 320 determines the likelihood of observing the received match frequency and background frequency under the assumption that the match frequency, pi, and background frequency, qi, are equal. However, if the received match frequency is sufficiently larger than the received background frequency, then the null hypothesis H0 is rejected in favor of the alternative hypothesis H1. What constitutes a sufficient difference between the received match frequency, pi, and background frequency, qi, will be discussed further below in regards to the summary statistic Si.
  • A similar calculation may be performed for surnames by receiving the match frequency, pj, and background frequency, qj, for all surname identifications, j, from the surname frequency calculator 315. The subsequent discussion focuses on conducting a statistical test for a birth location, i, but applies equally to conducting a statistical test for a surname.
  • The statistical test is performed under a null hypothesis H0, the assumed scenario where the match frequency, pi, and the background frequency, qi, are equal. In various embodiments, the statistical analysis module 320 conducts a maximum likelihood ratio test. In other examples, the statistical analysis test may be a Pearson's chi-squared test, a Z-test, or an F-test. The test statistic, Λ, for the maximum likelihood ratio test is determined according to:
  • $$\Lambda = \frac{L(m_i \mid H_0)}{\max_{p \in (0,1)} L(m_i \mid H_1)} \tag{8}$$
  • where L(mi|H0) denotes the likelihood of observing mi under the null hypothesis that the match frequency and background frequency are equal (i.e. pi=qi) for a birth location i, and the denominator, the maximum of L(mi|H1) over p ∈ (0,1), denotes the largest likelihood of observing mi under the alternative hypothesis when varying p between 0 and 1. Thus, the test statistic is a ratio between a first likelihood of observing the match frequency and background frequency under the null hypothesis and a second likelihood of observing the match frequency and background frequency under the alternative hypothesis.
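  • The text does not state the form of the likelihood L. One reading that is consistent with the closed form given in equation (9) is a binomial model, in which each of the |Mu| matching pedigrees independently contains birth location i with probability p. Under that assumption (which is ours, not the source's), the two likelihoods and the resulting ratio can be written as:

```latex
% Assumed binomial likelihood of observing m_i pedigrees with location i
L(m_i \mid p) = \binom{|M_u|}{m_i}\, p^{m_i} (1 - p)^{|M_u| - m_i}

% Null hypothesis: p is fixed at the background frequency
L(m_i \mid H_0) = L(m_i \mid p = q_i)

% Alternative hypothesis: the likelihood is maximized at the observed match frequency
\hat{p} = \arg\max_{p \in (0,1)} L(m_i \mid p) = \frac{m_i}{|M_u|} = p_i

% The binomial coefficient cancels in the ratio
\Lambda = \frac{q_i^{\,m_i} (1 - q_i)^{|M_u| - m_i}}{p_i^{\,m_i} (1 - p_i)^{|M_u| - m_i}}
```

  • Taking the negative logarithm of this ratio gives a sum of two terms, one weighted by the number of matching pedigrees that contain the location and one weighted by the number that do not.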
  • A summary statistic, Si, is determined using Λ according to:
  • $$S_i = -\log(\Lambda) = m_i \log\frac{p_i}{q_i} + (|M_u| - m_i) \log\frac{1 - p_i}{1 - q_i} \tag{9}$$
  • The statistical analysis module 320 calculates a summary statistic for each birth location i. Note that if the match frequency, pi, and background frequency, qi, for a birth location received from the location frequency calculator 310 are equal, then the value of the summary statistic is zero. Additionally, the summary statistic Si increases in magnitude as the difference between the match frequency, pi, and background frequency, qi, increases in magnitude.
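  • The summary statistic of equation (9) can be computed directly from the counts. The sketch below assumes natural logarithms and treats a term with a zero count as contributing zero; the function name `summary_statistic` and its argument names are hypothetical.

```python
import math


def summary_statistic(m_i: int, n_matches: int, q_i: float) -> float:
    """Compute S_i = m_i*log(p_i/q_i) + (|M_u| - m_i)*log((1-p_i)/(1-q_i)),
    where p_i = m_i / |M_u| and |M_u| = n_matches."""
    if n_matches <= 0 or not 0 <= m_i <= n_matches:
        raise ValueError("need 0 <= m_i <= n_matches and n_matches > 0")
    if not 0.0 < q_i < 1.0:
        raise ValueError("background frequency must lie strictly in (0, 1)")
    p_i = m_i / n_matches
    s = 0.0
    # Each term is skipped when its count is zero (0 * log 0 taken as 0).
    if m_i > 0:
        s += m_i * math.log(p_i / q_i)
    if m_i < n_matches:
        s += (n_matches - m_i) * math.log((1.0 - p_i) / (1.0 - q_i))
    return s
```

  • For instance, with 1000 matches of which 100 contain the location (pi = 0.1), a background frequency of 0.1 gives a summary statistic of zero, while a background frequency of 0.05 gives a value of roughly 20.65.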
  • According to Wilks' theorem, as the sample size increases, twice the summary statistic, 2Si, approaches a first order chi-squared distribution (that is, a chi-squared distribution with one degree of freedom) under the null hypothesis. Therefore, the statistical analysis module 320 may calculate the p-value for rejecting the null hypothesis H0 based on the first order chi-squared distribution of 2Si.
  • For example, at a significance level of 0.995 (p-value=5×10⁻³), the null hypothesis is rejected if Si>4 (or 2Si>8) based on the first order chi-squared distribution, whose exact critical value at this level is approximately 7.88. In other words, if the match frequency is sufficiently larger than the background frequency for a particular birthplace or surname such that the summary statistic Si>4, then the alternative hypothesis (i.e. where the match frequency does not equal the background frequency) is accepted. This indicates that the particular birthplace or surname is sufficiently notable to be associated with the ancestors of the individual 101.
  • The exact value of the significance level may vary by implementation, or according to more specific factors. Also, although the above embodiment describes the significance level as being a p-value, in practice it may be any threshold which determines whether or not a particular birth location i or surname j is sufficiently statistically significant to merit consideration for providing to the individual 101.
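  • For a chi-squared distribution with one degree of freedom, the upper-tail probability has a closed form in terms of the complementary error function, so the p-value can be computed with the Python standard library alone. The function names below are hypothetical, and the sketch is only one way of carrying out the calculation described above.

```python
import math


def p_value_from_summary(s_i: float) -> float:
    """Upper-tail p-value of 2*S_i under a chi-squared distribution with
    one degree of freedom.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)); substituting
    x = 2*S_i gives erfc(sqrt(S_i)).
    """
    if s_i < 0.0:
        raise ValueError("the summary statistic cannot be negative")
    return math.erfc(math.sqrt(s_i))


def rejects_null(s_i: float, p_threshold: float = 0.005) -> bool:
    """True when the null hypothesis (p_i = q_i) is rejected at the given
    p-value threshold (0.005 in the example from the text)."""
    return p_value_from_summary(s_i) < p_threshold
```

  • At Si = 4 this gives a p-value of about 0.0047, just below the 5×10⁻³ threshold, which agrees with the rule of thumb Si>4 used in the example.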
  • For birth locations specifically, the statistical analysis module 320 may adjust the significance level (e.g., p-value) for a birth location i, based on the country of origin of the birth location. In various embodiments, the country of origin of the birth location i is determined based on the latitude and longitude coordinates associated with the birth location. More specifically, the particular significance level for a country of origin is chosen based on the number of users in the database associated with that country in their respective pedigrees and the number of matches that a given individual has in the database that are annotated with a pedigree attached to them. For example, a birth location that derives from a country having a large number of users associated with that country (e.g., United States or Nordic countries) may utilize a relatively high significance level (e.g., 0.995), whereas a birth location that derives from a country having relatively few users associated with that country (e.g., Mexico, Russia, Eastern Europe) may utilize a relatively lower significance level (e.g., 0.9). Thus, depending upon the country, the threshold needed to determine whether the difference between the birth location match frequency versus the corresponding background frequency is statistically significant may vary. This allows the module 300 to better take into account the relative availability of data regarding a particular country in determining whether or not particular birth locations are statistically significant.
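  • A country-specific significance level can be turned into a cutoff on the summary statistic Si by inverting the first order chi-squared distribution. The sketch below does this with the standard normal quantile function; the table `LEVEL_BY_COUNTRY`, the default level, and the function names are hypothetical, with only the two example levels (0.995 and 0.9) taken from the text.

```python
from statistics import NormalDist

# Illustrative levels only; the text gives 0.995 and 0.9 as examples.
LEVEL_BY_COUNTRY = {"United States": 0.995, "Mexico": 0.9}
DEFAULT_LEVEL = 0.995  # assumption: fall back to the stricter level


def summary_threshold(level: float) -> float:
    """Return the cutoff on S_i such that 2*S_i equals the chi-squared
    (one degree of freedom) quantile at `level`."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    # If X = Z**2 with Z standard normal, the `level` quantile of X is the
    # square of the (1 + level) / 2 quantile of Z.
    z = NormalDist().inv_cdf((1.0 + level) / 2.0)
    return z * z / 2.0


def threshold_for_country(country: str) -> float:
    """Look up the level for a country and convert it to an S_i cutoff."""
    return summary_threshold(LEVEL_BY_COUNTRY.get(country, DEFAULT_LEVEL))
```

  • Under these assumptions, a level of 0.995 corresponds to a cutoff of about 3.94 on Si (close to the value of 4 used above), while a level of 0.9 corresponds to a cutoff of about 1.35, so birth locations in sparsely represented countries need less evidence to be reported.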
  • III.e Calculating the Enrichment Score
  • The statistical analysis module 320 determines which birth locations and surnames are statistically significant given the information known about them from the underlying pedigree data from users genetically matched to an individual 101. The enrichment score module 325 uses this binary determination of statistical significance to determine an enrichment score representing a strength of association between the birth location or surname and the ancestors of the individual 101.
  • To do this, the enrichment score module 325 determines an enrichment score, xi, for each birth location, i, or enrichment score, xj, for each surname, j. The enrichment score module 325 receives the summary statistic, Si, for each birth location or the summary statistic, Sj, for each surname. Additionally, the enrichment score module 325 receives the match frequency, pi or pj, and the background frequency, qi or qj, for birth locations and surnames.
  • To calculate an enrichment score of a birth location, i, at the previously selected significance level of 0.995, the enrichment score module 325 calculates the enrichment score to be:
  • $$x_i = \begin{cases} p_i \log\frac{p_i}{q_i} & \text{if } S_i > 4 \\ 0 & \text{otherwise} \end{cases} \tag{10}$$
  • The exact form of the calculation may vary in practice; in particular, the significance level may vary by country as described above. Note that if the match frequency, pi, and the background frequency, qi, are not significantly different, then the enrichment score is zero or close to zero, indicating that the particular birth location may not be very relevant to ancestors of the individual 101. Scaling the match frequency by a factor of log(p/q) reduces bias towards highly popular birth locations and surnames because they are likely to have a high background frequency (high q) as well, thereby reducing the enrichment score.
  • In one embodiment, the enrichment score module 325 calculates the enrichment score for a surname, j, in the same manner according to equation 10. In another embodiment, the enrichment score module 325 calculates the enrichment score for a surname, j, as
  • $$x_j = p_j \log\frac{p_j}{q_j} \tag{10}$$
  • for all summary statistic values.
  • The respective enrichment scores for a birth location and surname are stored in the location score store 380 and surname score store 390, respectively.
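  • Both variants of the enrichment score can be expressed with one function. The sketch below assumes natural logarithms; the function name `enrichment_score` and the `gate` flag, which switches between the thresholded form and the ungated surname form, are hypothetical.

```python
import math


def enrichment_score(p: float, q: float, s: float,
                     s_threshold: float = 4.0,
                     gate: bool = True) -> float:
    """Return p * log(p / q), optionally gated on the summary statistic.

    With gate=True the score is zero unless s > s_threshold (the
    thresholded form of equation 10). With gate=False the score is
    computed for all summary statistic values (the ungated surname form).
    """
    if not 0.0 < q < 1.0:
        raise ValueError("background frequency must lie strictly in (0, 1)")
    if p <= 0.0:
        # No matching pedigree contains the item, so there is nothing to score.
        return 0.0
    if gate and s <= s_threshold:
        return 0.0
    return p * math.log(p / q)
```

  • The default cutoff of 4 corresponds to the significance level of 0.995 used in the example; a country-specific cutoff could be passed as `s_threshold` instead.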
  • III.f Generating a List of Identified Birth Locations or Surnames
  • The enrichment score module 325 provides the enrichment score associated with each birth location or surname to the list generation module 330. The list generation module 330 may rank and/or provide one or more birth locations or surnames to the individual through client device 160 based on their associated enrichment scores. Exactly how the list generation module 330 provides birth locations and surnames, how many, and in what form (e.g., lists, etc.) may vary by implementation. In one embodiment, the list generation module 330 may set a minimum threshold in order for a particular birth location or surname to be recommended. For example, a birth location or surname may have to meet a certain minimum threshold enrichment score, and/or must appear in at least some number of pedigrees (e.g., mi≥3) for it to be recommended.
  • In one embodiment, the list generation module 330 generates a list 370 including only the top N birth locations or surnames by enrichment score. The list 370 is sent through the network 120 to the client device 160 for consumption by the individual 101.
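  • The filtering and ranking performed by the list generation module 330 can be sketched as follows. The function name `top_n` and its parameters are hypothetical; the minimum pedigree count of 3 mirrors the example in the text, and the minimum score is left for the caller to choose.

```python
from typing import Dict, List, Tuple


def top_n(scores: Dict[str, float],
          pedigree_counts: Dict[str, int],
          n: int,
          min_score: float,
          min_pedigrees: int = 3) -> List[Tuple[str, float]]:
    """Return up to `n` (item, enrichment score) pairs, highest score first.

    An item (birth location or surname) is eligible only if its score is at
    least `min_score` and it appears in at least `min_pedigrees` matching
    pedigrees (m_i >= min_pedigrees).
    """
    eligible = [
        (item, score) for item, score in scores.items()
        if score >= min_score
        and pedigree_counts.get(item, 0) >= min_pedigrees
    ]
    # Rank by enrichment score, largest first.
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return eligible[:n]
```

  • The pedigree-count filter guards against items that score highly only because they are extremely rare in the background population and happen to appear in one or two matching pedigrees.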
  • Identifying a Birth Location or Surname
  • FIG. 4 illustrates a process of providing an identified birth location or surname to an individual, according to one embodiment. The birth location and surname identification system 100 receives 405 a sequence of genetic data from an individual 101. The system 100 identifies users in the system 100 that are genetic matches with the individual 101. This may be accomplished by identifying DNA segment matches on the pair of haplotypes for the individual and the pair of haplotypes for users retrieved from the user data store 145.
  • Each user in the system 100 has a corresponding pedigree stored in the user data store 145. From the set of all pedigrees in the user data store 145, a subset of matching pedigrees is identified. Each pedigree in the subset of matching pedigrees is associated with a user that is a genetic match of the individual 101. The system 100 determines 420 a match frequency, p, of the birth location or surname amongst the subset of matching pedigrees. Additionally, the system 100 determines 425 a background frequency, q, of the birth location or surname amongst the set of all pedigrees.
  • The system 100 identifies 430 the likelihood of observing the match frequency and background frequency for a birth location or surname under the assumed scenario that the match frequency and background frequency are equal. The statistical analysis module 320 of the system 100 conducts a statistical test and determines 435 an enrichment score for each birth location or surname. Based on the statistical test and the enrichment score, the system may provide 440 the birth location or surname to the individual 101.
  • Additional Considerations
  • The birth location and surname identification system 100 is implemented using one or more computers having one or more processors executing application code to perform the steps described herein, and data may be stored on any conventional non-transitory storage medium and, where appropriate, include a conventional database server implementation. For purposes of clarity and because they are well known to those of skill in the art, various components of a computer system, for example, processors, memory, input devices, network devices and the like are not shown in FIG. 1. In some embodiments, a distributed computing architecture is used to implement the described features.
  • In addition to the embodiments specifically described above, those of skill in the art will appreciate that the invention may additionally be practiced in other embodiments. Within this written description, the particular naming of the components, capitalization of terms, the attributes, data structures, or any other programming or structural aspect is not mandatory or significant unless otherwise noted, and the mechanisms that implement the described invention or its features may have different names, formats, or protocols. Further, the system may be implemented via a combination of hardware and software, as described, or entirely in hardware elements. Also, the particular division of functionality between the various system components described here is not mandatory; functions performed by a single module or system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component. Likewise, the order in which method steps are performed is not mandatory unless otherwise noted or logically required. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
  • Algorithmic descriptions and representations included in this description are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules or code devices, without loss of generality.
  • Unless otherwise indicated, discussions utilizing terms such as “selecting” or “determining” or “estimating” or the like refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
  • Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.

Claims (20)

1. A method comprising:
receiving a genetic dataset of an individual;
identifying a set of related individuals who are related to the individual based on the genetic dataset, wherein the set of related individuals are identified through an Identity-By-Descent (IBD) estimation based on shared segments of DNA data between the genetic dataset of the individual and genetic datasets of the set of related individuals wherein the shared segments are from phased genetic data of the individual and the set of related individuals;
identifying a genetic group associated with the individual, the genetic group containing the set of related individuals;
identifying a surname associated with the genetic group, wherein the surname is determined to be a significant surname to the genetic group based on a frequency of the surname in the genetic group; and
outputting the surname.
2. The method of claim 1, wherein the genetic group corresponds to a geographic region.
3. The method of claim 1, wherein the surname is associated with a significance score that indicates a significance level of the surname to the genetic group.
4. The method of claim 3, wherein the significance score is determined based on a match frequency and a background frequency associated with the surname.
5. The method of claim 1, further comprising:
identifying one or more additional surnames that are significant to the genetic group;
outputting the one or more additional surnames.
6. The method of claim 5, wherein the one or more additional surnames are each associated with a significance score to the genetic group and the one or more additional surnames are outputted in an order based on the significance scores.
7. The method of claim 1 wherein the phased genetic data comprises a pair of haplotypes for the individual.
8. The method of claim 7 wherein the set of related individuals are IBD matches based on the pair of haplotypes for the individual and haplotypes of the set of related individuals.
9. The method of claim 1, wherein identifying a surname associated with the genetic group comprises:
accessing one or more of pedigrees of the set of related individuals, each pedigree comprising a genealogical graph of relatives for a member of the set of related individuals;
identifying a frequency of the surname in the genetic group and in the one or more of pedigrees; and
determining the surname is significant based on the frequency.
10. A non-transitory computer-readable medium comprising computer program code, the computer program code when executed by a processor causing the processor to perform steps comprising:
receiving a genetic dataset of an individual;
identifying a set of related individuals who are related to the individual based on the genetic dataset, wherein the set of related individuals are identified through an Identity-By-Descent (IBD) estimation based on shared segments of DNA data between the genetic dataset of the individual and genetic datasets of the set of related individuals wherein the shared segments are from phased genetic data of the individual and the set of related individuals;
identifying a genetic group associated with the individual, the genetic group containing the set of related individuals;
identifying a surname associated with the genetic group, wherein the surname is determined to be a significant surname to the genetic group based on a frequency of the surname in the genetic group; and
outputting the surname.
11. The non-transitory computer-readable medium of claim 10, wherein the genetic group corresponds to a geographic region.
12. The non-transitory computer-readable medium of claim 10, wherein the surname is associated with a significance score that indicates a significance level of the surname to the genetic group.
13. The non-transitory computer-readable medium of claim 12, wherein the significance score is determined based on a match frequency and a background frequency associated with the surname.
14. The non-transitory computer-readable medium of claim 10, wherein the steps further comprise:
identifying one or more additional surnames that are significant to the genetic group;
outputting the one or more additional surnames.
15. The non-transitory computer-readable medium of claim 14, wherein the one or more additional surnames are each associated with a significance score to the genetic group and the one or more additional surnames are outputted in an order based on the significance scores.
16. The non-transitory computer-readable medium of claim 10, wherein the phased genetic data comprises a pair of haplotypes for the individual.
17. The non-transitory computer-readable medium of claim 16, wherein the set of related individuals are IBD matches based on the pair of haplotypes for the individual and haplotypes of the set of related individuals.
18. The non-transitory computer-readable medium of claim 10, wherein identifying a surname associated with the genetic group comprises:
accessing one or more of pedigrees of the set of related individuals, each pedigree comprising a genealogical graph of relatives for a member of the set of related individuals;
identifying a frequency of the surname in the genetic group and in the one or more of pedigrees; and
determining the surname is significant based on the frequency.
19. A computer system comprising:
one or more processors; and
a non-transitory computer readable storage medium storing instructions, when executed by one or more processors, causing the one or more processors to perform steps comprising:
receiving a genetic dataset of an individual;
identifying a set of related individuals who are related to the individual based on the genetic dataset, wherein the set of related individuals are identified through an Identity-By-Descent (IBD) estimation based on shared segments of DNA data between the genetic dataset of the individual and genetic datasets of the set of related individuals wherein the shared segments are from phased genetic data of the individual and the set of related individuals;
identifying a genetic group associated with the individual, the genetic group containing the set of related individuals;
identifying a surname associated with the genetic group, wherein the surname is determined to be a significant surname to the genetic group based on a frequency of the surname in the genetic group; and
outputting the surname.
20. The system of claim 19, wherein identifying a surname associated with the genetic group comprises steps:
accessing one or more of pedigrees of the set of related individuals, each pedigree comprising a genealogical graph of relatives for a member of the set of related individuals;
identifying a frequency of the surname in the genetic group and in the one or more of pedigrees; and
determining the surname is significant based on the frequency.
US17/184,451 2015-07-07 2021-02-24 Genetic and genealogical analysis for identification of birth location and surname information Pending US20210183474A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/184,451 US20210183474A1 (en) 2015-07-07 2021-02-24 Genetic and genealogical analysis for identification of birth location and surname information

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201562189422P 2015-07-07 2015-07-07
US15/203,776 US10957422B2 (en) 2015-07-07 2016-07-06 Genetic and genealogical analysis for identification of birth location and surname information
US17/184,451 US20210183474A1 (en) 2015-07-07 2021-02-24 Genetic and genealogical analysis for identification of birth location and surname information

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/203,776 Continuation US10957422B2 (en) 2015-07-07 2016-07-06 Genetic and genealogical analysis for identification of birth location and surname information

Publications (1)

Publication Number Publication Date
US20210183474A1 true US20210183474A1 (en) 2021-06-17

Family

ID=57685282

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/203,776 Active 2037-06-07 US10957422B2 (en) 2015-07-07 2016-07-06 Genetic and genealogical analysis for identification of birth location and surname information
US17/184,451 Pending US20210183474A1 (en) 2015-07-07 2021-02-24 Genetic and genealogical analysis for identification of birth location and surname information

Country Status (7)

Country Link
US (2) US10957422B2 (en)
EP (1) EP3320469A4 (en)
AU (2) AU2016290989A1 (en)
CA (1) CA2991230C (en)
MX (1) MX391111B (en)
NZ (1) NZ739413A (en)
WO (1) WO2017006284A1 (en)

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6277567B1 (en) * 1997-02-18 2001-08-21 Fitolink Corporation Methods for the construction of genealogical trees using Y chromosome polymorphisms
US6633819B2 (en) * 1999-04-15 2003-10-14 The Trustees Of Columbia University In The City Of New York Gene discovery through comparisons of networks of structural and functional relationships among known genes and proteins
US20030195707A1 (en) * 2000-05-25 2003-10-16 Schork Nicholas J Methods of dna marker-based genetic analysis using estimated haplotype frequencies and uses thereof
US7957907B2 (en) * 2001-03-30 2011-06-07 Sorenson Molecular Genealogy Foundation Method for molecular genealogical research
US6909971B2 (en) * 2001-06-08 2005-06-21 Licentia Oy Method for gene mapping from chromosome and phenotype data
US6886015B2 (en) * 2001-07-03 2005-04-26 Eastman Kodak Company Method and system for building a family tree
WO2003009210A1 (en) * 2001-07-18 2003-01-30 Gene Logic, Inc. Methods of providing customized gene annotation reports
US8855935B2 (en) * 2006-10-02 2014-10-07 Ancestry.Com Dna, Llc Method and system for displaying genetic and genealogical data
US20050147947A1 (en) * 2003-12-29 2005-07-07 Myfamily.Com, Inc. Genealogical investigation and documentation systems and methods
US7249129B2 (en) * 2003-12-29 2007-07-24 The Generations Network, Inc. Correlating genealogy records systems and methods
US8285486B2 (en) * 2006-01-18 2012-10-09 Dna Tribes Llc Methods of determining relative genetic likelihoods of an individual matching a population
US8700334B2 (en) * 2006-07-31 2014-04-15 International Business Machines Corporation Methods and systems for reconstructing genomic common ancestors
US8661048B2 (en) * 2007-03-05 2014-02-25 DNA: SI Labs, Inc. Crime investigation tool and method utilizing DNA evidence
US20080228700A1 (en) * 2007-03-16 2008-09-18 Expanse Networks, Inc. Attribute Combination Discovery
WO2009117122A2 (en) * 2008-03-19 2009-09-24 Existence Genetics Llc Genetic analysis
US20110093448A1 (en) * 2008-06-20 2011-04-21 Koninklijke Philips Electronics N.V. System method and computer program product for pedigree analysis
US8413188B2 (en) * 2009-02-20 2013-04-02 At&T Intellectual Property I, Lp System and method for processing image objects in video data
US8224821B2 (en) * 2009-07-28 2012-07-17 Ancestry.Com Operations Inc. Systems and methods for the organized distribution of related data
WO2011025400A1 (en) * 2009-08-30 2011-03-03 Cezary Dubnicki Structured analysis and organization of documents online and related methods
US8185557B2 (en) * 2010-01-27 2012-05-22 Ancestry.Com Operations Inc. Positioning of non-constrained amount of data in semblance of a tree
US8786603B2 (en) * 2011-02-25 2014-07-22 Ancestry.Com Operations Inc. Ancestor-to-ancestor relationship linking methods and systems
US9116882B1 (en) * 2012-08-02 2015-08-25 23Andme, Inc. Identification of matrilineal or patrilineal relatives
US20140067355A1 (en) * 2012-09-06 2014-03-06 Ancestry.Com Dna, Llc Using Haplotypes to Infer Ancestral Origins for Recently Admixed Individuals
WO2014145280A1 (en) * 2013-03-15 2014-09-18 Ancestry.Com Dna, Llc Family networks

Also Published As

Publication number Publication date
NZ739413A (en) 2018-01-26
EP3320469A1 (en) 2018-05-16
AU2021269307A1 (en) 2021-12-09
WO2017006284A1 (en) 2017-01-12
US20170011042A1 (en) 2017-01-12
US10957422B2 (en) 2021-03-23
AU2016290989A1 (en) 2018-02-22
EP3320469A4 (en) 2019-03-06
MX391111B (en) 2022-03-30
MX2018000293A (en) 2018-03-08
CA2991230C (en) 2023-09-26
CA2991230A1 (en) 2017-01-12

Similar Documents

Publication Publication Date Title
US20210183474A1 (en) Genetic and genealogical analysis for identification of birth location and surname information
US11515047B2 (en) Computer implemented identification of modifiable attributes associated with phenotypic predispositions in a genetics platform
Shah et al. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study
US20160092793A1 (en) Pharmacovigilance systems and methods utilizing cascading filters and machine learning models to classify and discern pharmaceutical trends from social media posts
Ahlgren et al. A nationwide survey of the prevalence of multiple sclerosis in immigrant populations of Sweden
Morgenstern et al. Perspective: Big data and machine learning could help advance nutritional epidemiology
US20140143346A1 (en) Identifying And Classifying Travelers Via Social Media Messages
Chan et al. Reproducible extraction of cross-lingual topics (rectr)
CN112819548A (en) User portrait generation method and device, readable storage medium and electronic equipment
Luyts et al. A Weibull-count approach for handling under- and overdispersed longitudinal/clustered data structures
CN113077312A (en) Hotel recommendation method, system, equipment and storage medium
US9122705B1 (en) Scoring hash functions
US20170075519A1 (en) Data Butler
Liu Examining Nonnormal Latent Variable Distributions for Non-Ignorable Missing Data
Long COVID-19 Real-Time Tracker and Analytical Study
Do et al. Importance bootstrap resampling for proportional hazards regression
Azevedo et al. Estimating hidden populations by transferring knowledge from geographically misaligned levels
Albert et al. Modelling batched Gaussian longitudinal weight data in mice subject to informative dropout

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

AS Assignment

Owner name: ANCESTRY.COM DNA, LLC, UTAH

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KERMANY, AMIR R.;GRANKA, JULIE M.;NOTO, KEITH D.;SIGNING DATES FROM 20170504 TO 20171002;REEL/FRAME:056104/0179

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: WILMINGTON TRUST, NATIONAL ASSOCIATION, AS NOTES COLLATERAL AGENT, DELAWARE

Free format text: PATENT SECURITY AGREEMENT;ASSIGNORS:ANCESTRY.COM DNA, LLC;ANCESTRY.COM OPERATIONS INC.;REEL/FRAME:058536/0278

Effective date: 20211217

Owner name: CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH, AS COLLATERAL AGENT, NEW YORK

Free format text: PATENT SECURITY AGREEMENT;ASSIGNORS:ANCESTRY.COM DNA, LLC;ANCESTRY.COM OPERATIONS INC.;REEL/FRAME:058536/0257

Effective date: 20211217

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED