US20210183474A1 - Genetic and genealogical analysis for identification of birth location and surname information - Google Patents
- Publication number
- US20210183474A1 (application No. US 17/184,451)
- Authority
- US
- United States
- Prior art keywords
- surname
- genetic
- individual
- frequency
- related individuals
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B10/00—ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
Definitions
- This description generally relates to population analyses on human genetic and genealogical information, and particularly to using that information to identify ancestral birth locations or ancestral surnames for an individual.
- Families may have genealogical pedigrees or family trees that may be verbally passed down from generation to generation. However, these genealogical family trees become inaccurate as they are passed along or may be missing the birth location or surnames of past ancestors altogether. Therefore, an individual often cannot rely on genealogical data provided by a family member to identify ancestral birth locations or surnames.
- Described embodiments identify likely birth locations and surnames of an individual's ancestors based on the individual's genotype, genotypes of a population of users who are genetic matches to the individual, and genealogical data (e.g. pedigree or family tree) of those matches. Note that no genealogical data for the individual is necessary for this identification to be performed.
- a birth location and surname identification system receives a genetic sample from the individual.
- the individual's genetic sample is sequenced and is analyzed to identify users in the system who are genetic matches to the individual. At least some of those genetically matched users will have an associated pedigree that identifies birth locations and/or surnames of their ancestors.
- a computer system determines the frequency of appearance of a birth location or surname amongst the pedigrees of the genetically matched users and further determines whether that frequency of appearance is of statistical significance.
- the system performs a statistical test to prevent recommending birth locations or surnames that may be disproportionally represented. If the frequency of appearance of the birth location or surname is deemed statistically significant, the system may present it to the individual as a recommended ancestral birth location or surname.
- FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment.
- FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic and genealogical data associated with users of the system in accordance with an embodiment.
- FIG. 3 is a flow diagram for the operation of the birth location and surname identification module, in accordance with an embodiment.
- FIG. 4 is a flow diagram for identifying ancestral birthplace or surname identifications for an individual, in accordance with an embodiment.
- FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment. Depicted in FIG. 1 are an individual 101 (i.e. a human or other organism), a DNA extraction service 102 , a birth location and surname identification system 100 , a network 120 , and a client device 160 .
- an individual 101 uses a sample collection kit to provide a DNA sample, e.g., saliva, from which genetic data can be reliably extracted according to conventional DNA processing techniques.
- DNA extraction service 102 receives the sample and estimates genotypes from the genetic data, for example by extracting the DNA from the sample and identifying genotype values of single nucleotide polymorphisms (SNPs) present within the DNA. The result in this example is a diploid genotype for each SNP.
- the birth location and surname identification system 100 receives the genetic data from DNA extraction service 102 and stores the genetic data in a DNA sample store 140 containing DNA diploid genotypes.
- the genetic data stored in the DNA sample store 140 may be associated with the individual 101 in the user data store 145 via one or more pointers.
- Identifying ancestral birth locations or surnames that may be associated with a given individual involves analyzing genealogical information of other individuals that are genetic matches with the individual. To determine the genetic matches, analysis of identity-by-descent (IBD) is used. IBD analysis can be used to identify the familial relationship between any two people (e.g., second cousins) in a population as long as the relationship is due to shared common ancestors from the recent past (e.g., on the order of several hundred years). To date, IBD analysis has not been successfully used to accurately identify ancestral birth locations or surnames from an individual's genetic data.
- the birth location and surname identification system 100 includes an input data processing module 110 that processes the DNA to identify shared segments of DNA data between the individual 101 and a number of other users whose DNA is already stored by the system.
- An IBD estimation module 115 uses the shared segments of DNA to identify those other users, known in the user data store 145 and having genetic data stored in the DNA sample store 140, who are matches to the individual.
- the birth location and surname identification module 300 uses the match information to access genealogical data of the individual's 101 genetic matches in order to identify possible surnames and birth locations for the ancestors of the individual 101 .
- the breakdown of the logical functions of the system 100 into the above-introduced modules is for clarity of description only.
- the computer system 100 may comprise more or fewer modules, and the logical structure may be differently organized.
- the data stores may be represented in different ways in different embodiments, such as comma-separated text files, or as databases such as relational databases (SQL) or non-relational databases (NoSQL).
- the network 120 facilitates communications amongst one or more client devices 160 and the system 100 .
- the network 120 may be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet.
- the network 120 uses standard communication technologies and/or protocols. Examples of technologies used by the network 120 include Ethernet, 802.11, 3G, 4G, 802.16, or any other suitable communication technology.
- the network 120 may use wireless, wired, or a combination of wireless and wired communication technologies. Examples of protocols used by the network 120 include transmission control protocol/Internet protocol (TCP/IP), hypertext transfer protocol (HTTP), simple mail transfer protocol (SMTP), file transfer protocol (FTP), or any other suitable communication protocol.
- a client device 160 is a computing device capable of transmitting and/or receiving data via the network 120 .
- the client device 160 belongs to the individual 101 that provided the genetic sample.
- client devices 160 include desktop computers, laptop computers, tablet computers (pads), mobile phones, personal digital assistants (PDAs), gaming devices, or any other electronic device including computing functionality and data communication capabilities.
- the client device 160 may use a web browser 180 , such as Microsoft Internet Explorer, Mozilla Firefox, Google Chrome, Apple Safari and/or Opera, as an interface to connect with the network 120 . Additionally or alternatively, specialized application software 180 that runs natively on a mobile device may be used as an interface to connect to the network 120 .
- the birth location and surname identification system 100 sends, through the network 120 , a list of surnames or birthplaces to the client device 160 identified by system 100 for presentation to the individual 101 .
- the list of surnames or birthplaces may be presented by the client device 160 on a user interface or a display screen.
- a new user to the system 100 who is submitting their DNA among other data will activate a new account, often through graphical user interface (GUI) provided through a mobile software application or a web-based interface.
- the system 100 receives one or more types of basic personal information about the individual 101 such as age, date of birth, geographical location of birth (e.g., city, state, county, country, hospital, etc.), complete name including first, middle, and last names as well as any suffixes, and gender.
- This received user information is stored in the user data store 145 , in association with the corresponding DNA samples stored in the DNA sample store 140 .
- the computing system 100 comprises an input data processing module 110 , and an IBD estimation module 115 . These modules are described in relation with FIG. 2 which is a flow diagram for the operation of the computer system 100 for estimating and storing estimated IBD in accordance with an embodiment.
- FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic, genealogical and survey input data.
- the input data processing module 110 is responsible for receiving, storing and processing data received from an individual 101 via the DNA extraction service 102 .
- the input data processing module 110 includes a DNA collection module 210 , a genealogical collection module 220 , genotype identification module 240 , and a genotype phasing module 250 .
- the DNA collection module 210 is responsible for receiving sample data from external sources (e.g., extraction service 102 ), processing and storing the samples in the DNA sample store 140 .
- the DNA sample store 140 may store one or more received DNA samples linked to a user as a <key, value> pair associated with the individual 101 .
- In one example, the <key, value> pair is <sampleID, “GA TC TC AA”>.
- the data stored in the DNA sample store 140 may be identified by one or more keys used to index one or more values associated with an individual 101 .
- Examples of keys are a userID and a sampleID; alternatively, another <key, value> pair is <userID, sampleID>.
- the DNA sample store 140 stores a pointer to a location associated with the user data store 145 associated with the individual 101 .
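- The key-value layout described above can be illustrated with a minimal in-memory sketch. The dictionary names, the `store_sample` and `samples_for_user` helpers, and the identifier strings are hypothetical and stand in for whatever database the system actually uses.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the DNA sample store 140 and a
# user-to-sample index; all names here are illustrative assumptions.
dna_sample_store: Dict[str, str] = {}      # <sampleID, genotype string>
user_samples: Dict[str, List[str]] = {}    # <userID, [sampleID, ...]>


def store_sample(user_id: str, sample_id: str, genotypes: str) -> None:
    """Store one sample and link it to the user who provided it."""
    dna_sample_store[sample_id] = genotypes
    user_samples.setdefault(user_id, []).append(sample_id)


def samples_for_user(user_id: str) -> List[str]:
    """Return the stored genotype strings for every sample of a user."""
    return [dna_sample_store[s] for s in user_samples.get(user_id, [])]


store_sample("user-1", "sample-42", "GA TC TC AA")
```

Keeping the <userID, sampleID> index separate from the <sampleID, genotypes> mapping lets one user hold several samples without duplicating genotype data.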
- the user data store 145 will be further described below.
- the genealogical collection module 220 both receives and processes genealogical data and stores the data in the user data store 145 . This data may be received for the individual 101 , and may have been received in the past for other users of the system, some of whom may be determined to be genetic matches to the individual 101 .
- the genealogical data may include a variety of different types of information.
- the genealogical data can take the form of a pedigree of a user (e.g., the recorded relationships in a family).
- the genealogical collection module 220 may be configured to provide an interactive GUI that asks the user questions or provides a menu of options, and receives user input that can be processed to obtain the genealogical data.
- Examples of genealogical data that may be collected include, but are not limited to, names (first, last, middle, suffixes), birth locations (e.g., county, city, state, country, hospital, global map coordinates), date of birth, date of death, marriage information, family relations (manually provided rather than genetically identified), etc.
- the pedigree information associated with a user may include a genealogical graph.
- the genealogical graph may include one or more specified nodes. Each specified node in the genealogical graph represents either the user or an ancestor of the user that could have passed down genetic material to the user.
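- One way to picture such a genealogical graph is as nodes that each point to their parents. The sketch below is an assumption for illustration only: the `PedigreeNode` class, its fields, and the traversal helpers are hypothetical names, and excluding the user's own node from the result is a design choice made here, not something the description specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class PedigreeNode:
    """One node of a genealogical graph: the user or one of their ancestors."""
    name: str
    surname: Optional[str] = None
    birth_location: Optional[str] = None
    parents: List["PedigreeNode"] = field(default_factory=list)


def ancestors(node: PedigreeNode) -> List[PedigreeNode]:
    """Collect every ancestor reachable from `node` (the node itself excluded)."""
    found: List[PedigreeNode] = []
    seen: Set[int] = set()
    stack = list(node.parents)
    while stack:
        current = stack.pop()
        if id(current) in seen:  # an ancestor may be reachable by two paths
            continue
        seen.add(id(current))
        found.append(current)
        stack.extend(current.parents)
    return found


def birth_locations(node: PedigreeNode) -> Set[str]:
    """Distinct recorded birth locations among a user's ancestors."""
    return {a.birth_location for a in ancestors(node) if a.birth_location}
```

Ancestors with no recorded birth location are simply skipped, which reflects that user-provided pedigrees may be incomplete.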
- the pedigree information provided by users may or may not be accurate or complete.
- the genealogical collection module 220 is responsible for filtering the received pedigree data based on one or more quality criteria in an effort to discard lower quality genealogical data. For example, the genealogical collection module 220 may filter the received pedigree data by excluding all pedigree nodes associated with a stored DNA sample that do not satisfy all of the following criteria: (1) the recorded death date for the linked pedigree node corresponds to official records (when available); (2) the gender is the same as the gender provided by the user; and (3) the birth date is within 3 years of the birth date provided by the user. The user may be prompted via GUI to resolve any discrepancies identified by module 220 . In some embodiments, all received genealogical data marked as “private” are excluded from any subsequent analysis to ensure that any privacy requirements imposed on the data are met.
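- The three quality criteria and the privacy exclusion can be sketched as a single predicate. The `LinkedNode` record layout and the `passes_quality_filter` name are hypothetical, and treating "within 3 years" as a day count is an approximation chosen for this sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LinkedNode:
    """Pedigree node linked to a DNA sample (hypothetical record layout)."""
    gender: str
    birth_date: date
    death_date: Optional[date] = None
    official_death_date: Optional[date] = None  # None when no record is available
    private: bool = False


def passes_quality_filter(node: LinkedNode, user_gender: str,
                          user_birth_date: date, max_years: int = 3) -> bool:
    """Return True only if the node meets all three criteria and is not private."""
    if node.private:
        return False
    # (1) Death date must agree with official records when a record exists.
    if (node.official_death_date is not None
            and node.death_date != node.official_death_date):
        return False
    # (2) Gender must match what the user reported.
    if node.gender != user_gender:
        return False
    # (3) Birth dates must be within `max_years` of each other (in days).
    return abs((node.birth_date - user_birth_date).days) <= max_years * 365.25
```

A node that fails the predicate would be excluded from later frequency calculations, or flagged so the user can resolve the discrepancy.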
- the genotype identification module 240 accesses the collected DNA data from the DNA collection module 210 or the sample store 140 and identifies autosomal SNPs so that the individual's diploid genotype on autosomal chromosomes can be computationally phased.
- the genotype identification module 240 provides the identified SNPs to the genotype phasing module 250 which phases the individual's diploid genotype based on the set of identified SNPs.
- the genotype phasing module 250 generates a pair of estimated haplotypes for each diploid genotype.
- the estimated haplotypes are then stored in the user data store 145 in association with the individual 101 , and may also be stored in association with or verified against the genotypes of the individual's parents, who may also have their own separate accounts in the system 100 .
- a variety of different computational phasing techniques may be used including, for example, the techniques described in U.S. Patent Application No. 2016/061,568, filed on Jan. 17, 2014, which is hereby incorporated by reference in its entirety.
- the phasing module 250 stores phased genotypes in the user data store 145 .
- the IBD estimation module 115 is responsible for identifying IBD segments (also referred to as IBD estimates) from phased genotype data (haplotypes) between the individual 101 and a user stored in the user data store 145 .
- IBD segments are chromosome segments identified in the individual 101 and a user that are putatively inherited from a recent common ancestor.
- an individual 101 and a user who are closely related share a relatively large number of IBD segments, and the IBD segments tend to have greater length (individually or in aggregate across one or more chromosomes).
- an individual 101 and a user who are more distantly related share relatively few IBD segments, and these segments tend to be shorter (individually or in aggregate across one or more chromosomes).
- the IBD estimation algorithm used by the IBD estimation module 115 to estimate (or infer) IBD segments between an individual 101 and a user is as described in U.S. patent application Ser. No. 14/029,765, filed on Sep. 17, 2013, which is hereby incorporated by reference in its entirety. Another further processing step may be performed on these inferred IBD segments by applying the technique described in PCT Patent Application No. PCT/US2015/055579, filed on Oct. 14, 2015, which is hereby incorporated by reference in its entirety.
- the identified IBD segments are stored in the user data store 145 in association with the individual 101 .
- the IBD estimation module 115 is configured to estimate IBD segments between the individual 101 and large numbers of users stored in the user data store 145 .
- the computing system has been optimized to efficiently handle large amounts of IBD data. Said another way, IBD is estimated across a large number of individuals based on their DNA.
- the IBD estimation module 115 (and computing system 100 generally) distributes IBD computations over a Hadoop computing cluster, internal to or external from computing system 100 , and stores the phased genotypes used in the IBD computations in a database so that IBD estimates for new accounts/individuals can be quickly compared to previously processed individuals.
- FIG. 3 depicts the birth location and surname identification module 300 , in accordance with an embodiment.
- the birth location and surname identification module 300 includes a genetic match module 305 , a location frequency calculator 310 , a surname frequency calculator 315 , a statistical analysis module 320 , an enrichment score module 325 , a list generation module 330 , a location frequency store 350 , a surname frequency store 360 , a location score store 380 , and a surname score store 390 .
- the genetic match module 305 retrieves the IBD estimates between the individual 101 and the users in the user data store 145 and determines whether the individual 101 and any given other user are a genetic match.
- the individual 101 and a user are a match if they have higher than a threshold amount of IBD segment sharing, as determined by the IBD estimation module 115 .
- a match may indicate that the individual 101 and the user are related (e.g. parent/child, sibling, aunt/uncle, first cousin, first cousin once removed, second cousin, second cousin once removed).
- the genetic match module 305 identifies all users in the user data store 145 that are considered matches to the individual 101 .
- the set of user matches is referred to herein by $M_u$. In an example, the number of matches is limited to the top 3000 matches (i.e.
- the genetic match module 305 provides the set of user genetic matches, $M_u$, to the location frequency calculator 310 to determine an identification of possible birth locations associated with the user.
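- The thresholding step that produces the set of matches can be sketched as follows. This is a simplified illustration, not the referenced IBD algorithm: each IBD segment is reduced to a length, the unit (centimorgans), the default threshold of 20, and the `find_matches` name are all assumptions, and `top_n` is a hypothetical parameter for capping the number of matches.

```python
from typing import Dict, List, Optional, Tuple

# An IBD segment is represented here only by its length; the unit
# (centimorgans) and the threshold value are assumptions for illustration.
Segments = List[float]


def find_matches(ibd: Dict[str, Segments], min_total: float = 20.0,
                 top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Return (user_id, total shared length) for users above the threshold,
    strongest matches first, optionally capped at `top_n`."""
    totals = {user: sum(segs) for user, segs in ibd.items()}
    matches = [(u, t) for u, t in totals.items() if t > min_total]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches if top_n is None else matches[:top_n]
```

Sorting by total shared length before capping keeps the closest relatives when the match list is truncated.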
- the location frequency calculator 310 determines how frequently a particular birth location appears amongst the pedigrees of the users within the set of user matches, $M_u$. To do this, the location frequency calculator 310 retrieves, for each matching user in the set of user matches $M_u$, the matching user's pedigree.
- the pedigree includes a genealogical graph from the user data store 145 .
- a genealogical graph in the pedigree may be the matching user's family tree that describes the relationship between the matching user and each of the matching user's relatives.
- Each relative in the matching user's pedigree has associated genealogical data such as the relative's birth location.
- $T_v$ denotes a set of birth locations indicated in matching user $v$'s pedigree.
- the location frequency calculator 310 identifies the set of birth locations, $T_v$, in the matching user's pedigree.
- the location frequency calculator 310 may identify a matching user $v$ as having 10 relatives born in New York City, 2 relatives born in Boston, and 2 relatives born in Los Angeles. Therefore, the elements in $T_v$ include New York City, Boston, and Los Angeles.
- a presence indicator $a_{v,i}$ may be represented by the indicator function representing whether a birth location $i$ is indicated in matching user $v$'s pedigree:
- $$a_{v,i} = \begin{cases} 1 & \text{if } i \in T_v \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
- In this example, matching user $v$ has an indicator function score of 1 for birth locations of New York City, Boston, and Los Angeles. All other birth locations (e.g. Washington D.C., elsewhere) would have a presence indicator and corresponding indicator function score of 0.
- the location frequency calculator 310 repeats this process for each matching user in the set of matches $M_u$.
- the total number of pedigrees ($m_i$) of users in the set of matches that have this birth location is determined according to:
- $$m_i = \sum_{v \in M_u} a_{v,i} \qquad (2)$$
- the location frequency calculator 310 summates the indicator function score for each birth location.
- In an example with 1000 users in the set of matches, the maximum number of pedigrees ($\max(m_i)$) that have the birth location is 1000, which would occur if every user in the set of matches $M_u$ has the birth location in their pedigree.
- the location frequency calculator 310 uses the total number of pedigrees, $m_i$, to determine $p_i$, the match frequency of a birth location $i$, where $p_i$ is determined according to:
- $$p_i = \frac{m_i}{|M_u|} \qquad (3)$$
- the match frequency of a birth location represents how often matching users in the set of matches $M_u$ are associated with the birth location, which can be used as a way of determining an association between the ancestors of the individual 101 and the birth location. This match frequency is stored in the location frequency store 350 .
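- The match frequency calculation can be sketched in a few lines. The function name `match_frequencies` is hypothetical, and counting matching users with empty pedigrees in the denominator is an assumption made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def match_frequencies(pedigree_locations: Iterable[Set[str]]) -> Dict[str, float]:
    """Given the location set of each matching user, return the match
    frequency of every location that appears at least once."""
    sets = list(pedigree_locations)
    if not sets:
        return {}
    # Each pedigree contributes at most 1 per location (the presence indicator),
    # no matter how many relatives were born there.
    m = Counter(loc for t_v in sets for loc in t_v)
    return {loc: count / len(sets) for loc, count in m.items()}
```

Because each pedigree is reduced to a set first, a matching user with 10 relatives born in New York City counts once for that location, in line with the presence indicator.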
- the location frequency calculator 310 also calculates a background frequency for each birth location i.
- the background frequency of a birth location provides an indication as to how often the birth location appears amongst the greater population of users stored in the system, including those who are not matches to the individual. For example, high population cities such as New York City or Boston may have higher background frequencies than smaller cities such as Cheyenne, Wyo.
- $D$ represents the total set of users in the system. Generally the number of users in set $D$ is significantly larger (e.g. multiple orders of magnitude larger) than the number of matching users in the set of matches $M_u$. Each user in $D$ may have a corresponding pedigree. Altogether, this forms the set of all pedigrees stored in the user data store 145 .
- the location frequency calculator 310 may use a similar indicator function as was previously shown in equation (1) to calculate whether a birth location, $i$, exists in the pedigree corresponding to user $w$ in the set $D$:
- $$a_{w,i} = \begin{cases} 1 & \text{if birth location } i \text{ exists in user } w\text{'s pedigree} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
- the location frequency calculator 310 summates the total number of pedigrees that each have the birth location, each pedigree corresponding to a user $w$ in the set $D$. To calculate the background frequency, the location frequency calculator 310 divides the summated total number of users in the set $D$ that have the birth location by the total number of users in the set $D$. Therefore, the background frequency of a birth location, $i$, is expressed in equation (5) as
- $$q_i = \frac{1}{|D|} \sum_{w \in D} a_{w,i} \qquad (5)$$
- This background frequency is stored in the location frequency store 350 .
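- Because the full user set is much larger than the set of matches, a background frequency computation would plausibly stream over pedigrees rather than hold them all in memory. The sketch below assumes each pedigree has already been reduced to a set of birth locations; the `background_frequencies` name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def background_frequencies(all_pedigrees: Iterable[Set[str]]) -> Dict[str, float]:
    """Stream over the location set of every user in the system and return
    the background frequency of each location."""
    counts: Counter = Counter()
    total_users = 0
    for locations in all_pedigrees:  # single pass; works with a generator
        total_users += 1
        counts.update(locations)     # a set adds at most 1 per location
    if total_users == 0:
        return {}
    return {loc: n / total_users for loc, n in counts.items()}
```

Since the background frequencies do not depend on the individual under consideration, they could be computed once and reused across individuals.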
- the surname frequency calculator 315 calculates a match frequency and background frequency for each surname in the pedigrees of matching users in a similar fashion as was discussed for birth locations in section III.b.
- the surname frequency calculator 315 receives the set of user matches, $M_u$, from the genetic match module 305 for an individual 101 and determines a match frequency, $p_j$, that represents how often a given surname, $j$ (e.g. Bradley, O'Malley, Johnson), appears amongst the pedigrees of users, $v$, in the set of user matches, $M_u$.
- the surname frequency calculator 315 also calculates the background frequency, $q_j$, for the surname (e.g. “Bradley”) in the total set of users in the system, $D$.
- the match frequency, $p_j$, and background frequency, $q_j$, may each be stored in the surname frequency store 360 .
- the surname frequency calculator 315 may first normalize surnames, meaning that the surname frequency calculator 315 may consider many alternate spellings as being the same surname for purposes of frequency calculations. Examples of such alternate spellings may include use of characters not used in English (e.g., “o” versus the same letter with a diacritic); capitalization, punctuation, and spacing (“O'Malley” versus “Omalley”); suffixes (“Jr.”); and commentary (“Johnson (WWII Veteran)”). In one example, a simple normalization is performed that ignores capitalization and punctuation, and removes commentary, thereby reducing the set of surnames under consideration. Alternate spellings and misspellings that this normalization does not resolve may be interpreted by the surname frequency calculator 315 as a different surname.
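- A simple surname normalization along these lines might look like the following sketch. The `normalize_surname` name and the specific suffix list are assumptions; folding diacritics and stripping suffixes goes slightly beyond the simple normalization described above and is included only to illustrate the other listed cases.

```python
import re
import unicodedata


def normalize_surname(raw: str) -> str:
    """Reduce a recorded surname to a canonical key for frequency counting."""
    name = re.sub(r"\(.*?\)", " ", raw)           # drop parenthesized commentary
    name = unicodedata.normalize("NFKD", name)    # split letters from diacritics
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Remove a few common generational suffixes (illustrative list only).
    name = re.sub(r"\b(jr|sr|ii|iii|iv)\b\.?", " ", name, flags=re.IGNORECASE)
    name = re.sub(r"[^A-Za-z]", "", name)         # punctuation and spacing
    return name.lower()
```

Letters that have no Unicode decomposition into a base letter plus combining mark are dropped by the final step, so a production normalizer would likely need an explicit transliteration table; genuine misspellings are also left unresolved.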
- the statistical analysis module 320 identifies which birth locations and surnames are sufficiently notable for the individual 101 under consideration so as to merit possibly providing to the individual 101 as likely being associated with their own ancestors.
- For example, there may be a total of 1000 users in the set of matches, $M_u$.
- a match frequency, $p_i$, of 10% may appear to be a very high number of appearances for a birth location.
- However, if the background frequency, $q_i$, is also close to 10%, meaning that the birth location appears approximately equally frequently in the pedigrees of all users in the system, then a match frequency, $p_i$, of 10% may not be sufficiently notable to be worth identifying as associated with the individual.
- the statistical analysis module 320 receives the match frequency, $p_i$, and background frequency, $q_i$, for all different birth locations, $i$, from the location frequency calculator 310 .
- the statistical analysis module 320 conducts a statistical analysis test to determine whether the match frequency of a given birth location is sufficiently notable. For each birth location, $i$, the statistical analysis module 320 determines the likelihood of observing the received match frequency, $p_i$, and background frequency, $q_i$, under a null hypothesis $H_0$ scenario.
- the alternative hypothesis $H_1$ is the assumed scenario where the match frequency and background frequency are non-equal (i.e. $p_i \neq q_i$), with the assumption that if, particularly, $p_i > q_i$, then $p_i$, and thus $i$, may be statistically significant and therefore worth possibly providing to the individual.
- the statistical analysis module 320 determines the likelihood of observing the received match frequency and background frequency under the assumption that the match frequency, $p_i$, and background frequency, $q_i$, are equal. However, if the received match frequency is sufficiently larger than the received background frequency, then the null hypothesis $H_0$ is rejected in favor of the alternative hypothesis $H_1$. What constitutes a sufficient difference between the received match frequency, $p_i$, and background frequency, $q_i$, will be discussed further below in regards to the summary statistic $S_i$.
- a similar calculation may be performed for surnames by receiving the match frequency, $p_j$, and background frequency, $q_j$, for all surname identifications, $j$, from the surname frequency calculator 315 .
- the subsequent discussion focuses on conducting a statistical test for a birth location, $i$. This discussion applies equally to conducting a statistical test for a surname.
- the statistical test is performed under a null hypothesis $H_0$, the assumed scenario where the match frequency, $p_i$, and the background frequency, $q_i$, are equal.
- the statistical analysis module 320 conducts a maximum likelihood ratio test.
- the statistical analysis test may be a Pearson's chi-squared test, a Z-test, or an F-test.
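- the test statistic, Λ, for the maximum likelihood ratio test is determined according to Λ = L(m i | H 0 ) / L(m i | H 1 ), where L(m i | H 0 ) denotes the likelihood of observing m i under the null hypothesis and L(m i | H 1 ) denotes the likelihood of observing m i under the alternative hypothesis when varying p between 0 and 1.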
- the test statistic is a ratio between a first likelihood of observing the match frequency and background frequency under the null hypothesis and a second likelihood of observing the match frequency and background frequency under the alternative hypothesis.
- a summary statistic, S i , is determined using Λ according to:
- the statistical analysis module 320 calculates a summary statistic for each birth location i. Note that if the match frequency, p i , and background frequency, q i , for a birth location received from the location frequency calculator 310 are equal, then the value of the summary statistic is zero. Additionally, the summary statistic S i increases in magnitude as the difference between the match frequency, p i , and background frequency, q i , increases in magnitude.
- the statistical analysis module 320 may calculate the p-value for rejecting the null hypothesis H 0 based on the first order chi-squared distribution of the summary statistic, S i .
- the null hypothesis is rejected if S i >4 (or 2S i >8) based on the first order chi-squared distribution.
- if the match frequency is sufficiently larger than the background frequency for a particular birthplace or surname such that the summary statistic S i >4, then the alternative hypothesis (i.e. where the match frequency does not equal the background frequency) is accepted. This indicates that the particular birthplace or surname is sufficiently notable to be associated with the ancestors of the individual 101 .
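As a rough illustration of the likelihood ratio test described above, the following Python sketch makes two assumptions that the text does not state explicitly: the count m i of matching pedigrees containing birth location i (out of n matching pedigrees) follows a binomial likelihood, and the summary statistic is the negative log of the likelihood ratio, so that 2S i can be compared against a first order chi-squared distribution. The function names are hypothetical.

```python
import math


def binom_log_likelihood(m: int, n: int, p: float) -> float:
    """Log-likelihood of m successes in n trials with success probability p.

    The binomial coefficient is omitted because it cancels in the ratio.
    """
    if p <= 0.0:
        return 0.0 if m == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if m == n else -math.inf
    return m * math.log(p) + (n - m) * math.log(1.0 - p)


def summary_statistic(m: int, n: int, q: float) -> float:
    """Assumed form S = -log(L(m | H0) / L(m | H1)).

    H0 fixes p = q (the background frequency); H1 uses the
    maximum-likelihood estimate p = m / n over p in [0, 1].
    """
    p_hat = m / n
    return binom_log_likelihood(m, n, p_hat) - binom_log_likelihood(m, n, q)


def is_notable(m: int, n: int, q: float, threshold: float = 4.0) -> bool:
    """Flag the location only if S exceeds the threshold and p is above q."""
    return (m / n) > q and summary_statistic(m, n, q) > threshold
```

Under these assumptions, S is zero when the match frequency equals the background frequency and grows as the two diverge, which matches the behavior described for the summary statistic.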
- the exact value of the significance level may vary by implementation, or according to more specific factors. Also, although the above embodiment describes the significance level as being a p-value, in practice it may be any threshold which determines whether or not a particular birth location i or surname j is sufficiently statistically significant to merit consideration for providing to the individuals.
- the statistical analysis module 320 may adjust the significance level (e.g., p-value) for a birth location i, based on the country of origin of the birth location.
- the country of origin of the birth location i is determined based on the latitude and longitude coordinates associated with the birth location. More specifically, the particular significance level for a country of origin is chosen based on the number of users in the database associated with that country in their respective pedigrees and the number of matches that a given individual has in the database that are annotated with a pedigree attached to them.
- a birth location that derives from a country having a large number of users associated with that country may utilize a relatively high significance level (e.g., 0.995), whereas a birth location that derives from a country having relatively few users associated with that country (e.g., Mexico, Russia, Eastern Europe) may utilize a relatively lower significance level (e.g., 0.9).
- the threshold needed to determine whether the difference between the birth location match frequency versus the corresponding background frequency is statistically significant may vary. This allows the module 300 to better take into account the relative availability of data regarding a particular country in determining whether or not particular birth locations are statistically significant.
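One way to picture a country-dependent threshold is sketched below in Python. It assumes the significance level is read as a confidence level (one minus the p-value cutoff) and that 2S i follows a first order chi-squared distribution under the null hypothesis; both readings are assumptions. The lookup table and function names are hypothetical, and only the example countries and levels come from the text.

```python
from statistics import NormalDist


def chi2_df1_critical_value(level: float) -> float:
    """Critical value c with P(X <= c) = level for chi-squared X with 1 degree of freedom.

    Uses the identity X = Z**2 for a standard normal Z, which gives
    c = (inverse_normal_cdf((1 + level) / 2)) ** 2.
    """
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + level) / 2.0)
    return z * z


# Hypothetical per-country levels; countries with fewer users get a lower level.
LEVEL_BY_COUNTRY = {"Mexico": 0.9, "Russia": 0.9}
DEFAULT_LEVEL = 0.995


def s_threshold(country: str) -> float:
    """Threshold on S, assuming 2 * S is chi-squared distributed under H0."""
    level = LEVEL_BY_COUNTRY.get(country, DEFAULT_LEVEL)
    return chi2_df1_critical_value(level) / 2.0
```

Under this reading, a level of 0.995 yields a threshold on S slightly below 4, while a level of 0.9 yields a noticeably smaller threshold, so sparsely represented countries need less evidence to pass.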
- the statistical analysis module 320 determines which birth locations and surnames are statistically significant given the information known about them from the underlying pedigree data from users genetically matched to an individual 101 .
- the enrichment score module 325 uses this binary determination of statistical significance to determine an enrichment score representing a strength of association between the birth location or surname and the ancestors of the individual 101 .
- the enrichment score module 325 determines an enrichment score, x i , for each birth location, i, or enrichment score, x j , for each surname, j.
- the enrichment score module 325 receives the summary statistic, S i , for each birth location or the summary statistic, S j , for each surname. Additionally, the enrichment score module 325 receives the match frequency, p i or p j , and the background frequency, q i or q j , for birth locations and surnames.
- the enrichment score module 325 calculates the enrichment score to be:
- the exact form of the calculation may vary in practice; in particular, the significance level may vary by country as described above. Note that if the match frequency, p i , and the background frequency, q i , are not significantly different, then the enrichment score is close to zero, indicating that the particular birth location may not be very relevant to ancestors of the individual 101 . Scaling the match frequency by a factor of log(p/q) reduces bias towards highly popular birth locations and surnames because they are likely to have a high background frequency (high q) as well, thereby reducing the enrichment score.
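The enrichment equation itself is not reproduced in this text, so the following Python sketch is only a guess at its shape based on the surrounding description: the match frequency scaled by log(p/q), gated by the binary significance decision. The function name, the gating rule, and the default threshold are assumptions.

```python
import math


def enrichment_score(p: float, q: float, s: float, threshold: float = 4.0) -> float:
    """Assumed form: x = p * log(p / q) when s exceeds the threshold, else 0.

    p is the match frequency, q the background frequency, and s the
    summary statistic from the statistical test.
    """
    if s <= threshold or p <= 0.0 or q <= 0.0:
        return 0.0
    return p * math.log(p / q)
```

With this form, a location that is common everywhere (high q) is penalized by the log term even when its match frequency is high, which is the effect the paragraph above describes.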
- the enrichment score module 325 calculates the enrichment score for a surname, j, in the same manner according to equation 10. In another embodiment, the enrichment score module 325 calculates the enrichment score for a surname, j, as
- the respective enrichment scores for a birth location and surname are stored in the location score store 380 and surname score store 390 , respectively.
- the enrichment score module 325 provides the enrichment score associated with each birth location or surname to the list generation module 330 .
- the list generation module 330 may rank and/or provide one or more birth locations or surnames to the individual through client device 160 based on their associated enrichment scores. Exactly how the list generation module 330 provides birth locations and surnames, how many, and in what form (e.g., lists, etc.) may vary by implementation.
- the list generation module 330 may set a minimum threshold in order for a particular birth location or surname to be recommended. For example, a birth location or surname may have to meet a certain minimum threshold enrichment score, and/or must appear in at least some number of pedigrees (e.g., m i ≥ 3) for it to be recommended.
- the list generation module 330 generates a list 370 including only the top N birth locations or surnames by enrichment score.
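The filtering and top-N selection described above could look roughly like the Python sketch below. The `Candidate` type, the parameter names, and the default values are hypothetical; the text leaves the exact thresholds and list format to the implementation.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    name: str            # birth location or surname
    score: float         # enrichment score
    pedigree_count: int  # number of matching pedigrees containing it


def top_candidates(
    candidates: Iterable[Candidate],
    n: int = 10,
    min_score: float = 0.0,
    min_pedigrees: int = 3,
) -> List[Candidate]:
    """Keep candidates that pass both minimums, then return the n best by score."""
    eligible = [
        c for c in candidates
        if c.score > min_score and c.pedigree_count >= min_pedigrees
    ]
    return sorted(eligible, key=lambda c: c.score, reverse=True)[:n]
```

Applying the minimums before ranking means a rarely observed location cannot enter the list on the strength of a single pedigree, however high its score.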
- the list 370 is sent through the network 120 to the client device 160 for consumption by the individual 101 .
- FIG. 4 illustrates a process of providing an identified birth location or surname to an individual, according to one embodiment.
- the birth location and surname identification system 100 receives 405 a sequence of genetic data from an individual 101 .
- the system 100 identifies users in the system 100 that are genetic matches with the individual 101 . This may be accomplished by identifying DNA segment matches on the pair of haplotypes for the individual and the pair of haplotypes for users retrieved from the user data store 145 .
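As a minimal sketch of match selection, the Python below assumes that each user already has a single number summarizing IBD sharing with the individual, that a match is declared when this number exceeds a threshold, and that only the strongest matches are retained. The units, the default threshold, and the function name are assumptions made for illustration.

```python
from typing import Dict, List


def genetic_matches(
    shared_ibd: Dict[str, float],
    min_shared: float = 20.0,
    max_matches: int = 3000,
) -> List[str]:
    """Return user ids whose IBD sharing exceeds min_shared, strongest first.

    shared_ibd maps a user id to the total amount of IBD shared with the
    individual (units are an assumption, e.g. centimorgans).
    """
    above = [(uid, amount) for uid, amount in shared_ibd.items() if amount > min_shared]
    above.sort(key=lambda item: item[1], reverse=True)
    return [uid for uid, _ in above[:max_matches]]
```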
- Each user in the system 100 has a corresponding pedigree stored in the user data store 145 . From the set of all pedigrees in the user data store 145 , a subset of matching pedigrees is identified. Each pedigree in the subset of matching pedigrees is associated with a user that is a genetic match of the individual 101 . The system 100 determines 420 a match frequency, p, of the birth location or surname amongst the subset of matching pedigrees. Additionally, the system 100 determines 425 a background frequency, q, of the birth location or surname amongst the set of all pedigrees.
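The match frequency p and background frequency q can be computed with the same routine applied to two different sets of pedigrees, as in the Python sketch below. Each pedigree is reduced to the set of values (birth locations or surnames) it contains; the toy data, user ids, and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def pedigree_frequencies(pedigrees: Iterable[Set[str]]) -> Dict[str, float]:
    """Fraction of pedigrees in which each value appears at least once."""
    pedigree_list = [set(p) for p in pedigrees]
    if not pedigree_list:
        return {}
    counts = Counter(value for pedigree in pedigree_list for value in pedigree)
    return {value: count / len(pedigree_list) for value, count in counts.items()}


# Hypothetical toy data: user id -> birth locations found in that user's pedigree.
all_pedigrees = {
    "u1": {"Boston", "New York City"},
    "u2": {"Boston"},
    "u3": {"Los Angeles"},
    "u4": {"New York City"},
}
matching_users = ["u1", "u2"]  # users assumed to be genetic matches of the individual

p = pedigree_frequencies(all_pedigrees[u] for u in matching_users)  # match frequency
q = pedigree_frequencies(all_pedigrees.values())                    # background frequency
```

In this toy data Boston appears in every matching pedigree but in only half of all pedigrees, so it is a candidate for enrichment, whereas New York City is equally frequent in both sets.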
- the system 100 identifies 430 the likelihood of observing the match frequency and background frequency for a birth location or surname under the assumed scenario that the match frequency and background frequency are equal.
- the statistical analysis module 320 of the system 100 conducts a statistical test and determines 435 an enrichment score for each birth location or surname. Based on the statistical test and the enrichment score, the system may provide 440 the birth location or surname to the individual 101 .
- the birth location and surname identification system 100 is implemented using one or more computers having one or more processors executing application code to perform the steps described herein, and data may be stored on any conventional non-transitory storage medium and, where appropriate, include a conventional database server implementation.
- various components of a computer system, for example, processors, memory, input devices, network devices, and the like, are not shown in FIG. 1 .
- a distributed computing architecture is used to implement the described features.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
- This application is a continuation of U.S. application Ser. No. 15/203,776, filed on Jul. 6, 2016, which claims the benefit of U.S. Provisional Application No. 62/189,422, filed Jul. 7, 2015, both of which are incorporated by reference in their entirety for all purposes.
- This description generally relates to population analyses on human genetic and genealogical information, and particularly to using that information to identify ancestral birth locations or ancestral surnames for an individual.
- An individual may often be interested in learning more about his/her ancestral history including ancestral birth locations and/or ancestral surnames. Families may have genealogical pedigrees or family trees that may be verbally passed down from generation to generation. However, these genealogical family trees become inaccurate as they are passed along or may be missing the birth location or surnames of past ancestors altogether. Therefore, an individual often cannot rely on genealogical data provided by a family member to identify ancestral birth locations or surnames.
- Described embodiments identify likely birth locations and surnames of an individual's ancestors based on the individual's genotype, genotypes of a population of users who are genetic matches to the individual, and genealogical data (e.g. pedigree or family tree) of those matches. Note that no genealogical data for the individual is necessary for this identification to be performed.
- In one embodiment, to generate identifications of possible ancestral birth locations and/or surnames for an individual, a birth location and surname identification system receives a genetic sample from the individual. The individual's genetic sample is sequenced and is analyzed to identify users in the system who are genetic matches to the individual. At least some of those genetically matched users will have an associated pedigree that identifies birth locations and/or surnames of their ancestors. A computer system determines the frequency of appearance of a birth location or surname amongst the pedigrees of the genetically matched users and further determines whether that frequency of appearance is of statistical significance. In various embodiments, the system performs a statistical test to prevent recommending birth locations or surnames that may be disproportionally represented. If the frequency of appearance of the birth location or surname is deemed statistically significant, the system may present it to the individual as a recommended ancestral birth location or surname.
- These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, and accompanying drawings, where:
- FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment.
- FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic and genealogical data associated with users of the system in accordance with an embodiment.
- FIG. 3 is a flow diagram for the operation of the birth location and surname identification module, in accordance with an embodiment.
- FIG. 4 is a flow diagram for identifying ancestral birthplace or surname identifications for an individual, in accordance with an embodiment.
- Note that for purposes of clarity, only one of each item corresponding to a reference numeral is included in most figures, but when implemented, multiple instances of any or all of the depicted modules may be employed, as will be appreciated by those of skill in the art.
- FIG. 1 is a block diagram of an overview of a computing system for identifying an ancestral birth location or surname to an individual, according to one embodiment. Depicted in FIG. 1 are an individual 101 (i.e. a human or other organism), a DNA extraction service 102, a birth location and surname identification system 100, a network 120, and a client device 160.
- Individual 101 provides a DNA sample for analysis. In one embodiment, an individual 101 uses a sample collection kit to provide a DNA sample, e.g., saliva, from which genetic data can be reliably extracted according to conventional DNA processing techniques.
- DNA extraction service 102 receives the sample and estimates genotypes from the genetic data, for example by extracting the DNA from the sample and identifying genotype values of single nucleotide polymorphisms (SNPs) present within the DNA. The result in this example is a diploid genotype for each SNP. The birth location and surname identification system 100 receives the genetic data from DNA extraction service 102 and stores the genetic data in a DNA sample store 140 containing DNA diploid genotypes. The genetic data stored in the DNA sample store 140 may be associated with the individual 101 in the user data store 145 via one or more pointers.
- Identifying ancestral birth locations or surnames that may be associated with a given individual involves analyzing genealogical information of other individuals that are genetic matches with the individual. To determine the genetic matches, analysis of identity-by-descent (IBD) is used. IBD analysis can be used to identify the familial relationship between any two people (e.g., second cousins) in a population as long as the relationship is due to shared common ancestors from the recent past (e.g., on the order of several hundred years). To date, IBD analysis has not been successfully used to accurately identify ancestral birth locations or surnames from an individual's genetic data.
- To perform IBD analysis, the birth location and surname identification system 100 includes an input data processing module 110 that processes the DNA to identify shared segments of DNA data between the individual 101 and a number of other users whose DNA is already stored by the system. An IBD estimation module 115 uses the shared segments of DNA to identify those other users known in the user data store 145, whose genetic data is stored in the DNA sample store 140, who are matches to the individual. The birth location and surname identification module 300 uses the match information to access genealogical data of the individual's 101 genetic matches in order to identify possible surnames and birth locations for the ancestors of the individual 101.
- Each of these modules is described in further detail below. The breakdown of the logical functions of the system 100 into the above-introduced modules is for clarity of description only. In other embodiments, the computer system 100 may comprise more or fewer modules, and the logical structure may be differently organized. The data stores may be represented in different ways in different embodiments, such as comma-separated text files, or as databases such as relational databases (SQL) or non-relational databases (NoSQL).
- The network 120 facilitates communications amongst one or more client devices 160 and the system 100. The network 120 may be any wired or wireless local area network (LAN) and/or wide area network (WAN), such as an intranet, an extranet, or the Internet. In various embodiments, the network 120 uses standard communication technologies and/or protocols. Examples of technologies used by the network 120 include Ethernet, 802.11, 3G, 4G, 802.16, or any other suitable communication technology. The network 120 may use wireless, wired, or a combination of wireless and wired communication technologies. Examples of protocols used by the network 120 include transmission control protocol/Internet protocol (TCP/IP), hypertext transport protocol (HTTP), simple mail transfer protocol (SMTP), file transfer protocol (FTP), or any other suitable communication protocol.
- A client device 160 is a computing device capable of transmitting and/or receiving data via the network 120. In various embodiments, the client device 160 belongs to the individual 101 that provided the genetic sample. Examples of client devices 160 include desktop computers, laptop computers, tablet computers (pads), mobile phones, personal digital assistants (PDAs), gaming devices, or any other electronic device including computing functionality and data communication capabilities. The client device 160 may use a web browser 180, such as Microsoft Internet Explorer, Mozilla Firefox, Google Chrome, Apple Safari and/or Opera, as an interface to connect with the network 120. Additionally or alternatively, specialized application software 180 that runs natively on a mobile device is used as an interface to connect to the network 120.
- The birth location and surname identification system 100 sends, through the network 120, a list of surnames or birthplaces identified by the system 100 to the client device 160 for presentation to the individual 101. The list of surnames or birthplaces may be presented by the client device 160 on a user interface or a display screen.
- Although not necessarily a part of any particular illustrated module, a new user to the system 100 who is submitting their DNA among other data will activate a new account, often through a graphical user interface (GUI) provided through a mobile software application or a web-based interface. As part of the account activation process, the system 100 receives one or more types of basic personal information about the individual 101 such as age, date of birth, geographical location of birth (e.g., city, state, county, country, hospital, etc.), complete name including first, last, and middle names as well as any suffixes, and gender. This received user information is stored in the user data store 145, in association with the corresponding DNA samples stored in the DNA sample store 140.
- To process the data stored in the DNA sample store 140 and estimate IBD from the DNA samples, the computing system 100 comprises an input data processing module 110 and an IBD estimation module 115. These modules are described in relation to FIG. 2, which is a flow diagram for the operation of the computer system 100 for estimating and storing estimated IBD in accordance with an embodiment.
- II.a. DNA Sample Receipt and Account Creation
- FIG. 2 is a flow diagram for the operation of the computer system for receiving, processing and storing genetic, genealogical and survey input data. The input data processing module 110 is responsible for receiving, storing and processing data received from an individual 101 via the DNA extraction service 102. The input data processing module 110 includes a DNA collection module 210, a genealogical collection module 220, a genotype identification module 240, and a genotype phasing module 250.
- The DNA collection module 210 is responsible for receiving sample data from external sources (e.g., extraction service 102), processing and storing the samples in the DNA sample store 140. The DNA sample store 140 may store one or more received DNA samples linked to a user as a <key, value> pair associated with the individual 101. In one instance, the <key, value> pair is <sampleID, “GA TC TC AA”>. The data stored in the DNA sample store 140 may be identified by one or more keys used to index one or more values associated with an individual 101. In one example, keys are a userID and sampleID, or alternatively another <key, value> pair is <userID, sampleID>. In various embodiments, the DNA sample store 140 stores a pointer to a location associated with the user data store 145 associated with the individual 101. The user data store 145 will be further described below.
- II.b. Genealogical Data
- The genealogical collection module 220 both receives and processes genealogical data and stores the data in the user data store 145. This data may be received for the individual 101, and may have been received in the past for other users of the system, some of whom may be determined to be genetic matches to the individual 101.
- The genealogical data may include a variety of different types of information. The genealogical data can take the form of a pedigree of a user (e.g., the recorded relationships in a family). To collect the data, the genealogical collection module 220 may be configured to provide an interactive GUI that asks the user questions or provides a menu of options, and receives user input that can be processed to obtain the genealogical data. Examples of genealogical data that may be collected include, but are not limited to, names (first, last, middle, suffixes), birth locations (e.g., county, city, state, country, hospital, global map coordinates), date of birth, date of death, marriage information, family relations (manually provided rather than genetically identified), etc. These data may be manually provided or automatically extracted via, for example, optical character recognition (OCR) performed on census records, town or government records, or any other item of printed or online material.
- The pedigree information associated with a user may include a genealogical graph. For example, the genealogical graph may include one or more specified nodes. Each specified node in the genealogical graph represents either the user or an ancestor of the user that could have passed down genetic material to the user.
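A genealogical graph of this kind can be pictured with a small Python data structure, sketched below. The class name, its fields, and the choice to exclude the user's own node when collecting ancestral values are all assumptions made for illustration, not details taken from the described system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class PedigreeNode:
    """One node of a pedigree: the user or one of the user's ancestors."""
    name: str
    surname: Optional[str] = None
    birth_location: Optional[str] = None
    parents: List["PedigreeNode"] = field(default_factory=list)


def ancestral_values(root: PedigreeNode, attribute: str) -> Set[str]:
    """Collect the distinct non-missing values of `attribute` over all ancestors of root."""
    seen_ids: Set[int] = set()
    values: Set[str] = set()
    stack = list(root.parents)  # start at the parents so the user is excluded
    while stack:
        node = stack.pop()
        if id(node) in seen_ids:
            continue  # an ancestor can be reachable through more than one path
        seen_ids.add(id(node))
        value = getattr(node, attribute)
        if value:
            values.add(value)
        stack.extend(node.parents)
    return values
```

Reducing each pedigree to a set of distinct values is convenient for the later frequency calculations, where a birth location or surname counts once per pedigree regardless of how many relatives share it.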
- The pedigree information provided by users may or may not be accurate or complete. The genealogical collection module 220 is responsible for filtering the received pedigree data based on one or more quality criteria in an effort to discard lower quality genealogical data. For example, the genealogical collection module 220 may filter the received pedigree data by excluding all pedigree nodes associated with a stored DNA sample that do not satisfy all of the following criteria: (1) the recorded death date for the linked pedigree node corresponds to official records (when available), (2) the gender is the same as the gender provided by the user, and (3) the birth date is within 3 years of the birth date provided by the user. The user may be prompted via GUI to resolve any discrepancies identified by module 220. In some embodiments, all received genealogical data marked as “private” are excluded from any subsequent analysis to ensure that any privacy requirements imposed on the data are met.
- II.c. Processing and Phasing DNA Samples
- The genotype identification module 240 accesses the collected DNA data from the DNA collection module 210 or the sample store 140 and identifies autosomal SNPs so that the individual's diploid genotype on autosomal chromosomes can be computationally phased. The genotype identification module 240 provides the identified SNPs to the genotype phasing module 250 which phases the individual's diploid genotype based on the set of identified SNPs. The genotype phasing module 250 generates a pair of estimated haplotypes for each diploid genotype. The estimated haplotypes are then stored in the user data store 145 in association with the individual 101, and may also be stored in association with or verified against the genotypes of the individual's parents, who may also have their own separate accounts in the system 100. A variety of different computational phasing techniques may be used including, for example, the techniques described in U.S. Patent Application No. 2016/061,568, filed on Jan. 17, 2014, which is hereby incorporated by reference in its entirety. The phasing module 250 stores phased genotypes in the user data store 145.
- II.d. IBD Estimation
- The IBD estimation module 115 is responsible for identifying IBD segments (also referred to as IBD estimates) from phased genotype data (haplotypes) between the individual 101 and a user stored in the user data store 145. IBD segments are chromosome segments identified in the individual 101 and a user that are putatively inherited from a recent common ancestor. Typically, an individual 101 and a user who are closely related share a relatively large number of IBD segments, and the IBD segments tend to have greater length (individually or in aggregate across one or more chromosomes). Conversely, an individual 101 and a user who are more distantly related share relatively few IBD segments, and these segments tend to be shorter (individually or in aggregate across one or more chromosomes).
- In one embodiment, the IBD estimation algorithm used by the IBD estimation module 115 to estimate (or infer) IBD segments between an individual 101 and a user is as described in U.S. patent application Ser. No. 14/029,765, filed on Sep. 17, 2013, which is hereby incorporated by reference in its entirety. A further processing step may be performed on these inferred IBD segments by applying the technique described in PCT Patent Application No. PCT/US2015/055579, filed on Oct. 14, 2015, which is hereby incorporated by reference in its entirety. The identified IBD segments are stored in the user data store 145 in association with the individual 101.
- The IBD estimation module 115 is configured to estimate IBD segments between the individual 101 and large numbers of users stored in the user data store 145. In some embodiments of this module, the computing system has been optimized to efficiently handle large amounts of IBD data. Said another way, IBD is estimated across a large number of individuals based on their DNA. For example, in one implementation, the IBD estimation module 115 (and computing system 100 generally) distributes IBD computations over a Hadoop computing cluster, internal to or external from computing system 100, and stores the phased genotypes used in the IBD computations in a database so that IBD estimates for new accounts/individuals can be quickly compared to previously processed individuals.
- III.a. Identifying Genetic Matches
- FIG. 3 depicts the birth location and surname identification module 300, in accordance with an embodiment. The birth location and surname identification module 300 includes a genetic match module 305, a location frequency calculator 310, a surname frequency calculator 315, a statistical analysis module 320, an enrichment score module 325, a list generation module 330, a location frequency store 350, a surname frequency store 360, a location score store 380, and a surname score store 390.
- The genetic match module 305 retrieves the IBD estimates between the individual 101 and the users in the user data store 145 and determines whether the individual 101 and any given other user are a genetic match. In one embodiment, the individual 101 and a user are a match if they have higher than a threshold amount of IBD segment sharing, as determined by the IBD estimation module 115. A match may indicate that the individual 101 and the user are related (e.g. parent/child, sibling, aunt/uncle, first cousin, first cousin once removed, second cousin, second cousin once removed). The genetic match module 305 identifies all users in the user data store 145 that are considered matches to the individual 101. The set of user matches is referred to herein by M_u. In an example, the number of matches is limited to the top 3000 matches (i.e. |M_u| ≤ 3000) sorted based on amount of IBD segment sharing.
- III.b. Match and Background Frequency for a Birth Location
- The genetic match module 305 provides the set of user genetic matches, M_u, to the location frequency calculator 310 to determine an identification of possible birth locations associated with the user. The location frequency calculator 310 determines how frequently a particular birth location appears amongst the pedigrees of the users within the set of user matches, M_u. To do this, the location frequency calculator 310 retrieves, for each matching user in the set of user matches M_u, the matching user's pedigree. The pedigree includes a genealogical graph from the user data store 145. For example, a genealogical graph in the pedigree may be the matching user's family tree that describes the relationship between the matching user and each of the matching user's relatives. Each relative in the matching user's pedigree has associated genealogical data such as the relative's birth location.
- For a given matching user v ∈ M_u, T_v denotes a set of birth locations indicated in matching user v's pedigree. For each matching user, the location frequency calculator 310 identifies the set of birth locations, T_v, in the matching user's pedigree. For example, the location frequency calculator 310 may identify a matching user v as having 10 relatives born in New York City, 2 relatives born in Boston, and 2 relatives born in Los Angeles. Therefore, the elements in T_v include New York City, Boston, and Los Angeles. A presence indicator, a_{v,i}, may be represented by the indicator function representing whether a birth location i is indicated in matching user v's pedigree:
a_{v,i} = 1 if i ∈ T_v, and a_{v,i} = 0 otherwise (1)
- Thus, for this example, matching user v has an indicator function score of 1 for birth locations of New York City, Boston, and Los Angeles. All other birth locations (e.g. Washington D.C., elsewhere) would have a presence indicator and corresponding indicator function score of 0.
- The location frequency calculator 310 repeats this process for the set of matches M_u. For a given birth location, i, the total number of pedigrees (m_i) of users in the set of matches that have this birth location is determined according to:
m_i = Σ_{v ∈ M_u} a_{v,i} (2)
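The presence indicator and the pedigree count described above translate directly into a few lines of Python. In this sketch each matching user's pedigree is represented only by its set of birth locations; the function names and the extra toy pedigrees (including the Chicago entry) are hypothetical.

```python
from typing import Iterable, Set


def presence_indicator(location: str, pedigree_locations: Set[str]) -> int:
    """Return 1 if the birth location appears in the pedigree's location set, else 0."""
    return 1 if location in pedigree_locations else 0


def pedigree_count(location: str, matches: Iterable[Set[str]]) -> int:
    """Sum the presence indicator over the location sets of all matching users."""
    return sum(presence_indicator(location, locations) for locations in matches)


# Matching user v from the example: 14 relatives collapse to 3 distinct locations.
t_v = set(["New York City"] * 10 + ["Boston"] * 2 + ["Los Angeles"] * 2)

# Two further hypothetical matching users.
match_sets = [t_v, {"Boston"}, {"Chicago", "Boston"}]
```

Because each pedigree contributes at most 1 per location, the count for any location can never exceed the number of matching users.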
- For example, the location frequency calculator 310 summates the indicator function score for each birth location. Thus, if the number of matching users is 1000, the maximum number of pedigrees (max(m_i)) that have the birth location is 1000, which would occur if every user in the set of matches M_u has the birth location in their pedigree.
- The location frequency calculator 310 uses the total number of pedigrees, m_i, to determine p_i, the match frequency of a birth location i, where p_i is determined according to:
p_i = m_i / |M_u| (3)
- where |M_u| is the total number of matching users in the set M_u. Returning to the previous example, if all 1000 users in the set of matches M_u (e.g. |M_u| = 1000) had the birth location of New York City in their pedigree (e.g. m_i = 1000), then the match frequency of New York City is p_{New York City} = 1. Therefore, the match frequency of a birth location represents how often matching users in the set of matches M_u are associated with the birth location, which can be used as a way of determining an association between the ancestors of the individual 101 and the birth location. This match frequency is stored in the location frequency store 350.
- The location frequency calculator 310 also calculates a background frequency for each birth location i. The background frequency of a birth location provides an indication as to how often the birth location appears amongst the greater population of users stored in the system, including those who are not matches to the individual. For example, high population cities such as New York City or Boston may have higher background frequencies than smaller cities such as Cheyenne, Wyo. Here, D represents the total set of users in the system. Generally the number of users in set D is significantly larger (e.g. multiple orders of magnitude larger) than the number of matching users in the set of matches M_u. Each user in D may have a corresponding pedigree. Altogether, this forms the set of all pedigrees stored in the user data store 145.
- To determine the background frequency, the location frequency calculator 310 may use a similar indicator function as was previously shown in equation (1) to calculate whether a birth location i exists in the pedigree corresponding to user w in the set D:
- The
location frequency calculator 310 summates the total number of pedigrees that each have the birth location, each pedigree corresponding to a user w in the set of D. To calculate the background frequency, thelocation frequency calculator 310 divides the summated total number of users in the set of D that have the birth location by the total number of users in the set of D. Therefore, the background frequency of a birth location, i, is expressed in equation 5 as -
- This background frequency is stored in the
location frequency store 350. - III.c. Match and Background Frequency for a Surname
The surname frequency calculator 315 calculates a match frequency and a background frequency for each surname in the pedigrees of matching users, in a fashion similar to that discussed for birth locations in section III.b. The surname frequency calculator 315 receives the set of user matches, $M_u$, from the genetic match module 305 for an individual 101 and determines a match frequency, $p_j$, that represents how often a given surname, $j$ (e.g., Bradley, O'Malley, Johnson), appears among the pedigrees of users, $v$, in the set of user matches, $M_u$. For example, for the surname “Bradley” (i.e., $j$ = “Bradley”),

$$p_j = \frac{1}{|M_u|} \sum_{v \in M_u} a_{v,j} \tag{6}$$

where $a_{v,j}$ is the indicator function previously described in equation (1). The surname frequency calculator 315 also calculates the background frequency for the surname “Bradley” in the total set of users in the system, $D$:

$$q_j = \frac{1}{|D|} \sum_{w \in D} a_{w,j} \tag{7}$$
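The match and background frequency calculations for a surname, which mirror those for a birth location in section III.b, can be sketched as follows. This is an illustrative sketch only: the function `frequencies`, the dictionary layout mapping each user to the set of values in that user's pedigree, and the sample users are all hypothetical and are not taken from the specification.

```python
def frequencies(value, matched_users, pedigree_values):
    """Return (m, p, q) for one birth location or surname.

    pedigree_values maps every user in D to the set of values (birth
    locations or surnames) present in that user's pedigree.
    matched_users is the subset of those users that forms M_u.
    """
    if not matched_users or not pedigree_values:
        raise ValueError("both M_u and D must be non-empty")
    # m: number of matching pedigrees that contain the value.
    m = sum(1 for v in matched_users if value in pedigree_values[v])
    # Number of pedigrees in the whole system that contain the value.
    total = sum(1 for values in pedigree_values.values() if value in values)
    p = m / len(matched_users)       # match frequency
    q = total / len(pedigree_values)  # background frequency
    return m, p, q


# Hypothetical system of five users, two of whom match the individual.
D = {
    "u1": {"Bradley", "Johnson"},
    "u2": {"Bradley"},
    "u3": {"Johnson"},
    "u4": {"O'Malley"},
    "u5": {"Johnson", "O'Malley"},
}
M_u = {"u1", "u2"}

print(frequencies("Bradley", M_u, D))
print(frequencies("Johnson", M_u, D))
```

In this toy data, “Bradley” appears in both matching pedigrees but in only two of the five pedigrees overall, whereas “Johnson” is slightly less frequent among the matches than in the system as a whole.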
The match frequency, $p_j$, and background frequency, $q_j$, may each be stored in the surname frequency store 360.

Often, there will be many more surnames in the user data store 145 than birth locations. Many of those surnames will be similar to others, with only minor variations in spelling. Left unmodified, these variant spellings may reduce the efficacy of the surname frequency calculation. To address this, the surname frequency calculator 315 may first normalize the surnames, meaning that the surname frequency calculator 315 may treat many alternate spellings as the same surname for purposes of frequency calculations. Examples of such variations include use of characters not used in English (e.g., “o” versus “ø”); capitalization, punctuation, and spacing (“O'Malley” versus “Omalley”); suffixes (“Jr.”); and commentary (“Johnson (WWII Veteran)”). A simple normalization is performed that ignores capitalization and punctuation and removes commentary, thereby reducing the set of surnames under consideration. Under this simple normalization, the remaining alternate spellings and misspellings may still be interpreted by the surname frequency calculator 315 as different surnames.

III.d. Calculating the Statistical Likelihood
The statistical analysis module 320 identifies which birth locations and surnames are sufficiently notable for the individual 101 under consideration to merit possibly being provided to the individual 101 as likely being associated with their own ancestors.

For example, there may be a total of 1000 users in the set of matches, $M_u$. Assume that 10% of the users in the set of matches, $M_u$, have a particular birth location in their pedigree (i.e., $m_i = 100$, therefore $p_i = 0.1$). A match frequency, $p_i$, of 10% may appear to be a very high rate of appearance for a birth location. However, if the background frequency, $q_i$, is also close to 10%, meaning that the birth location appears approximately equally frequently in the pedigrees of all users in the system, then a match frequency, $p_i$, of 10% may not be sufficiently notable to be worth identifying as associated with the individual.
The statistical analysis module 320 receives the match frequency, $p_i$, and background frequency, $q_i$, for each birth location, $i$, from the location frequency calculator 310. The statistical analysis module 320 conducts a statistical test to determine whether the match frequency of a given birth location is sufficiently notable. For each birth location, $i$, the statistical analysis module 320 determines the likelihood of observing the received match frequency, $p_i$, and background frequency, $q_i$, under a null hypothesis $H_0$. An example of a null hypothesis $H_0$ is the assumed scenario where the match frequency and background frequency are the same (i.e., $p_i = q_i$). Conversely, the alternative hypothesis $H_1$ is the assumed scenario where the match frequency and background frequency are not equal (i.e., $p_i \neq q_i$), with the assumption that if, in particular, $p_i > q_i$, then $p_i$, and thus $i$, may be statistically significant and therefore worth possibly providing to the individual.

Therefore, the statistical analysis module 320 determines the likelihood of observing the received match frequency and background frequency under the assumption that the match frequency, $p_i$, and background frequency, $q_i$, are equal. However, if the received match frequency is sufficiently larger than the received background frequency, then the null hypothesis $H_0$ is rejected in favor of the alternative hypothesis $H_1$. What constitutes a sufficient difference between the received match frequency, $p_i$, and background frequency, $q_i$, is discussed further below with regard to the summary statistic $S_i$.

A similar calculation may be performed for surnames by receiving the match frequency, $p_j$, and background frequency, $q_j$, for all surnames, $j$, from the surname frequency calculator 315. The subsequent discussion focuses on conducting a statistical test for a birth location, $i$, but applies equally to conducting a statistical test for a surname.

The statistical test is performed under a null hypothesis $H_0$, the assumed scenario where the match frequency, $p_i$, and the background frequency, $q_i$, are equal. In various embodiments, the statistical analysis module 320 conducts a maximum likelihood ratio test. In other examples, the statistical test may be a Pearson's chi-squared test, a Z-test, or an F-test. The test statistic, $\Lambda$, for the maximum likelihood ratio test is determined according to:

$$\Lambda = \frac{L(m_i \mid H_0)}{\max_{p \in (0,1)} L(m_i \mid H_1)} \tag{8}$$

where $L(m_i \mid H_0)$ denotes the likelihood of observing $m_i$ under the null hypothesis that the match frequency and background frequency are equal (i.e., $p_i = q_i$) for a birth location $i$, and $\max_{p \in (0,1)} L(m_i \mid H_1)$ denotes the maximum likelihood of observing $m_i$ under the alternative hypothesis when varying $p$ between 0 and 1. Thus, the test statistic is a ratio between a first likelihood of observing the match frequency and background frequency under the null hypothesis and a second likelihood of observing the match frequency and background frequency under the alternative hypothesis.
A summary statistic, $S_i$, is determined using $\Lambda$ according to:

$$S_i = -\log \Lambda \tag{9}$$

The statistical analysis module 320 calculates a summary statistic for each birth location $i$. Note that if the match frequency, $p_i$, and background frequency, $q_i$, for a birth location received from the location frequency calculator 310 are equal, then the value of the summary statistic is zero. Additionally, the summary statistic $S_i$ increases in magnitude as the difference between the match frequency, $p_i$, and background frequency, $q_i$, increases in magnitude.

According to Wilks' theorem, as the sample size increases, twice the summary statistic, $2S_i$, will follow a first-order chi-squared distribution (i.e., one with one degree of freedom) under the null hypothesis. Therefore, the statistical analysis module 320 may calculate the p-value for rejecting the null hypothesis $H_0$ based on the first-order chi-squared distribution of $2S_i$.

For example, at a significance level of 0.995 (p-value = $5 \times 10^{-3}$), the null hypothesis is rejected if $S_i > 4$ (or $2S_i > 8$) based on the first-order chi-squared distribution. In other words, if the match frequency is sufficiently larger than the background frequency for a particular birthplace or surname such that the summary statistic $S_i > 4$, then the alternative hypothesis (i.e., that the match frequency does not equal the background frequency) is accepted. This indicates that the particular birthplace or surname is sufficiently notable to be associated with the ancestors of the individual 101.

The exact value of the significance level may vary by implementation, or according to more specific factors. Also, although the above embodiment describes the significance level as being a p-value, in practice it may be any threshold that determines whether a particular birth location $i$ or surname $j$ is sufficiently statistically significant to merit consideration for providing to the individual.
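The likelihood ratio test described in this section can be illustrated with a short sketch. The text does not state the likelihood function, so the sketch assumes a binomial model in which each of the $|M_u|$ matching pedigrees independently contains the birth location with probability $q_i$ under the null hypothesis; under that assumption the maximum under the alternative is attained at $p = m_i/|M_u|$. The function names are hypothetical, and the natural logarithm is used so that Wilks' theorem applies to $2S_i$.

```python
import math


def summary_statistic(m, n, q):
    """S = -log(Lambda) for m of n matching pedigrees, assuming a binomial
    likelihood with success probability q under the null hypothesis."""
    if n <= 0 or not 0 <= m <= n or not 0.0 < q < 1.0:
        raise ValueError("need n > 0, 0 <= m <= n and 0 < q < 1")
    p = m / n  # maximum likelihood estimate under the alternative
    s = 0.0
    if m > 0:  # skip terms whose limit is zero (0 * log 0)
        s += m * math.log(p / q)
    if m < n:
        s += (n - m) * math.log((1.0 - p) / (1.0 - q))
    return s


def p_value(s):
    """Tail probability of 2*s under a chi-squared distribution with one
    degree of freedom, which equals erfc(sqrt(s))."""
    return math.erfc(math.sqrt(max(s, 0.0)))


def is_notable(m, n, q, threshold=4.0):
    """Reject the null only when the match frequency exceeds the background
    frequency and the summary statistic exceeds the threshold."""
    return m / n > q and summary_statistic(m, n, q) > threshold


print(summary_statistic(100, 1000, 0.10))  # match frequency equals background
print(summary_statistic(100, 1000, 0.05))  # match frequency twice background
print(p_value(4.0))
```

With 100 of 1000 matching pedigrees containing a location, a background frequency of 0.1 gives a summary statistic of zero, while a background frequency of 0.05 gives a value of roughly 20.65, well above the threshold of 4. The tail probability at a summary statistic of 4 is about 0.0047, consistent with the approximate p-value of $5 \times 10^{-3}$ quoted in the text.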
For birth locations specifically, the statistical analysis module 320 may adjust the significance level (e.g., p-value) for a birth location $i$ based on the country of origin of the birth location. In various embodiments, the country of origin of a birth location $i$ is determined based on the latitude and longitude coordinates associated with the birth location. More specifically, the significance level for a country of origin is chosen based on the number of users in the database who have that country in their respective pedigrees and on the number of matches that a given individual has in the database that have a pedigree attached to them. For example, a birth location in a country having a large number of associated users (e.g., the United States or the Nordic countries) may use a relatively high significance level (e.g., 0.995), whereas a birth location in a country having relatively few associated users (e.g., Mexico, Russia, or countries in Eastern Europe) may use a relatively lower significance level (e.g., 0.9). Thus, depending upon the country, the threshold needed to determine whether the difference between the birth location match frequency and the corresponding background frequency is statistically significant may vary. This allows the module 300 to better take into account the relative availability of data regarding a particular country in determining whether particular birth locations are statistically significant.

III.e. Calculating the Enrichment Score
The statistical analysis module 320 determines which birth locations and surnames are statistically significant given the information known about them from the underlying pedigree data of users genetically matched to an individual 101. The enrichment score module 325 uses this binary determination of statistical significance to determine an enrichment score representing the strength of association between the birth location or surname and the ancestors of the individual 101.

To do this, the enrichment score module 325 determines an enrichment score, $x_i$, for each birth location, $i$, or an enrichment score, $x_j$, for each surname, $j$. The enrichment score module 325 receives the summary statistic, $S_i$, for each birth location or the summary statistic, $S_j$, for each surname. Additionally, the enrichment score module 325 receives the match frequency, $p_i$ or $p_j$, and the background frequency, $q_i$ or $q_j$, for birth locations and surnames.

To calculate an enrichment score of a birth location, $i$, at the previously selected significance level of 0.995, the enrichment score module 325 calculates the enrichment score to be:

$$x_i = \begin{cases} p_i \log \dfrac{p_i}{q_i} & \text{if } S_i > 4 \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

The exact form of the calculation may vary in practice; in particular, the significance level may vary by country as described above. Note that if the match frequency, $p_i$, and the background frequency, $q_i$, are not significantly different, then the enrichment score is close to zero, indicating that the particular birth location may not be very relevant to ancestors of the individual 101. Scaling the match frequency by a factor of $\log(p/q)$ eliminates biases towards highly popular birth locations and surnames because they are likely to have a high background frequency (high $q$) as well, thereby reducing the enrichment score.
In one embodiment, the enrichment score module 325 calculates the enrichment score for a surname, $j$, in the same manner, according to equation (10). In another embodiment, the enrichment score module 325 calculates the enrichment score for a surname, $j$, as

$$x_j = p_j \log \frac{p_j}{q_j}$$

for all summary statistic values.
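The two enrichment score variants described in this section can be sketched in one function. The sketch follows the text's description of scaling the match frequency by $\log(p/q)$; the zero score for values that fail the significance threshold, the function name, and the parameter names are assumptions made for illustration.

```python
import math


def enrichment_score(p, q, s=None, threshold=4.0):
    """Scale the match frequency p by log(p / q).

    If a summary statistic s is supplied, the score is gated: it is zero
    unless s exceeds the threshold. With s=None the score is ungated,
    i.e. computed for all summary statistic values.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("background frequency q must be in (0, 1]")
    if p <= 0.0:
        return 0.0  # the value never appears among the matches
    if s is not None and s <= threshold:
        return 0.0  # not statistically significant at the chosen level
    return p * math.log(p / q)


# Gated score, as described for birth locations.
print(enrichment_score(0.10, 0.05, s=20.65))
print(enrichment_score(0.10, 0.09, s=1.2))
# Ungated score, as in the alternative surname embodiment.
print(enrichment_score(0.50, 0.50))
```

A highly popular value with equal match and background frequencies scores zero even though its match frequency is large, which is the bias correction the text attributes to the logarithmic factor. In the ungated form, a value that is rarer among the matches than in the background receives a negative score.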
The respective enrichment scores for a birth location and a surname are stored in the location score store 380 and the surname score store 390, respectively.

III.f. Generating a List of Identified Birth Locations or Surnames

The enrichment score module 325 provides the enrichment score associated with each birth location or surname to the list generation module 330. The list generation module 330 may rank and/or provide one or more birth locations or surnames to the individual through the client device 160 based on their associated enrichment scores. Exactly how the list generation module 330 provides birth locations and surnames, how many it provides, and in what form (e.g., lists) may vary by implementation. In one embodiment, the list generation module 330 may set a minimum threshold that a particular birth location or surname must meet in order to be recommended. For example, a birth location or surname may have to meet a certain minimum threshold enrichment score, and/or may have to appear in at least some number of pedigrees (e.g., $m_i \geq 3$), for it to be recommended.

In one embodiment, the list generation module 330 generates a list 370 including only the top N birth locations or surnames by enrichment score. The list 370 is sent through the network 120 to the client device 160 for consumption by the individual 101.
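The filtering and ranking described in this section can be sketched as follows. The tuple layout, the function name `top_candidates`, the default thresholds, and the sample scores are hypothetical; in particular, keeping only candidates whose score strictly exceeds the minimum is one possible reading of the minimum threshold.

```python
def top_candidates(candidates, n=10, min_score=0.0, min_pedigrees=3):
    """Return the names of the top-n candidates by enrichment score.

    candidates is an iterable of (name, enrichment_score, pedigree_count).
    A candidate is kept only if its score exceeds min_score and it appears
    in at least min_pedigrees matching pedigrees.
    """
    eligible = [
        c for c in candidates
        if c[1] > min_score and c[2] >= min_pedigrees
    ]
    # Highest enrichment score first.
    eligible.sort(key=lambda c: c[1], reverse=True)
    return [name for name, _, _ in eligible[:n]]


# Hypothetical scored birth locations: (name, score, matching pedigrees).
scored = [
    ("Boston", 0.07, 12),
    ("Cheyenne", 0.20, 2),
    ("New York City", 0.0, 400),
    ("Dublin", 0.15, 5),
]
print(top_candidates(scored, n=2))
```

Here “Cheyenne” is dropped despite its high score because it appears in fewer than three matching pedigrees, and “New York City” is dropped because its enrichment score is zero.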
FIG. 4 illustrates a process of providing an identified birth location or surname to an individual, according to one embodiment. The birth location and surname identification system 100 receives 405 a sequence of genetic data from an individual 101. The system 100 identifies users in the system 100 that are genetic matches with the individual 101. This may be accomplished by identifying DNA segment matches between the pair of haplotypes for the individual and the pair of haplotypes for each user retrieved from the user data store 145.

Each user in the system 100 has a corresponding pedigree stored in the user data store 145. From the set of all pedigrees in the user data store 145, a subset of matching pedigrees is identified. Each pedigree in the subset of matching pedigrees is associated with a user that is a genetic match of the individual 101. The system 100 determines 420 a match frequency, $p$, of the birth location or surname among the subset of matching pedigrees. Additionally, the system 100 determines 425 a background frequency, $q$, of the birth location or surname among the set of all pedigrees.

The system 100 identifies 430 the likelihood of observing the match frequency and background frequency for a birth location or surname under the assumed scenario that the match frequency and background frequency are equal. The statistical analysis module 320 of the system 100 conducts a statistical test and determines 435 an enrichment score for each birth location or surname. Based on the statistical test and the enrichment score, the system may provide 440 the birth location or surname to the individual 101.

The birth location and surname identification system 100 is implemented using one or more computers having one or more processors executing application code to perform the steps described herein, and data may be stored on any conventional non-transitory storage medium and, where appropriate, include a conventional database server implementation. For purposes of clarity and because they are well known to those of skill in the art, various components of a computer system, for example, processors, memory, input devices, network devices and the like are not shown in FIG. 1. In some embodiments, a distributed computing architecture is used to implement the described features.

In addition to the embodiments specifically described above, those of skill in the art will appreciate that the invention may additionally be practiced in other embodiments. Within this written description, the particular naming of the components, capitalization of terms, the attributes, data structures, or any other programming or structural aspect is not mandatory or significant unless otherwise noted, and the mechanisms that implement the described invention or its features may have different names, formats, or protocols. Further, the system may be implemented via a combination of hardware and software, as described, or entirely in hardware elements. Also, the particular division of functionality between the various system components described here is not mandatory; functions performed by a single module or system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component. Likewise, the order in which method steps are performed is not mandatory unless otherwise noted or logically required.
It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.

Algorithmic descriptions and representations included in this description are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or code devices, without loss of generality.

Unless otherwise indicated, discussions utilizing terms such as “selecting” or “determining” or “estimating” or the like refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/184,451 US20210183474A1 (en) | 2015-07-07 | 2021-02-24 | Genetic and genealogical analysis for identification of birth location and surname information |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562189422P | 2015-07-07 | 2015-07-07 | |
US15/203,776 US10957422B2 (en) | 2015-07-07 | 2016-07-06 | Genetic and genealogical analysis for identification of birth location and surname information |
US17/184,451 US20210183474A1 (en) | 2015-07-07 | 2021-02-24 | Genetic and genealogical analysis for identification of birth location and surname information |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/203,776 Continuation US10957422B2 (en) | 2015-07-07 | 2016-07-06 | Genetic and genealogical analysis for identification of birth location and surname information |
Publications (1)
Publication Number | Publication Date |
---|---|
US20210183474A1 (en) | 2021-06-17 |
Family
ID=57685282
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/203,776 Active 2037-06-07 US10957422B2 (en) | 2015-07-07 | 2016-07-06 | Genetic and genealogical analysis for identification of birth location and surname information |
US17/184,451 Pending US20210183474A1 (en) | 2015-07-07 | 2021-02-24 | Genetic and genealogical analysis for identification of birth location and surname information |
Country Status (7)
Country | Link |
---|---|
US (2) | US10957422B2 (en) |
EP (1) | EP3320469A4 (en) |
AU (2) | AU2016290989A1 (en) |
CA (1) | CA2991230C (en) |
MX (1) | MX391111B (en) |
NZ (1) | NZ739413A (en) |
WO (1) | WO2017006284A1 (en) |
-
2016
- 2016-07-06 US US15/203,776 patent/US10957422B2/en active Active
- 2016-07-07 WO PCT/IB2016/054094 patent/WO2017006284A1/en active Application Filing
- 2016-07-07 MX MX2018000293A patent/MX391111B/en unknown
- 2016-07-07 CA CA2991230A patent/CA2991230C/en active Active
- 2016-07-07 EP EP16820936.9A patent/EP3320469A4/en active Pending
- 2016-07-07 NZ NZ73941316A patent/NZ739413A/xx unknown
- 2016-07-07 AU AU2016290989A patent/AU2016290989A1/en not_active Abandoned
-
2021
- 2021-02-24 US US17/184,451 patent/US20210183474A1/en active Pending
- 2021-11-16 AU AU2021269307A patent/AU2021269307A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
NZ739413A (en) | 2018-01-26 |
EP3320469A1 (en) | 2018-05-16 |
AU2021269307A1 (en) | 2021-12-09 |
WO2017006284A1 (en) | 2017-01-12 |
US20170011042A1 (en) | 2017-01-12 |
US10957422B2 (en) | 2021-03-23 |
AU2016290989A1 (en) | 2018-02-22 |
EP3320469A4 (en) | 2019-03-06 |
MX391111B (en) | 2022-03-30 |
MX2018000293A (en) | 2018-03-08 |
CA2991230C (en) | 2023-09-26 |
CA2991230A1 (en) | 2017-01-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STPP | Information on status: patent application and granting procedure in general | Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED |
 | AS | Assignment | Owner name: ANCESTRY.COM DNA, LLC, UTAH. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: KERMANY, AMIR R.; GRANKA, JULIE M.; NOTO, KEITH D.; SIGNING DATES FROM 20170504 TO 20171002; REEL/FRAME: 056104/0179 |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
 | AS | Assignment | Owner name: WILMINGTON TRUST, NATIONAL ASSOCIATION, AS NOTES COLLATERAL AGENT, DELAWARE. Free format text: PATENT SECURITY AGREEMENT; ASSIGNORS: ANCESTRY.COM DNA, LLC; ANCESTRY.COM OPERATIONS INC.; REEL/FRAME: 058536/0278. Effective date: 20211217. Owner name: CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH, AS COLLATERAL AGENT, NEW YORK. Free format text: PATENT SECURITY AGREEMENT; ASSIGNORS: ANCESTRY.COM DNA, LLC; ANCESTRY.COM OPERATIONS INC.; REEL/FRAME: 058536/0257. Effective date: 20211217 |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |