METHOD FOR ENCODING SINGLE NUCLEOTIDE POLYMORPHISM DATA
CROSS-REFERENCE TO RELATED APPLICATIONS
Not applicable.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
To be determined.
BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates to the genotyping of individuals, and in particular, relates to ternary encoded genotyping for single nucleotide polymorphisms to unambiguously identify an individual.
Description of the Related Art
The technology of genetic testing is moving rapidly out of the laboratory and into commercial use. Genetic testing involves the application of modern techniques of molecular genetics to biological specimens from individuals so that the genetic differences between individuals can be identified and, eventually, cataloged. One goal of modern genetic testing is to enable individuals, or biological material from those individuals, to be uniquely identified based on the content of the DNA in that individual. The technology of DNA genetic identification, also referred to as genotyping, is used in a wide variety of applications. It can be used for paternity testing, in which a suspected parent's genotype is compared with a child's genotype. Genetic identification is used for many forensic purposes in which DNA contained in biological materials exuded by an individual or found at a crime scene is compared with
the individual's own DNA for a variety of court, scientific, legal, and criminal processes. The armed forces are interested in using genetic identification as a means of identifying service members uniquely and in a way that can be verified later, regardless of the status of the remains of the individual. Genetic identification is also used in animal studies for lineage testing and for breeding, particularly in large mammals such as horses and cows. Genetic testing is also useful in linkage and association studies for locating and characterizing disease genes and for helping with drug design. The identification of the location of DNA differences, or polymorphisms, is also useful in breeding plants and animals for agricultural or other purposes in which the tracking of genes or chromosomes transferred by normal sexual inheritance or inserted by genetic engineering is an important part of the process.
A species of genetic identification is based on the existence of single nucleotide polymorphisms, or SNPs. SNPs are locations in the genome at which single base pair variations in nucleotide sequence commonly occur among individuals of the species. Specifically identified SNPs that vary from one individual to the next can serve as physical landmarks along a chromosome. SNPs are believed to be frequently distributed along the genome, occurring in the human genome at a frequency of about 1 per 100 to 1000 bases of genomic DNA. SNPs may occur in genes or may occur between genes in the genome of an individual. Many large pharmaceutical companies have recently invested heavily in the creation of archives of SNPs which will then be publicly available. A consortium has been funded which hopes to identify 300,000 SNPs over a two year period. Presently, it is known that a plurality of SNPs from an unknown individual may be compared to those of known individuals to determine if a match exists for identification purposes. For example, in diploid organisms, a biallelic SNP, having two possible nucleotides, is capable of producing three possible genotypes (i.e. "AA," "Aa," and "aa"). Methods for recording and archiving data related to SNPs are irregular, and no convention has been established. Accordingly, identification of an individual is currently accomplished by comparing each SNP individually from one unknown individual to those of a plurality of known individuals to determine whether a match exists. This is a tedious and time-consuming process.
What is therefore needed is an improved, time-efficient method for automatically comparing a series of SNPs for an unknown individual to a database of SNPs corresponding to known individuals.
BRIEF SUMMARY OF THE INVENTION
In accordance with a first aspect of the invention, a method for storing genetic information about a given individual on a digital computer includes the steps of determining the genotypes of a plurality of single nucleotide polymorphisms for the given individual, and creating a ternary number having a number of places equal to the number of single nucleotide polymorphisms. The method next includes the step of mapping a relative location for each single nucleotide polymorphism within the plurality of single nucleotide polymorphisms to a predetermined place in the ternary number, and identifying each single nucleotide polymorphism as a predetermined digit in the ternary number. The ternary identification number is then converted to a binary identification number, which is then stored on a storage medium accessed by the computer, wherein the binary number serves as a representation of the genotype of the given individual.
These as well as other features and characteristics of the present invention will be apparent from the description which follows. In the detailed description below, preferred embodiments of the invention will be described with reference to the accompanying drawings. These embodiments do not represent the full scope of the invention. Rather the invention may be employed in other embodiments, and reference should therefore be made to the claims herein for interpreting the breadth of the invention.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Fig. 1 is a diagram of the first step of a two-step reaction to genotype an SNP in accordance with the preferred embodiment;
Fig. 2a is a diagram showing the second step of the two-step process for an SNP having an "A" genotype;
Fig. 2b is a diagram showing the second step of the two-step process for an SNP having an "a" genotype;
Fig. 3 is a diagram showing a biallelic SNP genetic marker having "AA," "Aa", and "aa" genotypes along with corresponding mass spectra data in accordance with the preferred embodiment;
Fig. 4 is a diagram showing the results of genotyping a series of SNP markers for three individuals tested in accordance with the preferred embodiment; and
Fig. 5 is a schematic diagram showing a method of recording and archiving SNP data in accordance with the preferred embodiment.
Fig. 6 is a flow chart illustrating the steps in accordance with the preferred embodiment.
DETAILED DESCRIPTION OF THE INVENTION
The present invention is a ternary-encoding method for SNP genotyping.
Specifically, the method is based on a ternary, or base 3, encoding approach to the data. Under the approach described here, it is assumed that the SNP at a given locus on the chromosome is biallelic, and biallelic only. In other words, only two possible nucleotides occur in the population at this location on each of the chromosomes of an individual. If the two alleles, or single nucleotides, at that specific SNP locus are designated "A" and "a," then the only possible genotypes for an individual having two chromosomes are "AA," "Aa," or "aa." The nomenclature for these possibilities is designated here to be ternary, that is, base 3. The three possible genotypes "AA," "Aa," and "aa" are arbitrarily assigned ternary digits of, for example, 0, 1, and 2, respectively.
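The digit assignment described above can be sketched as a small lookup. This is only an illustrative sketch: the function name genotype_to_digit and the table GENOTYPE_TO_DIGIT are hypothetical, and the assignment AA=0, Aa=1, aa=2 is the example assignment from the text, not the only one permitted.

```python
# Example assignment from the text: "AA" -> 0, "Aa" -> 1, "aa" -> 2.
# The names below are hypothetical and chosen only for illustration.
GENOTYPE_TO_DIGIT = {"AA": 0, "Aa": 1, "aa": 2}


def genotype_to_digit(genotype):
    """Return the ternary digit for one biallelic SNP genotype."""
    # "aA" names the same heterozygous genotype as "Aa".
    if genotype == "aA":
        genotype = "Aa"
    if genotype not in GENOTYPE_TO_DIGIT:
        raise ValueError("not a biallelic genotype: %r" % (genotype,))
    return GENOTYPE_TO_DIGIT[genotype]
```

Under this particular assignment the digit also equals the number of copies of allele "a" carried by the individual at that locus.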
Obviously a single genotype for a single SNP will not be diagnostic for a single individual. Therefore the concept here is that there will be a set of SNP markers designated for individuals. Because each SNP within the predetermined set of SNP markers may vary from one individual to the next, the set of SNP markers will then be used to create a unique ternary code for that individual consisting of a series of digits, each digit of the code representing the genotype of that individual at a specific SNP locus used as a marker for this individual. The result is a long ternary number having a number of digits equal to the number of SNP markers used that may be used to identify a given individual. The position of each digit of the ternary number corresponds to a position of each SNP within the series, while the identity of each digit (i.e. 0, 1, or 2)
identifies the genotype of the SNP, as described above. Accordingly, the ternary number may be examined to determine the genotypes of the entire series of SNPs. An advantage of this technique is that the long ternary number thus created (ternary code), which is diagnostic of the genotype of the individual, can be converted into a binary number by utilizing easily accomplished numerical computation techniques. Binary numbers corresponding to the genotypes of a series of SNPs for a plurality of individuals may therefore be stored on a recordable medium that is accessed by, for example, a general purpose computer programmed to accept, store, archive and process genetic and genotyping information. As a result, a ternary code of an unknown individual may be converted to a binary number and compared to the existing database of binary numbers to identify the unknown individual if a match exists. Additionally, the computer could convert the binary numbers back to the initial ternary codes and print or display the codes so that they may be manually compared to an unknown code. Upon comparison, one will be able to identify the location of differing SNP genotypes, and the identity of the SNP for the individuals. This approach thus serves as an effective genetic identification method for each individual because each individual can be assigned a unique binary number derived from the ternary encoded genotyping set over the entire set of SNPs that is easily processed by the computer.
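A minimal sketch of the conversion between a ternary code and its binary form, and back, is given below. The function names are hypothetical. The sketch assumes the ternary code is held as a string of the characters 0, 1 and 2, and that the number of SNPs in the panel is known when converting back, because leading zeros (leading "AA" genotypes under the example assignment) are otherwise lost in the numeric form.

```python
def ternary_to_int(code):
    """Interpret a string of ternary digits as an integer."""
    if not code or any(ch not in "012" for ch in code):
        raise ValueError("not a ternary code: %r" % (code,))
    return int(code, 3)


def ternary_to_binary(code):
    """Return the binary digit string equivalent to a ternary code."""
    return format(ternary_to_int(code), "b")


def int_to_ternary(value, n_snps):
    """Recover the ternary code, padded to the panel size n_snps."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    for _ in range(n_snps):
        value, digit = divmod(value, 3)
        digits.append(str(digit))
    if value:
        raise ValueError("value does not fit in n_snps ternary digits")
    # Digits were produced least significant first.
    return "".join(reversed(digits))


def binary_to_ternary(bits, n_snps):
    """Convert a stored binary digit string back to the ternary code."""
    return int_to_ternary(int(bits, 2), n_snps)
```

For instance, ternary_to_binary("2110112") gives "11100000100", and binary_to_ternary("11100000100", 7) gives back "2110112".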
Alternatively, the ternary code may be converted to a decimal number, or conceivably any other base number, which will compress the code into a smaller number that could, for example, serve as an identification number for each individual to whom the number corresponds. The resulting decimal number is a simple representation of the individual's genotype data over the whole panel of SNPs which are analyzed. In other words, the individual's number, even in base 10, is characteristic of the individual's genotype and can be readily reconverted back to the ternary number indicating the specific alleles present at each marker location in the genotype of that individual. Representation of genotyping data by this simple decimal numerical identification provides a simple method of comparing identities of individuals useful in any of the applications for which genotyping might be appropriate. All that is required is that the number of SNPs selected be sufficient so as to make it statistically highly improbable that any two individuals would have the same decimal identification number.
This scheme also enables efficient comparative analysis of SNP genotyping data between large numbers of individuals. Using the ternary representation of the SNP information from each individual, a comparative code can be created that is a reflection of the variability in genotypes between any two individuals over the whole panel of SNP markers. This can be done simply by subtracting each of the ternary digits in the code of one individual from the respective ternary digits in the code of another. In this operation, each digit is maintained in register with the corresponding digit in the other ternary number. Any non-zero digits in the comparative code created by this comparison indicate markers at which the genotypes of the two individuals differ. Using the comparative subtractive analysis, an average genotypic variability can be calculated at each marker which is a direct measurement of the variation between many individuals at a given SNP. This provides an effective way to compare genotyping data within and among populations of individuals which is useful in genetic linkage and association studies as well as evolutionary genetic studies. The same ternary encoded genotyping data can be used to determine the frequency of each allele at that SNP throughout the population, which may be important in statistical analysis and genetic studies.
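The subtractive comparison can be sketched as follows. All function names are hypothetical. The text does not define "average genotypic variability" precisely, so the sketch assumes one plausible definition: the mean absolute digit difference at a marker over all pairs of individuals. The allele frequency calculation assumes the example assignment AA=0, Aa=1, aa=2, under which each digit counts the copies of allele "a".

```python
from itertools import combinations


def comparative_code(code_a, code_b):
    """Digit-by-digit difference of two ternary codes kept in register."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must cover the same SNP panel")
    return [int(a) - int(b) for a, b in zip(code_a, code_b)]


def differing_markers(code_a, code_b):
    """Zero-based positions at which the two genotypes differ."""
    diffs = comparative_code(code_a, code_b)
    return [i for i, d in enumerate(diffs) if d != 0]


def mean_variability(codes):
    """Assumed definition: mean absolute digit difference per marker,
    averaged over every pair of individuals."""
    pairs = list(combinations(codes, 2))
    if not pairs:
        raise ValueError("need at least two individuals")
    n_markers = len(codes[0])
    return [
        sum(abs(int(a[i]) - int(b[i])) for a, b in pairs) / len(pairs)
        for i in range(n_markers)
    ]


def allele_a_frequency(codes, marker):
    """Frequency of allele "a" at one marker (digit = copies of "a")."""
    return sum(int(code[marker]) for code in codes) / (2 * len(codes))
```

The difference is signed, so a comparative digit ranges from -2 to 2. Only whether it is zero matters for deciding that two genotypes match at a marker.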
Specifically, it is envisioned that a set of SNPs would be identified for a species. The set of SNPs might include two SNPs on each chromosome and may need to include more SNPs on each chromosome (or perhaps fewer on some chromosomes) in order to have a sufficient number of SNPs to create numbers which are statistically likely to be diagnostic of the individuals within the population. By obtaining a DNA sample from each individual and testing each of those DNA samples for each SNP, a ternary SNP code can be created for each individual having the number of digits equal to the number of SNPs tested. Converting the number to base 10 provides a unique number associated with each individual which is identifying and which can be used to uniquely identify biological materials or specimens from that individual.
The ternary encoded genotyping information can be stored and used in several ways. The ternary identifying number, or the digital compression of that number, can be written and stored in any medium of tangible expression. It can be written and stored in hard copy archives or ledgers. This sort of information is most appropriate for storage
in general purpose digital computers programmed to accept, store, archive and process genetic and genotyping information. Much genetic information is being compiled and stored in digital computers and this mode of information formatting and storage is quite compatible with and convenient for use with modern computers programmed to handle genetic information. The data bank can be stored in any appropriate form of data storage which may be accessed by a computer. Thus it is envisioned that data banks of numbers in ternary or compressed digital format will be created, each member of the bank representing the genotype of a given individual in a population. Then it becomes possible to analyze the SNP profile of a sample, generate a ternary or digital number representing the genotype of that sample and then search the data bank for a match based on that single number. In this way, matching the genotype of tissues to individuals can become simple, quick and accurate. Similar data banks can be created for other applications such as animal breeding and the like.
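One way such a data bank search might look is sketched below. The class GenotypeBank and its methods are hypothetical and are offered only as an illustration. The sketch assumes every record uses the same SNP panel and the same mapping of SNPs to digit positions, and it keys each record on the single integer equivalent to the ternary code.

```python
class GenotypeBank:
    """Hypothetical in-memory data bank keyed on the encoded number."""

    def __init__(self, n_snps):
        self.n_snps = n_snps
        self._by_number = {}

    def _key(self, ternary_code):
        # Validate the code, then reduce it to one integer key.
        if len(ternary_code) != self.n_snps:
            raise ValueError("code does not match the panel size")
        if any(ch not in "012" for ch in ternary_code):
            raise ValueError("not a ternary code")
        return int(ternary_code, 3)

    def add(self, individual_id, ternary_code):
        """Store an individual under the number for their genotype."""
        key = self._key(ternary_code)
        self._by_number.setdefault(key, []).append(individual_id)

    def match(self, ternary_code):
        """Return the individuals whose stored number equals the sample's."""
        return list(self._by_number.get(self._key(ternary_code), []))


bank = GenotypeBank(n_snps=7)
bank.add("Individual #1", "2110112")
matches = bank.match("2110112")
```

A list is kept per number because, on a small panel, two individuals may share the same code. A match on the number is then a candidate identification rather than proof of identity.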
EXAMPLES
The genotyping technique employed in accordance with the preferred embodiment was the MALDI-TOF MS analysis of SNPs using the Invader® assay, as described in Griffin, T. J.; Hall, J. G.; Prudent, J. R.; Smith, L. M. Proc. Natl. Acad. Sci. USA 1999, 96, 6301-6306, the disclosure of which is hereby incorporated by reference. Fig. 1 illustrates the first step of the two-step genotyping process, whereby an invader oligonucleotide, an "a allele" probe oligonucleotide and an "A allele" probe oligonucleotide are hybridized to a nucleic acid target of interest (genomic DNA target). The 3' nucleotide of the target was overlapped by the invader, and the 3' nucleotide of the probes were bonded to the 5' nucleotide of the target and were cleaved at the 5' probe nucleotides. Because only one of the probes will bond to a single allele of the target, the cleavage therefore produced allele-specific cleavage products. Referring now to Figs. 2A and 2B, the cleavage products were mixed with a secondary probe and secondary target to produce biotin, which was analyzed using mass spectroscopy to determine the identity of the SNP under test. Specifically, referring also to Fig. 3, the resulting signal molecule corresponding to allele "A" was dT4-biotin having a deprotonated, negative singly charged molecular ion mass-to-charge [(M-H)-] value of 1538, while the signal molecule corresponding to allele "a" was dT3-biotin having a (M-H)- value of 1234. It therefore was observed that a single peak at a (M-H)- value of 1538 indicated an "AA" genotype at the SNP, peaks at 1234 and 1538 indicated genotype "Aa," and a single peak at 1234 indicated an "aa" genotype. The MALDI-TOF MS analysis was performed on a Perceptive Biosystems
(Framingham, MA) Voyager DE-STR mass spectrometer using a nitrogen laser at 337 nm with an initial accelerating voltage of 20 kV and a delay time of 100 nanoseconds. The instrument was run in reflector mode using negative ion detection with external calibration. All spectra acquired consisted of averaged signal from 50-100 laser shots and the data was processed using accompanying Perceptive Biosystems mass spectrometry software.
While the MALDI-TOF MS analysis was used in accordance with the preferred embodiment, it should be appreciated that any genotyping technique capable of identifying a biallelic SNP at a given locus may be used in accordance with the present invention. In some cases, signal suppression was observed using the MALDI-TOF MS analysis. To increase the signal intensity, an additional 1 µL of αCHCA matrix was added to each sample spot after spotting and drying the resuspended sample onto the first layer of the matrix. This process was performed for eleven individuals for a panel of seven strategically chosen SNP locations that were known to vary from one individual to the next. Table 1 illustrates the SNP marker panel showing the two nucleotides corresponding to each of the possible alleles at each SNP marker, along with previously estimated allele frequencies, and resulting Hardy-Weinberg genotype frequencies for each SNP. The more frequently occurring allele was identified as "A" and the less frequent allele was identified as "a."
Table 1. SNP marker panel.
Referring also to Fig. 4, the (M-H)- values for each of seven SNP locations for three of the eleven tested individuals are represented and a resulting seven-digit ternary identification code was produced having values of 0, 1, or 2 at each location depending on whether the corresponding SNP was determined to be "AA," "Aa," or "aa," respectively. Thus, each digit in the ternary code corresponds to a single nucleotide polymorphism at a defined locus, and the identity of each digit corresponds to the genotype of the individual. The resulting ternary code was then converted to a base-10, or decimal, number using any known conversion method. The resulting decimal number is shorter in length than the ternary number and may be more easily archived for more rapid subsequent retrieval using a completely automated analysis system by further converting the decimal (or ternary) number to a binary number. The decimal number may be converted back to a ternary number when one wishes to identify the genotypes of each SNP for an individual. It should be appreciated that increasing the number of SNP locations results in a lower probability that the resulting ternary code would be matched by another individual, thereby increasing the probability of unique identification.
The ternary encoding results for each of the eleven tested individuals are illustrated below in Table 2, along with the resulting composite ternary identification code and corresponding binary and decimal identification numbers. Additionally, the frequency of a given identification code being matched by another individual is estimated using the product rule, where the known Hardy-Weinberg frequencies for each individual SNP are multiplied over the panel of SNPs. The product rule is described in National Research Council (U.S.), Committee on DNA Forensic Science, The Evaluation of Forensic DNA Evidence: An Update; National Academy Press: Washington, DC 1996, the disclosure of which is hereby incorporated by reference.
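The product rule estimate can be sketched as below. The function names are hypothetical, and any allele frequencies passed in are illustrative values supplied by the caller, not the Table 1 values. The sketch assumes the standard Hardy-Weinberg genotype frequencies p^2, 2pq and q^2 for "AA," "Aa" and "aa," where p is the frequency of allele "A" and q = 1 - p, and it assumes the SNPs are independent of one another, which is what multiplying across the panel requires.

```python
import math


def hardy_weinberg(p_major):
    """Genotype frequencies (AA, Aa, aa) for allele "A" frequency p_major."""
    if not 0.0 <= p_major <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p_major
    return (p_major * p_major, 2.0 * p_major * q, q * q)


def code_frequency(ternary_code, major_allele_freqs):
    """Product-rule estimate of how often a ternary code is expected.

    Assumes digit 0/1/2 means AA/Aa/aa and that SNPs are independent.
    """
    if len(ternary_code) != len(major_allele_freqs):
        raise ValueError("one allele frequency is needed per SNP")
    return math.prod(
        hardy_weinberg(p)[int(digit)]
        for digit, p in zip(ternary_code, major_allele_freqs)
    )
```

With the illustrative frequencies p = 0.5 at two SNPs, code_frequency("01", [0.5, 0.5]) is 0.25 × 0.5 = 0.125.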
Table 2. Ternary-encoded genotyping results.
Referring now to Figs. 5 and 6, the method of archiving an individual's genotype in accordance with the preferred embodiment is illustrated with reference to Individual #1, as described above. Specifically, a set of SNPs of an individual (7 for Individual #1) is genotyped as described above at step 10. Next, at step 12, a ternary number is created having a number of digits equal to the number of SNPs that are genotyped for the individual. For instance, because seven SNPs were genotyped for Individual #1, a seven digit ternary number is created. Next, each SNP location is mapped to an arbitrarily determined place on the ternary number at step 14. For instance, as shown in Fig. 5, each SNP location (from left to right) is mapped directly to the corresponding position of the ternary number (i.e. left to right). However, it should be appreciated that the present invention contemplates other types of mapping arrangements. For example, a select few of the SNP locations across the panel having greater variability from one individual to the next may be selected to be mapped first in the ternary number if, for example, one wished to compare specific traits of the individuals. Alternatively, the SNP locations could be arbitrarily mapped to the places on the ternary number so long as the mapping is consistent from one individual to the next.
Next, at step 16, the genotype of each SNP is converted into a specific ternary digit at the previously determined mapped location. Accordingly, the specific genotype is identified by the ternary digit at the predetermined mapped location for each individual SNP, as described above. In accordance with the preferred embodiment, and as illustrated in Fig. 5, a ternary digit is assigned for each SNP, whereby a "0" indicates genotype "AA," "1" indicates genotype "Aa," and "2" indicates genotype "aa." The resulting ternary code corresponding to Individual #1 is "2110112." However, it should be appreciated by one having skill in the art that any numbers or characters may be used to represent the various genotypes at a given SNP. At step 18, the ternary number is converted to a binary number "11100000100" and stored in a computer-accessed memory (indicated by dashed lines as illustrated in Fig. 5). The ternary number may additionally be converted to a more compact decimal number at step 20 so as to be easily memorized by a human. The ternary and binary numbers illustrated in Fig. 5 are the equivalent of decimal number 1796. The double arrows illustrate the equivalence of the binary, ternary, and decimal numbers and that they may be interchanged from one to another for the purposes of identification.
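The equivalence of the three numbers for Individual #1 can be checked by expanding each in powers of its base. The expansion below is ordinary positional arithmetic and is offered only as a worked check of the values given in the text.

```latex
\begin{align*}
2110112_{3} &= 2\cdot 3^{6} + 1\cdot 3^{5} + 1\cdot 3^{4} + 0\cdot 3^{3}
             + 1\cdot 3^{2} + 1\cdot 3^{1} + 2\cdot 3^{0} \\
            &= 1458 + 243 + 81 + 0 + 9 + 3 + 2 = 1796_{10}, \\
11100000100_{2} &= 2^{10} + 2^{9} + 2^{8} + 2^{2}
             = 1024 + 512 + 256 + 4 = 1796_{10}.
\end{align*}
```

A seven-digit ternary code can take 3^7 = 2187 distinct values, from 0 to 2186, so any code on this seven-SNP panel fits in at most 12 binary digits or 4 decimal digits.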
The invention has been described in connection with what are presently considered to be the most practical and preferred embodiments. However, the present invention has been presented by way of illustration and is not intended to be limited to the disclosed embodiments. Accordingly, those skilled in the art will realize that the invention is intended to encompass all modifications and alternative arrangements included within the spirit and scope of the invention, as set forth by the appended claims.