WO2008079374A2

WO2008079374A2 - Methods and compositions for selecting and using single nucleotide polymorphisms

Info

Publication number: WO2008079374A2
Application number: PCT/US2007/026241
Authority: WO
Inventors: Eric T. Wang; Pierre Baldi; Robert K. Moyzis
Original assignee: Wang Eric T; Pierre Baldi; Moyzis Robert K
Priority date: 2006-12-21
Filing date: 2007-12-21
Publication date: 2008-07-03
Also published as: WO2008079374A3

Abstract

The invention provides methods of characterizing SNPs, sets of SNPs, genes, and/or gene fragments characterized by such methods, reagents and compositions using SNPs and sets of SNPs, genes, and/or gene fragments characterized by such methods, methods of assigning predictive value to characterized SNPs, diagnostic, prognostic, and treatment methods and compositions based on such predictive values, and business methods.

Description

METHODS AND COMPOSITIONS FOR SELECTING AND USING SINGLE NUCLEOTIDE POLYMORPHISMS

RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Patent Application No. 60/876,674, entitled "Methods and Compositions for Selecting and Using Single Nucleotide Polymorphisms," filed December 21, 2006. An accompanying duplicate pair of compact discs (CD) containing 12 .txt format (ASCII) files was also filed along with the above-referenced 60/876,674.

ELECTRONIC FILES SUBMITTED ON ACCOMPANYING DUPLICATE COMPACT DISCS

[0002] This application refers throughout to Tables 1-12. Due to the excessive length of Tables 1-12, they are submitted herewith in electronic (.txt) format contained on the accompanying compact discs. The contents of the tables are identical to those of the tables filed electronically with the above-referenced U.S. provisional patent application 60/876,674 filed on December 21, 2006. Thus, these tables do not introduce new matter. The contents of the tables encoded on the accompanying duplicate compact discs (CD) are summarized as follows. The CD contains 12 .txt format (ASCII) files corresponding to Tables 1-12.

File Name        Date Created    Size
TABLE 1.txt      12/21/2007      542 KB
TABLE 2.txt      12/21/2007      434 KB
TABLE 3.txt      12/21/2007      183 KB
TABLE 4.txt      12/21/2007      288 KB
TABLE 5.txt      12/21/2007      173 KB
TABLE 6.txt      12/21/2007      267 KB
TABLE 7.txt      12/21/2007      206 KB
TABLE 8.txt      12/21/2007      315 KB
TABLE 9.txt      12/21/2007      189 KB
TABLE 10.txt     12/21/2007      315 KB
TABLE 11.txt     12/21/2007      28 KB
TABLE 12.txt     12/21/2007      40 KB

[0003] The contents of each table are described as follows:

Table 1 (3 headings): List #; RefSNP ID; and Chromosome. List # is the position of each item in the list; RefSNP ID refers to the reference SNP identifier number; and Chromosome refers to the chromosome to which the SNP was mapped. Table 1 lists a total of 25,386 SNPs identified from the Perlegen SNP dataset as having one allele for which the ALnLH score was greater than 2.6 SD away from the average ALnLH score for the entire dataset, and the alternate allele as having an ALnLH score less than or equal to 1 SD away from the average ALnLH score.
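The selection rule described for Table 1 can be illustrated with a minimal sketch. It assumes that "away from the average" means absolute deviation, that the average and SD are taken over all allele scores in the dataset, and that scores arrive as a mapping from a SNP label to a (major, minor) ALnLH pair. The function name, parameter names, and SNP labels are hypothetical and do not come from the application.

```python
from statistics import mean, pstdev


def select_snps(scores, high_sd=2.6, low_sd=1.0):
    """Return SNP labels where one allele is an outlier and the other is not.

    scores: dict mapping a SNP label to (major_alnlh, minor_alnlh).
    Assumption: deviation is measured as an absolute distance from the
    dataset-wide average, in units of the dataset-wide standard deviation.
    """
    all_values = [value for pair in scores.values() for value in pair]
    if not all_values:
        return []
    mu = mean(all_values)
    sd = pstdev(all_values)
    if sd == 0:
        return []  # no spread, so no allele can be an outlier

    selected = []
    for snp_id, (major, minor) in scores.items():
        z_major = abs(major - mu) / sd
        z_minor = abs(minor - mu) / sd
        major_outlier = z_major > high_sd and z_minor <= low_sd
        minor_outlier = z_minor > high_sd and z_major <= low_sd
        if major_outlier or minor_outlier:
            selected.append(snp_id)
    return selected


# Hypothetical data: 19 unremarkable SNPs plus one with an extreme major allele.
example = {f"snp_{i}": (1.0, -1.0) for i in range(19)}
example["snp_outlier"] = (10.0, 0.0)
print(select_snps(example))
```

Whether the threshold is one-sided or two-sided is not stated in this passage, so the absolute-value comparison above is a design choice that may need to change.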

Table 2 (5 headings): List #; RefSNP ID; Chromosome; Gene Accession ID; and Gene Symbol. The first three headings are described above. Gene accession ID refers to an NCBI-accessible identification number corresponding to a known and annotated gene falling within ± 100 Kb of the corresponding SNP as indicated in the table. The gene symbol is an NCBI-recognized abbreviation for the corresponding gene. A total of 11,330 Perlegen dataset SNPs (a subset of those found in Table 1) is listed in Table 2. Note the following points: (1) in some cases, the gene symbol and gene accession ID are the same; (2) for some SNPs, no gene symbol or gene accession ID is listed, as the proximal gene had not been fully characterized and/or annotated.
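As a rough illustration of the ± 100 Kb criterion used for the gene-annotated tables, the following sketch reports the genes whose annotated span, extended by 100,000 bases on each side, contains a SNP position. The record layout (dicts with `chrom`, `start`, `end`, `symbol`) and the example coordinates are assumptions made for illustration only.

```python
def genes_near_snp(snp_chrom, snp_pos, genes, window=100_000):
    """Return symbols of genes within `window` bases of the SNP.

    genes: iterable of dicts with keys "chrom", "start", "end", "symbol"
    (a hypothetical layout; coordinates are inclusive base positions).
    """
    hits = []
    for gene in genes:
        if gene["chrom"] != snp_chrom:
            continue
        # Extend the gene span by the window on both sides.
        if gene["start"] - window <= snp_pos <= gene["end"] + window:
            hits.append(gene["symbol"])
    return hits


# Hypothetical gene record and SNP positions.
genes = [{"chrom": "1", "start": 1_000_000, "end": 1_050_000, "symbol": "GENE_A"}]
print(genes_near_snp("1", 900_000, genes))   # on the edge of the window
print(genes_near_snp("1", 899_999, genes))   # just outside the window
```

A SNP with no gene inside the window returns an empty list, which corresponds to the unannotated entries mentioned in point (2).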

Table 3 (3 headings; same headings as Table 1). Table 3 lists a total of 8,903 SNPs identified from the HapMap CEU dataset.

Table 4 (5 headings): List #; RefSNP ID; Chromosome; Gene Symbol; Gene Accession ID, each of which is described above. Table 4 lists a subset of the SNPs in Table 3, for a total of 8,717 HapMap (CEU) SNPs falling within ± 100 Kb of an annotated gene (with the exception noted above in point 2).

Table 5 (3 headings; same headings as Table 1). Table 5 lists a total of 8,386 SNPs identified from the HapMap CHB dataset.

Table 6 (5 headings; same headings as Table 4). Table 6 lists a subset of the SNPs in Table 5, for a total of 8,249 HapMap (CHB) SNPs falling within ± 100 Kb of an annotated gene (with the exception noted above in point 2).

Table 7 (3 headings; same headings as Table 1). Table 7 lists a total of 10,000 SNPs identified from the HapMap JPT dataset.

Table 8 (5 headings; same headings as Table 4). Table 8 lists a subset of the SNPs in Table 7, for a total of 9,846 HapMap (JPT) SNPs falling within ± 100 Kb of an annotated gene (with the exception noted above in point 2).

Table 9 (3 headings; same headings as Table 1). Table 9 lists a total of 9,165 SNPs identified from the HapMap YRI dataset.

Table 10 (5 headings; same headings as Table 4). Table 10 lists a subset of the SNPs in Table 9, for a total of 8,983 HapMap (YRI) SNPs falling within ± 100 Kb of an annotated gene (with the exception noted above in point 2).

Table 11 (5 headings): List #; RefSNP ID; Chromosome; Major Allele ALnLH; Minor Allele ALnLH. Table 11 is an illustrative list of 500 SNPs from the HapMap CEU SNP dataset. For each SNP in the list, the ALnLH scores for the major and minor alleles are provided.

Table 12 (5 headings; same headings as Table 4). Table 12 lists a set of selected SNPs common to the Perlegen SNP dataset and the HapMap CEU, CHB, JPT, and YRI SNP datasets, for a total of 1,092 SNPs falling within ± 100 Kb of a gene (with the exception noted above in point 2).
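Finding SNPs common to several selected lists, as described for Table 12, amounts to a set intersection over SNP identifiers. The sketch below assumes each dataset is available as a list of identifier strings; the function name and the example labels are hypothetical.

```python
def common_snps(*datasets):
    """Return the sorted SNP identifiers present in every dataset."""
    if not datasets:
        return []
    shared = set(datasets[0])
    for ids in datasets[1:]:
        shared &= set(ids)  # keep only identifiers seen in every list so far
    return sorted(shared)


# Hypothetical identifier lists standing in for the five selected SNP lists.
print(common_snps(["a", "b", "c"], ["b", "c", "d"], ["c", "b"]))
```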

The machine format for the duplicate compact disks is IBM-PC, and the operating system compatibility is MS-Windows.

BACKGROUND

[0004] The vast amount of human DNA sequence data currently available is driving an unprecedented effort to analyze the relationship between single nucleotide polymorphisms (SNPs), which are scattered throughout the human genome, and the occurrence of common health conditions, e.g., neurodegenerative diseases, psychiatric disorders, metabolic disorders, and cardiovascular diseases. More specifically, the goal is to link the occurrence of each health condition with the presence of a relatively small subset of linked SNP alleles. In practice, however, this is a daunting task, as up to four million SNPs have been identified so far, with a total of up to twelve million SNPs estimated to exist in the human population.

SUMMARY OF THE INVENTION

[0005] The invention provides methods of characterizing SNPs, sets of SNPs, genes, and/or gene fragments characterized by such methods, reagents and compositions using SNPs and sets of SNPs, genes, and/or gene fragments characterized by such methods, methods of assigning predictive value to characterized SNPs, diagnostic, prognostic, and treatment methods and compositions based on such predictive values, and business methods.

[0006] In one aspect, the invention provides a method of characterizing a SNP by determining a quantitative measure of the probability of selective pressure on the major or minor allele of the SNP. In some embodiments, at least part of the determining is performed using a computer. In some embodiments, the method includes selecting a genotype for analysis based on homozygosity of the genotype for the major or minor allele of the SNP. In some embodiments, the analysis includes scoring zygosity of SNP loci within a predetermined distance of the SNP to be characterized. In some embodiments, a plurality of genotypes homozygous for the major allele or a plurality of genotypes homozygous for the minor allele are analyzed. In some embodiments, the analysis further comprises determining an inferred fraction of recombinant chromosomes for one or more of the SNP loci based on the scoring. In some embodiments the SNP is not selected for characterizing based on an association of the SNP with a phenotype. In some embodiments, the method further includes comparing the value of the quantitative measure for the major or minor allele to a predetermined value. In some embodiments, the method further includes comparing the value of the quantitative measure for the major allele to the value of the quantitative measure for the minor allele. In some embodiments, the method further includes selecting or not selecting the characterized SNP for inclusion in a set of SNPs on the basis of value of the quantitative measure. In some embodiments, a plurality of SNPs are characterized. In some embodiments, the measure of the probability of selective pressure is determined by analyzing SNPs within a predetermined distance of the SNP to be characterized, e.g., a distance such that, on average, at least an additional 50, 100, or 300 SNPs are found in the distance; or, e.g., a distance that is at least about 10, 50, 200, 500, or 1000 kilobases. 
In some embodiments, the quantitative measure is determined by analyzing SNPs within a predetermined distance of the SNP to be characterized and further by determining the fraction of inferred recombinant chromosomes for a plurality of the SNPs found within the predetermined distance. The method can further include creating a list of value pairs for each of the SNPs in the plurality of SNPs found within the predetermined distance, where each value pair in the list includes a value for the distance away from the site of the SNP to be characterized and the fraction of recombinant chromosomes. The method can also include computing an average log likelihood (ALnLH) for the major or minor allele based on the sum of the square of the differences between a model of the fraction of recombinant chromosomes within a predetermined distance from a positively selected SNP allele and actual data for inferred fraction of recombinant chromosomes within the predetermined distance from the minor or major alleles of the SNPs to be characterized. In some embodiments, the ALnLH is compared with a predetermined value, e.g. an average ALnLH (Av-ALnLH) value for a plurality of SNPs, e.g., for some or all of the SNPs in a set of SNPs such as a genome-wide set of SNPs. In some embodiments the ALnLH of the minor allele and the ALnLH of the major allele are compared to the Av-ALnLH value.
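The scoring step in paragraph [0006] can be sketched as follows. This passage states only that the ALnLH is based on the sum of squared differences between a model of the fraction of recombinant chromosomes (FRC) and the observed values, so everything else here is an assumption: the saturating exponential model, its `plateau` and `scale_kb` parameters, and the Gaussian error term `sigma` used to turn squared differences into an average log likelihood. All names are hypothetical.

```python
import math


def model_frc(distance_kb, plateau=0.5, scale_kb=250.0):
    """Hypothetical expected FRC near a selected allele: rises with distance."""
    return plateau * (1.0 - math.exp(-distance_kb / scale_kb))


def sum_squared_differences(value_pairs, model=model_frc):
    """Sum of squared differences between observed and modeled FRC.

    value_pairs: list of (distance_kb, observed_frc), one pair per
    neighboring SNP within the predetermined distance.
    """
    return sum((frc - model(distance)) ** 2 for distance, frc in value_pairs)


def average_log_likelihood(value_pairs, sigma=0.1, model=model_frc):
    """Average per-point Gaussian log likelihood (an assumed error model)."""
    n = len(value_pairs)
    if n == 0:
        raise ValueError("at least one value pair is required")
    ssd = sum_squared_differences(value_pairs, model)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - ssd / (2 * sigma ** 2 * n)


# Hypothetical value pairs: one set follows the model, one is flat.
distances = (10, 50, 200, 500)
on_model = [(d, model_frc(d)) for d in distances]
flat = [(d, 0.5) for d in distances]
print(average_log_likelihood(on_model) > average_log_likelihood(flat))
```

Under these assumptions, an allele whose neighboring FRC values track the model more closely receives a higher score, which can then be compared with a dataset-wide average.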

[0007] In another aspect, the invention provides compositions. In some embodiments, the invention provides a nucleic acid array that includes SNP probes, where one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNP set selected by a method of characterizing a SNP by determining a quantitative measure of the probability of selective pressure on the major or minor allele of the SNP. In some embodiments, the one or more probes include at least 30%, 70%, or substantially all of the non-redundant probe sequences in the array.

[0008] In a further aspect, the invention provides a method of performing an array-based SNP assay by conducting a nucleic acid array-based assay on a nucleic acid sample from a subject, where the nucleic acid array includes SNP probes, where one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNP set selected by a method of characterizing a SNP by determining a quantitative measure of the probability of selective pressure on the major or minor allele of the SNP.

[0009] In a yet further aspect, the invention provides kits. In one embodiment, the invention provides a kit for use in screening a nucleic acid sample for the presence of the major or minor allele of one or more SNPs, said kit comprising a SNP detection reagent and a control nucleic acid sample, wherein the one or more SNPs are SNPs selected by a method that includes characterizing a SNP by determining a quantitative measure of the probability of selective pressure on the major or minor allele of the SNP.

[0010] In another aspect, the invention provides a collection of a plurality of different SNP profiles each paired with a specific nucleic acid source, wherein (i) said collection is recorded on a substrate and (ii) the SNP profile includes the genotypes of the SNPs selected according to a method that includes characterizing a SNP by determining a quantitative measure of the probability of selective pressure on the major or minor allele of the SNP. In some embodiments, the substrate is a computer readable medium.

[0011] In a further aspect, the invention provides a method of characterizing a SNP by determining a numerical quantity for the major or minor allele of the SNP, where the numerical quantity is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of the SNP for the major or minor allele, and where the analyzing is performed on a genotype that is homozygous for the major or minor allele. In some embodiments, the predetermined distance is such that, on average, at least an additional 50, 100, or 300 SNPs are found in the distance. In some embodiments, the predetermined distance is at least about 10, 50, 200, 500, or 1000 kilobases. In some embodiments, at least part of the determining is performed using a computer. In some embodiments, the SNP is not selected on the basis of an association with a phenotype. In some embodiments, the method further includes comparing the numerical quantity determined for the major or minor allele to a predetermined value. In some embodiments, the method further includes comparing the numerical quantity determined for the major allele to the numerical quantity determined for the minor allele. In some embodiments, the method further includes selecting or not selecting the SNP for inclusion in a set of SNPs on the basis of the value of the quantitative measure. In some embodiments, a plurality of SNPs are characterized. In some embodiments, the method further includes creating a list of value pairs for each of the SNPs in the plurality of SNPs found within the predetermined distance, where each value pair in the list comprises a value for the distance away from the site of the SNP to be characterized and the inferred fraction of recombinant chromosomes. In some embodiments, the method further includes computing an average log likelihood (ALnLH) of the major or minor allele based on the sum of the square of the differences between a model of the inferred fraction of recombinant chromosomes within a predetermined distance from a positively selected SNP allele and actual data for inferred fraction of recombinant chromosomes within the predetermined distance from the minor or major alleles of the SNPs to be characterized. In some embodiments, the method further includes comparing the ALnLH with a predetermined value. In some embodiments, the predetermined value is an average ALnLH (Av-ALnLH) value. In some embodiments, the method further includes comparing the ALnLH of the minor allele and the ALnLH of the major allele to the Av-ALnLH value.

[0012] In some embodiments, the invention provides a nucleic acid array that includes SNP probes, where one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNP set selected by a method of characterizing a SNP by determining a numerical quantity for the major or minor allele of the SNP, where the numerical quantity is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of the SNP for the major or minor allele, and where the analyzing is performed on a genotype that is homozygous for the major or minor allele. In some embodiments, the one or more probes include at least 30%, 70%, or substantially all of the non-redundant probe sequences in the array.

[0013] The invention also provides a method of performing an array-based SNP assay, comprising: conducting a nucleic acid array-based assay on a nucleic acid sample from a subject, wherein the nucleic acid array includes SNP probes, where one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNP set selected by a method of characterizing a SNP by determining a numerical quantity for the major or minor allele of the SNP, where the numerical quantity is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of the SNP for the major or minor allele, and where the analyzing is performed on a genotype that is homozygous for the major or minor allele.

[0014] The invention further provides kits for use in screening a nucleic acid sample for the presence of the major or minor allele of one or more SNPs, where the kit contains a SNP detection reagent and a control nucleic acid sample, and wherein the one or more SNPs are SNPs selected by a method of characterizing a SNP by determining a numerical quantity for the major or minor allele of the SNP, where the numerical quantity is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of the SNP for the major or minor allele, and where the analyzing is performed on a genotype that is homozygous for the major or minor allele.

[0015] The invention also provides a collection of a plurality of different SNP profiles each paired with a specific nucleic acid source, wherein (i) said collection is recorded on a substrate and (ii) the SNP profile includes the genotypes of the SNPs selected according to a method of characterizing a SNP by determining a numerical quantity for the major or minor allele of the SNP, where the numerical quantity is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of the SNP for the major or minor allele, and where the analyzing is performed on a genotype that is homozygous for the major or minor allele. In some embodiments, the substrate is a computer readable medium.

[0016] In another aspect, the invention provides sets of SNPs. In one embodiment, the invention provides a set of SNP alleles wherein one or more of the SNP alleles is weighted by a numerical value that indicates a probability of selective pressure on the one or more SNP alleles. In some embodiments, the set is stored on a computer database. In some embodiments, substantially all the SNPs in the set are assigned a numerical value. In some embodiments, the numerical value is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, wherein the analyzing is performed on a genotype that is homozygous for the major or minor allele. In some embodiments, at least part of the determining is performed using a computer.

[0017] In some embodiments, the invention provides a subset of SNP alleles selected from a larger set of SNP alleles, where the subset of SNP alleles is selected from the larger set based on numerical values assigned to the SNP alleles in the larger set, and where the numerical values are related to the degree of selective pressure on each of the SNP alleles. In some embodiments, the assigning is performed at least in part using a computer. In some embodiments, the numerical values are determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each SNP allele in the subset, wherein the analyzing is performed on a genotype that is homozygous for the major or minor allele, and at least part of the determining is performed using a computer. In some embodiments, the subset comprises an allele having an ALnLH value that is greater than 2.6 standard deviations away from the Av-ALnLH value of the entire Perlegen SNP allele dataset or the Av-ALnLH of the entire HapMap SNP allele dataset. In some embodiments, the subset contains a SNP allele that is found within at least 100 kb of a gene selected from the group consisting of IL1RAPL2, FOLH1, KIAA1463, PRRG1, S66645/TYR, TTBK2, AK096379, BC034574, AUTS2, BX537851, COH1, COL4A6, CSEN, GALK2, GLRA2, HCN1, OPHN1, OR4A5, OR4C13, OR4X1, OR5AS1, OR8I2, OR8K1, PROM2, RAPSN, RTN1, SNTG1, PSMC3, SERPINC1, ADK, AF035029, AF035035, AP3M1, CSMD3, D63480, F8, FLJ42925, LRBA, SCAP1, SPI1, SUPT3H, NCOA6, AK091585, AK131264, AK131417, AK097440, AK131413, ECM2, POTE8, RNF18, STS, SLC39A13, ZBTB37, ZNF192, ZNF193, BRODL, CSE1L, FANCC, PARP4, PNUTL2, TP53INP2, AF318371, BC046415, CPEB3, DJ467N11.1, HGNT-IV-H, HKR1, KHDRBS2, MGC46496, NT5C2, RAD51C, ZCWCC2, ZNF2, ZNF322A, ZNF37A, ZNF514, ZNF569, ZNF570, C15orf16, C6.1A, MAST2, PPM1E, UBR1, AB037807/CCM1, FLJ14442, HDHD1A, IMMP2L, LOC220594, MTMR4, TEX14, AF318346, AKAP9, CDC91L1, FUNDC2, MAGED1, SFI1, AB058732, AF116680, AK090675, AK125992, ASCC3, BC027488, C10orf68, CPNE8, FAM46D, FAM47A, FLJ32191, FLJ33979, FLJ46156, GBA, GPR23, HPSE2, KIAA0377, LOC492307, LOC51057, MAGEB6, MGC12197, MGC35232, MTCH2, OCIAD1, RABGAP1L, SH3BGRL, VCL. In some embodiments, the subset contains at least about 10, 500, or 1500 SNP alleles found within 100 kb of the gene. In some embodiments, the subset contains at least 500 SNP alleles found within 100 kb of the gene. In some embodiments, at least 10%, 50%, or substantially all of the alleles in the subset are found within 100 kb of the gene.

[0018] In another aspect, the invention provides methods of determining SNPs with predictive value. In some embodiments, the invention provides a method of determining a subset of SNPs with predictive value for a phenotype in a set of SNPs by (i) determining on case and control samples the relative frequency of one or more alleles of one or more SNPs, where the SNPs are SNPs determined to have a major or minor allele that has a high probability of having been subjected to selective pressure; (ii) for each SNP for which a frequency is determined in step (i), comparing the frequency of the occurrence of the SNP in the case population and in the control population; and (iii) selecting for inclusion in the subset of SNPs those SNPs for which a major or minor allele frequency estimate in the case samples differs from the major or minor allele frequency in the control samples by at least 1.5 standard deviations. In some embodiments, the method is carried out at least in part using a computer. In some embodiments, in step (i) the major or minor alleles are determined to have a high probability of having been subjected to selective pressure by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, wherein the analyzing is performed on a genotype that is selected for analysis on the basis of homozygosity for the major or minor allele. In some embodiments, the case samples are from individuals with one or more phenotypic characteristics of a pathological condition, e.g., a neurodegenerative disease, a psychiatric disease, a metabolic disorder, a cardiovascular disease, an infectious disease, or a cancer. In some embodiments the disease is a neurodegenerative disease, e.g., Alzheimer's disease, Pick's disease, Lewy body dementia, or corticobasal degeneration. In some embodiments, the neurodegenerative disorder is Alzheimer's disease.
In some embodiments, the invention provides a database that contains some or all of the data on cases and/or controls, the SNPs investigated and/or selected for inclusion in the set, where the data and/or database may be stored in a computer readable storage medium. The invention also includes sending such data or database or information derived from the data or database via electronic signal, e.g., via the internet, from one location to another. The invention also includes software for analyzing the data and/or database.
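Step (iii) of paragraph [0018] can be illustrated with a short sketch. The passage does not say which standard deviation the 1.5 threshold refers to; this example assumes the binomial standard error of the difference between two allele frequencies. The function names, the input layout, and the example counts are hypothetical.

```python
import math


def frequency_difference_in_sd(case_count, case_total, control_count, control_total):
    """Absolute case/control allele-frequency difference in units of its SE.

    Counts are allele counts and totals are chromosomes sampled
    (two per individual). The binomial standard error is an assumption.
    """
    p_case = case_count / case_total
    p_control = control_count / control_total
    se = math.sqrt(
        p_case * (1 - p_case) / case_total
        + p_control * (1 - p_control) / control_total
    )
    if se == 0:
        return 0.0 if p_case == p_control else math.inf
    return abs(p_case - p_control) / se


def select_predictive_snps(counts, threshold_sd=1.5):
    """Return SNP labels whose frequency difference meets the threshold.

    counts: dict mapping a SNP label to
    (case_count, case_total, control_count, control_total).
    """
    return [
        snp_id
        for snp_id, values in counts.items()
        if frequency_difference_in_sd(*values) >= threshold_sd
    ]


# Hypothetical allele counts for two SNPs in 100 cases and 100 controls.
counts = {"snp_a": (60, 200, 40, 200), "snp_b": (50, 200, 50, 200)}
print(select_predictive_snps(counts))
```

In the method as described, the input SNPs would already have been restricted to those with a high probability of selective pressure, so this comparison runs on a much smaller set than a genome-wide scan.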

[0019] In some embodiments, the invention provides a method of determining a predictive value for a phenotype for a SNP in a set of SNPs, where the SNPs in the set of SNPs each have been assigned a numerical value based on the probability of selective pressure on that SNP, by (i) determining on case and control samples the relative frequency of the major and minor allele of the SNP and comparing the frequency of the occurrence of the major and minor allele of the SNP in the case population and in the control population; and (ii) combining the results of step (i) with the numerical value for the SNP to assign to the SNP a predictive value for the phenotype. In some embodiments, the method is carried out at least in part using a computer. In some embodiments, the probability of selective pressure has been determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, wherein the analyzing is performed on a genotype that is selected for analysis on the basis of homozygosity for the major or minor allele. In some embodiments, the case samples are from individuals with one or more phenotypic characteristics of a pathological condition, e.g., a neurodegenerative disease, a psychiatric disease, a metabolic disorder, a cardiovascular disease, an infectious disease, or a cancer. In some embodiments the disease is a neurodegenerative disease, e.g., Alzheimer's disease, Pick's disease, Lewy body dementia, or corticobasal degeneration. In some embodiments, the neurodegenerative disorder is Alzheimer's disease. In some embodiments, the invention provides a database that contains some or all of the data on cases and/or controls, the SNPs investigated and/or selected for inclusion in the set, where the data and/or database may be stored in a computer readable storage medium.
The invention also includes sending such data or database or information derived from the data or database via electronic signal, e.g., via the internet, from one location to another. The invention also includes software for analyzing the data and/or database.

[0020] In a further aspect, the invention provides methods of diagnosis, prognosis, determination of treatment, or determination of status of treatment. In some embodiments, the invention provides a method of determining a diagnosis, prognosis, or status of treatment for an individual, by (i) determining the identity of the alleles for a SNP from a sample obtained from the individual, where each allele of the SNP has been assigned a value based on a quantitative measure of the probability of selective pressure on the SNP, and (ii) determining a diagnosis for the individual based on the identity of the allele or alleles for the SNP. In some embodiments, the method is carried out at least in part with the use of a computer. In some embodiments, the SNP is assigned a weighted value based on the quantitative measure of selective pressure on the SNP, and said diagnosis is based on an analysis of a combination of the weighted value and the identity of the allele or alleles for the SNP. In some embodiments, the identities of the alleles for a plurality of SNPs are determined, where each allele of each of the SNPs has been assigned a value based on a quantitative measure of selective pressure on the SNP, and a diagnosis, prognosis, or treatment is determined based on the identity of the alleles for said plurality of SNPs.

[0021] In some embodiments, the invention provides a method of stratifying a population of individuals, where the stratification is based on likelihood of exhibiting a phenotype, and where the method includes (i) determining the identity of the alleles for a SNP for the individuals, wherein each allele of the SNP has been assigned a value based on a quantitative measure of selective pressure on the SNP; and (ii) determining the position for the individual in the stratification based on the identity of the allele or alleles for the SNP. In some embodiments, the method is carried out at least in part using a computer. In some embodiments, the identities of the alleles for a plurality of SNPs are determined for the individual, where each allele of each of the SNPs has been assigned a value based on a quantitative measure of selective pressure on the SNP, and the position for the individual in the stratification is determined based on the identity of the alleles for said plurality of SNPs. In some embodiments, the phenotype is response to a treatment, e.g., administration of a drug. In some embodiments, the response comprises a therapeutic response to administration of the drug. In some embodiments, the response comprises a side effect of the drug, e.g., a negative side effect.

[0022] In a yet further aspect, the invention provides business methods. In some embodiments, the invention provides a business method that includes: a) collecting case samples, e.g., more than about 10, 50, 100, 500, or 1000 case samples, representing a clinical phenotypic state and control samples, e.g., more than about 10, 50, 100, 500, or 1000 control samples representing patients without said clinical phenotypic state; b) detecting in each sample the presence or absence of one or more SNP alleles selected from the subset of SNP alleles as described herein or selected by a method as described herein; c) identifying representative patterns of the occurrence of the selected SNP alleles that distinguish datasets from case samples and control samples; d) marketing diagnostic products that use said representative patterns to identify said phenotypic state or a predisposition to said phenotypic state with a disposable device; and e) selling said disposable device. In some embodiments, the products are marketed in a clinical reference laboratory. In some embodiments, the marketing step markets kits. In some embodiments, the kits are FDA approved kits. In some embodiments, the phenotypic state is a drug response phenotype and the method further includes the step of collecting a royalty on said drug. In some embodiments, the methods further include the step of collecting said samples in collaboration with a collaborator. In some embodiments, the collaborator is an academic collaborator. In some embodiments, the collaborator is a pharmaceutical company, and in some of these embodiments, the pharmaceutical company collects said samples in a clinical trial. In some embodiments, patterns are used to segregate a drug response phenotype. In some embodiments the methods include collecting a royalty on the drug. In some embodiments, the step of marketing diagnostic products is performed by the same company as the company performing the identifying step. In some embodiments, at least 50, 100, 500, 100, 200, or 10,000 of the selected SNP alleles are detected. In some embodiments, the marketing step markets a nucleic acid array detection system used to identify said representative states in patient samples. In some embodiments, said diagnostic products use a nucleic acid array platform. In some embodiments, said diagnostic products are marketed with a nucleic acid array. In some embodiments, said diagnostic products are marketed by a diagnostic partner. In some embodiments, the phenotype is a drug response phenotype. In some embodiments, the phenotype is a drug resistance phenotype. In some embodiments, the phenotype is a disease stage phenotype. In some embodiments, the phenotype is a disease state phenotype. In some embodiments, the phenotype is a treatment selection phenotype. In some embodiments, the phenotype is a disease diagnostic phenotype. In some embodiments, the phenotype is a drug toxicity phenotype. In some embodiments the phenotype is an adverse drug response phenotype. In some embodiments revenue is derived from sales of nucleic acid arrays, informatics tools, patterns and/or computer programs for classifying samples and/or from services that provide diagnostic information and/or pattern discovery and/or validation.

[0023] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of skill in the art to which the claimed subject matter belongs.

[0024] It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of any subject matter claimed. In this application, the use of the singular includes the plural unless specifically stated otherwise. It must be noted that, as used in the specification and the appended claims, the singular forms "a," "an" and "the" include plural referents unless the context clearly dictates otherwise. In this application, the use of "or" means "and/or" unless stated otherwise. Furthermore, use of the term "including" as well as other forms, such as "include," "includes," and "included," is not limiting.

BRIEF DESCRIPTION OF THE FIGURES

FIG 1 is a schematic depiction of a process for finding unusual genetic architectures in which (A) a set of genotypes are selected on the basis of homozygosity for the major or minor allele of a SNP of interest (arrow) to produce (B) a set of genotypes homozygous for the major allele (top) and a set of genotypes homozygous for the minor allele (middle). Afterwards, the distance (d_1-d_3) and the FRC are computed and stored for each SNP neighboring the SNP of interest. The stored values are used to determine the ALnLH for the SNP allele of interest.

FIG 2 is a set of scatter plots depicting the observed FRC versus distance for two minor alleles well established as positively selected alleles: DRD4 7R (left-most curve and inset) and G6PD V202M (right-most curve).

FIG 3 is an example of a SNP, in the Reticulon 1 (RTN1) gene, identified as having a high probability of selection in three separate SNP datasets. (A) Schematic illustration of SNPs in the region of RTN1. (B-D) Scatter plots of the observed FRC versus distance for an RTN1 SNP found in (B) the Perlegen dataset, (C) the HapMap CEU dataset, and (D) the HapMap YRI dataset.

FIG 4 is a distribution plot of the Gene Ontology Category Log(EASE) scores for the 407 identified genes falling within ± 100 Kb of selected SNPs in the HapMap (CEU) dataset. Categories 1-6 are the most overrepresented gene ontology categories for the 407 identified genes.

FIG 5 depicts a series of ALnLH profiles over genomic distance. (A) Actual average log likelihood (ALnLH) distribution for human chromosome 7. (B) Each SNP position on HapMap CEU chromosome 7 data was randomly scrambled, and the ALnLH was calculated. In this model, the population is assumed to have constant effective size n = 10,000 until 5,000 generations before the present, followed by exponential expansion to 5 billion. It is the LD observed in African ancestry (YRI) populations. In this randomized data set, the probability of obtaining an ALnLH score of 0.2 (line) is 0.0005 and 0 for the cut-off zone used in this study (0.71). (C) A total of 162 haplotypes from the randomized data set were "infused" by computer simulation with 18 copies of a single haplotype. This extreme admixture/bottleneck model represents a population of 90 individuals in which 20% of the chromosomes come from a single individual (10% from each homolog). Subsequent "generations" were simulated to recombine once per chromosome arm per generation. The genotypes were generated in each generation by random assortment, and ALnLH values were calculated (for 50 generations or ≈40,000 years). Average fragment size of the original infused haplotype was obtained per generation. An ALnLH above 0.71 was never observed in this extreme simulation. (D) Ten haplotypes were assembled from actual HapMap CEU genotypes representing a bottleneck of five individuals. Haplotypes were selected to contain the unusual genetic architectures. Again, recombinations were randomly generated, and genotypes and ALnLH values were calculated for 1,000 generations. ALnLH values of >0.71 rapidly decay to a level 1/10 that initially observed in actual HapMap data by generation 40 (P = 0.0017). After 400 generations, essentially no ALnLH values of >0.71 were observed (P = 0.00006).

FIG 6 The fragment length decay of an infused haplotype. The average length of an infused haplotype, representing a chromosomal contribution of 20% to a population of 90 individuals (Model in Fig. 5C) is plotted vs. simulated generations.

FIG 7 The genomic architecture surrounding an unusually high ALnLH score (0.6) obtained during admixture simulations. The eight admixture/bottleneck simulations described in Fig. 5C never produced an ALnLH value of >2.6 SD (0.71). Occasional high (>0.5) values were obtained, however. One of these (0.6) is shown, with the minor allele FRC shown as square symbols and the major allele FRC as diamond symbols. Note the lack of LD decay with distance produced by admixture (this figure) in comparison with inferred selection (e.g., Fig. 3).

FIG 8 Human chromosome maps of SNPs inferred to have an allele under selective pressure. SNP positions for the Perlegen (PLG) and HapMap (CEU, CHB, JPT, and YRI) data sets are shown. SNP sets corresponding to the marked positions are listed in Table 1 (Perlegen), Table 3 (CEU), Table 5 (CHB), Table 7 (JPT), and Table 9 (YRI).

FIG 9 is a scatter plot of FRC versus distance for the minor allele (square symbols) and major allele (diamond symbols) of rs10801551, a selected SNP in the Perlegen dataset, which is found proximal to complement factor H (CFH), a gene recently linked to macular degeneration by an SNP association study. See Klein et al. (2005), Science, 308(5720):385-389.

FIG 10 is a scatter plot of FRC versus distance for the minor allele (square symbols) and major allele (diamond symbols) of rs11684454, a selected SNP in the HapMap (CEU) dataset, which is found proximal to insulin induced gene 2 (INSIG2), a gene recently linked to obesity by an SNP association study. See Herbert et al. (2006), Science, 312(5771):279-283.

DETAILED DESCRIPTION

I. Methods of Characterizing SNPs

II. Sets of SNPs

III. Association Methods

IV. Predictive Methods

V. Compositions

VI. Business Methods

[0025] The invention provides methods of characterizing SNPs, sets of SNPs, genes, and/or gene fragments characterized by such methods, reagents and compositions using SNPs and sets of SNPs, genes, and/or gene fragments characterized by such methods, methods of assigning predictive value to characterized SNPs, diagnostic, prognostic, and treatment methods and compositions based on such predictive values, and business methods.

I. Methods of Characterizing SNPs

[0026] The invention provides methods and compositions for characterizing a SNP of interest (SOI) that include determining a quantitative measure of the probability that selective pressure biased the major or minor allele frequencies of an SOI. In some embodiments, the invention provides methods and compositions for characterizing a SOI by sorting genotypes by homozygosity, and/or by examining recombination rates for SNPs within a predetermined distance from the SOI. The characterization of the SNP (e.g., based on its probability of selective pressure, or more empirically simply based on the degree of recombination of SNPs within a predetermined distance of the SOI in a homozygotic genotype) allows refinement of further uses of the SOI, e.g., association studies that may be done with the SOI, or with a set of SOIs that have been characterized in like manner. In some embodiments, association studies may be done using only a subset of SNPs selected from a larger set, where the SNPs are selected based on their characterization by the methods of the invention. For example, SNPs with a high degree of probability of selective pressure are included in the set and those with a low degree are excluded; more empirically, SNPs whose degree of recombination of nearby SNPs in a homozygotic genotype is on one side of a threshold value are included in the set and those on the other side are excluded. Alternatively, each SNP in a set of SNPs (e.g., a genomic set of SNPs) may be weighted based on its degree of probability of selective pressure; more empirically, each SNP may be weighted based on the degree of recombination of nearby SNPs. The weight may be used to determine threshold values for the difference in frequency of an allele in, e.g., an association study, to be considered significant. In this scenario, the entire set of SNPs is used, but SNPs more likely to show functional significance are accorded a greater weight and, typically, a lower threshold value for a difference in allele frequency to be considered significant.

[0027] Further embodiments include sets or subsets of SNPs characterized through the methods of the invention, and genes or gene fragments associated with these SNPs. These sets or subsets may be used in association studies, as described above. The sets or subsets may also be used in, e.g., diagnosis, prognosis, and treatment determination. The sets or subsets of SNPs, genes or gene fragments, or their complements may be contained in compositions, e.g., arrays of nucleic acids and kits containing such compositions, that allow a practitioner to conveniently characterize SNPs from a sample (e.g., a blood sample from an individual, e.g., a patient) and to analyze patterns of SNPs. Methods are provided for analyzing such patterns, and for comparison with a database to determine a diagnosis, prognosis, or treatment.

[0028] SNPs are single base pair positions in DNA at which different alleles, or alternative nucleotides, exist in some population. The SNP position, or SNP locus, is usually preceded by and followed by highly conserved sequences (e.g., sequences that vary in less than approximately 1/100 to 1/1000 members of the population). An individual may be homozygous or heterozygous for an allele at each SNP position. SNPs are useful in association studies in which the occurrence of certain SNP alleles in a particular phenotypic population is correlated to (i.e., "associated with") the occurrence of the phenotype (or phenotypes), e.g., a health condition or a response to a drug or other environmental factor. The existence of a preferential occurrence of a functional gene in association with specific alleles of linked markers (e.g., SNPs) is called "Linkage Disequilibrium" (LD).

[0029] It has been hypothesized that common DNA variants underlie many common health conditions and that these high frequency variants can be identified by means of linkage disequilibrium (LD) to nearby SNPs. See, e.g., Risch et al. (1996), Science, 273:1516-1517; and Zwick et al. (2000), Annu. Rev. Genomics. Further, it has been inferred that these common, "disease-associated" DNA variants (along with closely linked SNPs) likely reached polymorphic frequencies in the human population due to their association with phenotypes that have undergone positive selection, e.g., that have undergone positive selection relatively recently (less than about 50,000 years ago) for certain phenotypes (e.g., fat storage). However, those positively selected phenotypes now interact with environmental factors peculiar to life in modern society (e.g., high fat diet, high stress, and extended longevity) and thereby now contribute to a wide variety of common health problems. See, e.g., Zwick et al., supra; and Ding et al. (2002), Proc. Natl. Acad. Sci. USA, 99:309-314.

[0030] Thus, without wishing to be bound by theory, it is predicted that the SNPs most likely to be prognostic or diagnostic for common health conditions, or for specific interactions with the modern environment (e.g., predictive of a drug response) will be the subset of SNPs closely linked to genes/genomic regions having a high probability of positive selection, e.g., recent positive selection. See Wang et al. (2006), Proc. Natl. Acad. Sci. USA, 103(1):135-140. Given the limitations of population genetics tests to date, however, it has not been feasible to generate a quantitative measure, for each and every SNP of interest (SOI), of the probability that positive selection biased one of its allele frequencies.

[0031] The invention provides methods of characterizing SNPs that are thought to indicate the probability of selective pressure on the SNP. However, it is understood that the methods provide empirical means to characterize SNPs that are useful whether or not the characterization indicates the probability of selective pressure on the SNP.

[0032] In some embodiments, the methods described herein include selecting an SOI from a set of mapped SNPs for which genotypes have been determined in a group of individuals. The methods typically involve the use of genotypes from individuals in a population who are homozygotic for the SOI. It will be appreciated that the frequency of the minor allele of the SOI to be studied in the population should be high enough to allow for enough homozygotic individuals within the population to carry out the methods. Thus, the larger the population that is being analyzed, the lower the frequency of the minor allele may be for the methods to produce meaningful results. One of skill in the art can readily determine the cutoff necessary to produce results of the desired significance in a population of a given size. Thus, in some embodiments, the SOI has a minor allele frequency of at least about 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 6%, 7%, 8%, 9%, 10%, 12%, 14%, 16%, 18%, 20%, 25%, 30%, 35%, 40%, or 45%. In some embodiments, the SOI has a minor allele frequency of at least about 1%. In some embodiments, the SOI has a minor allele frequency of at least about 5%. In some embodiments, the SOI has a minor allele frequency of at least about 10%. In some embodiments, the SOI has a minor allele frequency of at least about 20%.
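
By way of non-limiting illustration only, the relationship between population size and the usable minor allele frequency can be sketched as follows. The sketch assumes Hardy-Weinberg proportions, so that the expected fraction of minor-allele homozygotes is the square of the minor allele frequency; that assumption, the function names, and the target of ten homozygotes are hypothetical choices made for the example and are not requirements of the methods.

```python
import math


def expected_minor_homozygotes(n_individuals: int, maf: float) -> float:
    """Expected number of minor-allele homozygotes.

    Assumes Hardy-Weinberg proportions (fraction q^2); this is an
    illustrative assumption, not a requirement of the described methods.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    return n_individuals * maf ** 2


def min_maf_for_homozygotes(n_individuals: int, min_homozygotes: float) -> float:
    """Smallest MAF expected to yield `min_homozygotes` minor homozygotes.

    A result above 0.5 means the cohort is too small for that target.
    """
    return math.sqrt(min_homozygotes / n_individuals)


if __name__ == "__main__":
    # Cohort sizes and the target of 10 homozygotes are illustrative only.
    for n in (100, 1000, 10000):
        print(n, round(min_maf_for_homozygotes(n, 10), 3))
```

Under these assumptions, a tenfold larger cohort lowers the minimum usable minor allele frequency by a factor of about 3.2 (the square root of ten).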

[0033] In some embodiments, the mapped SNPs for which genotypes have been determined in a group of individuals are SNPs from a chromosomal region. In other embodiments, the mapped SNPs are SNPs from an entire chromosome. In further embodiments, the mapped SNPs are SNPs from the entire genome (e.g., a human genome). For example, human genome-wide SNP genotype data sets (including allele frequency and allele zygosity for each SNP) are available to the public through several SNP mapping projects. See, e.g., the Perlegen SNP map data (over 1.5 million SNPs genotyped in 71 individuals), Hinds et al. (2005), Science, 307:1072-1079, which is accessible through the world wide web at genome.perlegen.com. See also the International Haplotype Map ("HapMap") project data (1.1 million SNPs genotyped in 270 individuals), The International HapMap Consortium (2005), Nature, 437: 1299-1320, which is accessible through the world wide web at hapmap.org. It will be appreciated that the source of the SNP data is not crucial, and that as more detailed data sets become available (e.g., data sets that include more or different sets of SNPs, and/or data sets that include genotypes from more individuals and/or individuals of different population groups), the methods of the invention may be used to further refine characterization of the SNPs in the database. For example, as noted above, a database with genotype data from a larger number of individuals allows SNPs whose minor allele occurs at lower frequencies to be examined. Further examples of the effects of the data included in the data set are described elsewhere herein.

[0034] Any suitable method for determining a quantitative measure of the probability of selective pressure on a SOI may be used in the methods of the invention. In some embodiments, the methods include selecting genotypes for analysis that are homozygous for the major or minor allele of the SOI, so as to obtain one set of selected genotypes that are homozygous for the major allele of the SOI, and another set of genotypes that are homozygous for the minor allele of the SOI. Genotypes selected according to this SOI-homozygosity criterion may then be analyzed for linkage decay within a predetermined distance upstream and/or downstream from the SOI.

[0035] For example, genotypes selected according to this SOI-homozygosity criterion may be scored for major/minor allele zygosity at SNP loci mapped to various distances upstream or downstream (± d_1, d_2, d_3, ..., d_x) from the SOI locus (i.e., "adjacent SNP loci"). In some embodiments, zygosity is determined for all or at least a portion (e.g., at least about 10%, 20%, 30%, 40%, 60%, 70%, 80%, or 90%) of the adjacent SNPs mapped within a distance (e.g., a predetermined distance) of about ± 10 kb (e.g., about ± 20 kb, 30, 40, 50, 80, 100, 200, 300, 400, or 500 kb) from the SOI locus. In some embodiments, the distance is about ± 50 kb from the SOI locus. In some embodiments, the distance is about ± 100 kb from the SOI locus. In some embodiments, the distance is about ± 200 kb from the SOI locus. In some embodiments, the distance is about ± 300 kb from the SOI locus. In some embodiments, the distance is about ± 400 kb from the SOI locus. In some embodiments, the distance is about ± 500 kb from the SOI locus. The distance may be selected based on the density of SNPs in the database being used and the degree of significance for probability of selective pressure that is acceptable. All other things being equal, a database with higher average SNP density allows for a shorter distance from the SOI to give the same likelihood of significance as a longer distance in a database with a lower SNP density. This distance is a function of the average number of adjacent SNPs found in the distance in the genome or portion of the genome examined. In some embodiments, the distance is selected so that the average number of adjacent SNPs found within the distance from the SOI is at least about 10, 20, 30, 40, 50, 75, 100, 125, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 1500, 2000 or more than 2000. For example, the average number of SNPs may be from about 10-2000 SNPs.

[0036] This procedure is then reiterated for at least a portion (e.g., at least about 10%, 20%, 30%, 50%, 70%, 80%, 90%, or 100%) of the genotypes in the above-described sets of selected genotypes. For each of the adjacent SNP loci that are heterozygous, a chromosome recombination event is inferred. Thus, an inferred fraction of recombinant chromosomes (FRC) can be computed for any of the adjacent SNP loci. The inferred FRC for a given adjacent SNP locus can then be calculated simply as the total number of genotypes heterozygous for the adjacent SNP locus divided by the number of chromosomes analyzed in the selected genotype set (i.e., two chromosomes/genotype), as illustrated in the example set forth in Figure 1.
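
By way of non-limiting illustration only, the selection of SOI-homozygous genotypes and the computation of the inferred FRC at adjacent SNP loci can be sketched as follows. The genotype coding (0, 1, or 2 copies of the minor allele), the SNP identifiers, the map positions, and the function names are all hypothetical and are introduced solely for the example.

```python
from typing import Dict, List, Tuple

# SNP id -> minor-allele count (0, 1, or 2). This coding is an assumption.
Genotype = Dict[str, int]


def select_homozygotes(genotypes: List[Genotype], soi: str,
                       allele_count: int) -> List[Genotype]:
    """Keep genotypes homozygous at the SOI (0 = major/major, 2 = minor/minor)."""
    if allele_count not in (0, 2):
        raise ValueError("allele_count must be 0 or 2")
    return [g for g in genotypes if g[soi] == allele_count]


def inferred_frc(selected: List[Genotype], snp: str) -> float:
    """Heterozygous genotypes at `snp` divided by chromosomes analyzed (2 per genotype)."""
    if not selected:
        raise ValueError("no homozygous genotypes selected")
    heterozygous = sum(1 for g in selected if g[snp] == 1)
    return heterozygous / (2 * len(selected))


def ldd_value_pairs(genotypes: List[Genotype], positions: Dict[str, int],
                    soi: str, allele_count: int,
                    max_distance_bp: int) -> List[Tuple[int, float]]:
    """Return (signed distance, FRC) pairs for adjacent SNPs, ordered by distance."""
    selected = select_homozygotes(genotypes, soi, allele_count)
    pairs = []
    for snp, pos in positions.items():
        distance = pos - positions[soi]
        if snp != soi and abs(distance) <= max_distance_bp:
            pairs.append((distance, inferred_frc(selected, snp)))
    return sorted(pairs)


# Hypothetical toy data: four mapped SNPs, five individuals.
POSITIONS = {"soi": 1000, "snp_a": 1500, "snp_b": 400, "snp_c": 200000}
GENOTYPES = [
    {"soi": 2, "snp_a": 2, "snp_b": 1, "snp_c": 1},
    {"soi": 2, "snp_a": 1, "snp_b": 1, "snp_c": 0},
    {"soi": 0, "snp_a": 0, "snp_b": 0, "snp_c": 1},
    {"soi": 1, "snp_a": 1, "snp_b": 1, "snp_c": 1},
    {"soi": 0, "snp_a": 1, "snp_b": 0, "snp_c": 2},
]

if __name__ == "__main__":
    print("minor:", ldd_value_pairs(GENOTYPES, POSITIONS, "soi", 2, 100_000))
    print("major:", ldd_value_pairs(GENOTYPES, POSITIONS, "soi", 0, 100_000))
```

In this toy data, the individual heterozygous at the SOI is excluded from both sets, and the SNP lying roughly 199 kb away falls outside the ± 100 kb window.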

[0037] For each adjacent SNP locus SNP_n, its distance away from the SOI locus (d_n) and its associated FRC (FRC_n) are then recorded as a value pair in a list. See, e.g., Fig. 2B. As described herein, ordering a set of (d_n, FRC_n) value pairs by distance provides a measure of linkage disequilibrium as a function of distance from the SOI, which is referred to herein as linkage disequilibrium decay (LDD). In some embodiments, value pair lists are generated for both the major and minor allele of the SOI.

[0038] While not wishing to be bound by theory, LDD from a positively selected SNP allele locus is expected to be relatively "shallower" than LDD from a selectively "neutral" SNP site or from the genomic average LDD. Indeed, this is the case, e.g., for the G6PD V202M allele, which, based on several criteria, is generally considered to have undergone positive selection. See, e.g., Sabeti et al. (2002), Nature, 419:832-837. Thus, as described herein, SNP loci previously established as having undergone positive selection (e.g., the G6PD V202M locus) can serve to construct a computerized test model of the expected LDD for an SNP allele under positive selective pressure (PSML).

[0039] As will be apparent to one of skill in the art, any suitable model to describe progressive decay or any other pattern of linkage disequilibrium surrounding a selected allele can be used in the methods of the invention. In some embodiments, the PSML is approximated by a standard sigmoidal curve, which is consistent with prior work on allele age calculations and the acknowledgment that inferred recombination has a maximum value of 0.5. See, e.g., Sabeti et al., supra; Ding et al., supra; Wang et al. (2004), Am. J. Hum. Genet., 74:931-944; Serre et al. (1990), Hum. Genet., 84:449-454; Slatkin et al. (2000), Annu. Rev. Genomics Hum. Genet., 155:1405-1413; or Fay et al. (2000), Genetics, 155:1405-1413. Nevertheless, the PSML can also be approximated by various linear or exponential curves (depending on the assumptions made regarding recombination), as will be apparent to the skilled artisan. Further suitable models and statistical methods of use in the invention may be found in, e.g., P. Baldi and S. Brunak, Bioinformatics: the Machine Learning Approach, MIT Press (1998), second revised edition (2001).

[0040] In one embodiment, the PSML is approximated as a sigmoidal curve according to Equation 1:

(Eq. 1)

The regression model based on Eq. 1 relates the FRC on the y-axis to distance from a SNP locus on the x-axis, where β > 0, λ > 0 for the right-side model and λ < 0 for the left-side model, with ε representing Gaussian additive noise with mean 0 and variance σ². For a given SNP locus or group of loci, the shape and offset parameters λ and β can be estimated by maximum likelihood methods. Alternatively, maximum a posteriori methods can be used with appropriate priors on λ and β, as will be appreciated by those of ordinary skill in the art. For example, it is possible to put a background prior distribution on one or more of these parameters that comes from available polymorphism (e.g., SNP) data, e.g., the Perlegen or HapMap datasets or any subset thereof. In Eq. 1, λ is inversely proportional to the point of inflection for the sigmoid, β is directly related to the y-intercept, and σ represents the mean genome deviation. The deviation term allows for both experimental error and deviations in local recombination rate. See, e.g., Kong et al. (2002), Nat. Genet., 31:241-247. As in the case of λ and β, the deviation parameter σ can be estimated on a local level at the level of a single SNP or more globally at the level of multiple SNPs (e.g., SNPs that are colocated or SNPs that are geographically separated but that are grouped by some characteristic, such as function), chromosomal regions, entire chromosomes, or on a genome-wide scale.

[0041] Any suitable allele for which positive selection has been established may be used as the basis of the PSML. In one embodiment, the PSML is based on the observed LDD for the G6PD V202M allele, for which positive selection is well established. See Sabeti et al., supra.
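
By way of non-limiting illustration only, a sigmoidal PSML and the estimation of its parameters can be sketched as follows. The functional form used in the sketch, F(x) = 0.5 / (1 + β·exp(−λx)), is an assumption chosen to be consistent with the properties stated above (a maximum of 0.5, an inflection point inversely proportional to λ, and a y-intercept determined by β); it is not asserted to be Equation 1. Because the noise is taken to be Gaussian with fixed variance, the maximum likelihood estimate coincides with the least-squares fit, which the sketch finds by a simple grid search over hypothetical candidate values.

```python
import math
from typing import List, Sequence, Tuple


def psml(distance_bp: float, lam: float, beta: float) -> float:
    """Assumed sigmoid with upper asymptote 0.5 (illustrative form only)."""
    return 0.5 / (1.0 + beta * math.exp(-lam * distance_bp))


def sum_sq_error(pairs: Sequence[Tuple[float, float]],
                 lam: float, beta: float) -> float:
    """Sum of squared differences between observed FRC and the model."""
    return sum((frc - psml(d, lam, beta)) ** 2 for d, frc in pairs)


def fit_psml(pairs: Sequence[Tuple[float, float]],
             lams: Sequence[float],
             betas: Sequence[float]) -> Tuple[float, float]:
    """Grid-search least squares, i.e. maximum likelihood under
    Gaussian additive noise of fixed variance."""
    candidates = [(lam, beta) for lam in lams for beta in betas]
    return min(candidates, key=lambda p: sum_sq_error(pairs, p[0], p[1]))


if __name__ == "__main__":
    # Synthetic, noise-free right-side data from hypothetical parameters.
    true_lam, true_beta = 2e-5, 20.0
    data: List[Tuple[float, float]] = [
        (d, psml(d, true_lam, true_beta)) for d in range(0, 500_001, 25_000)
    ]
    print(fit_psml(data, [1e-5, 2e-5, 4e-5], [5.0, 20.0, 80.0]))
```

A grid search is used here only to keep the sketch free of external dependencies; any standard optimizer, or a maximum a posteriori estimate with priors on λ and β, could be substituted.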

[0042] Once a PSML is generated, a set of statistical comparisons can be made to obtain a quantitative measure of the probability that selective pressure biased an allele frequency of the SOI. Any suitable statistical comparison may be used. For example, three sets of statistical comparisons can be made. In the first set, average log likelihood (ALnLH), a statistical measure of fit between actual LDD data and a PSML, is determined based on the sum of the square of the differences between actual LDD data and the PSML. The ALnLH is determined for one or both SOI alleles. In addition, the ALnLH is determined for the LDD of every SNP locus in the entire dataset (e.g., every SNP allele in a genome-wide dataset), and the average ALnLH (Av-ALnLH) for the entire dataset is computed.

[0043] An ALnLH can be computed, e.g., according to Equation 2:

\[
\ln P(D \mid M) \;\propto\; \sum_{i=1}^{N} \ln P\bigl(Y_i \mid F(X_i)\bigr) \qquad \text{(Eq. 2)}
\]

[0044] In Eq. 2, D refers to the allele being compared, M is the PSML value, N is the number of adjacent SNP loci within a distance of at least ± 10 kb (e.g., at least ± 20, 30, 40, 50, 80, 100, 200, 300, 400, or 500 kb) from the SOI locus. X is the vector containing the distance (in bp) of each ith adjacent SNP locus to the SOI (i.e., X_i is the distance of adjacent SNP_i from the SOI locus). F(X_i) is the expected frequency according to the PSML. All values of X are weighted equally. As will be apparent to one skilled in the art, additional standard statistical metrics can be computed including p values, confidence intervals and posterior distributions.
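
By way of non-limiting illustration only, an average log likelihood of the kind expressed by Eq. 2 can be sketched as follows. The sketch assumes that Y_i is the observed FRC at adjacent SNP_i, that each term is a Gaussian log density with standard deviation σ centered on the PSML value F(X_i), and that the "average" is taken by dividing the sum by N. The expected curve passed in as `model`, the value of σ, and the data points are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def gaussian_log_density(y: float, mean: float, sigma: float) -> float:
    """Natural log of the normal density N(mean, sigma^2) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (y - mean) ** 2 / (2.0 * sigma ** 2))


def alnlh(pairs: Sequence[Tuple[float, float]],
          model: Callable[[float], float],
          sigma: float) -> float:
    """Average of ln P(Y_i | F(X_i)) over the N adjacent SNP loci.

    All loci are weighted equally. Dividing by N is an assumption made
    for this sketch.
    """
    if not pairs:
        raise ValueError("at least one (distance, FRC) pair is required")
    total = sum(gaussian_log_density(frc, model(d), sigma) for d, frc in pairs)
    return total / len(pairs)


def toy_model(distance_bp: float) -> float:
    """Hypothetical expected-FRC curve used only to exercise the function."""
    return min(0.5, abs(distance_bp) / 400_000)


if __name__ == "__main__":
    close_fit = [(0, 0.0), (100_000, 0.25), (200_000, 0.5)]
    poor_fit = [(0, 0.3), (100_000, 0.45), (200_000, 0.2)]
    print(alnlh(close_fit, toy_model, 0.05))
    print(alnlh(poor_fit, toy_model, 0.05))
```

Because each term decreases with the squared difference between the observed and expected FRC, the score is highest for LDD data that track the PSML closely, consistent with the sum-of-squares description given above.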

[0045] In a second statistical comparison, the standard deviation of an SOI allele ALnLH from the Av-ALnLH is determined. Without wishing to be bound by theory, the probability that an SOI allele frequency was biased by selective pressure is directly correlated with the magnitude of the standard deviation of the SOI allele ALnLH from the Av-ALnLH. This is because the majority of the SNPs in a database are not subject to selective pressure, so the Av-ALnLH reflects a value biased toward non-selection. Thus, the further the ALnLH of a SOI deviates from the Av-ALnLH, the more likely it is that the SOI has been subject to selective pressure.

[0046] While the method has been described in terms of ALnLH and Av-ALnLH, it will be appreciated that any method that ascribes a quantitative measure of the probability of selective pressure on an SOI may be used. The SOI may be characterized by the quantitative measure.

[0047] In a third, optional, statistical comparison, the standard deviation of an SOI allele ALnLH from the Av-ALnLH is compared to the standard deviation of the ALnLH of the alternate allele of the SOI from the Av-ALnLH. This comparison is useful in that it provides a measure of the local recombination rate relative to the average (e.g., a genome average). Without wishing to be bound by theory, it is expected that the LDD of a positively selected SOI allele will deviate significantly from the average LDD, but the LDD of the alternate allele will not. See, e.g., Fig. 3. Thus, the SOI alleles most likely to have selective pressure are those for which the LDD differs significantly from the average LDD and from the LDD of the alternate allele. Thus, in some embodiments, a further selection criterion may be that the alternate allele be no more than, e.g., about 3, 2, 1.5, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1 or fewer SD away from the Av-ALnLH value. However, this third statistical comparison is optional and is not necessary to the method. It provides a means to, e.g., eliminate false positives.

[0048] One of ordinary skill in the art will appreciate that for any probabilistic model, e.g., a sigmoidal pattern with a Gaussian additive, goodness of fit of observed LDD with the PSML can be measured with statistical tests other than or in addition to ALnLH, e.g., data likelihood, log likelihood, as well as parameter posterior distribution, or model parameter estimation.

[0049] In some embodiments, a statistical threshold is set for detection of SOI alleles considered to have a high probability of selective pressure, e.g., recent selective pressure (about 50,000 years ago or less). For example, the detection threshold can be set using the LDD of an established positively selected SNP allele, e.g., the G6PD V202M allele, as a "test case" for significant deviation from the average LDD, as would be expected under recent positive selection. Accordingly, if the ALnLH of the test case allele falls 2.6 standard deviations (SD) away from the Av-ALnLH, then 2.6 SD would be set as a detection threshold for determining an SOI allele to have a high probability of selective pressure. An additional, optional, statistical threshold for evaluating the significance of an SOI allele can be set based on the deviation of the alternate allele ALnLH from the Av-ALnLH. For example, if the ALnLH of an SOI allele is >2.6 SD away from the Av-ALnLH, but the ALnLH of the alternate SOI allele is less than 1 SD from the Av-ALnLH value, then there is a high probability that the SOI allele was under selective pressure. Conversely, if the ALnLH of the SOI alternate allele falls greater than 1 SD away from the Av-ALnLH, it is less probable that the SOI allele is under selective pressure. Instead, this scenario would be more likely explained by a slower local recombination rate, i.e., a factor that, unlike selective pressure, would equally affect both SOI alleles. These two deviations may be combined in a single metric, e.g., by computing their difference or their ratio.
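
By way of non-limiting illustration only, the second and third statistical comparisons can be sketched as follows. The sketch expresses each allele's ALnLH as a number of standard deviations from the Av-ALnLH of the dataset and then applies the exemplary 2.6 SD and 1 SD thresholds. The use of the population standard deviation, the use of absolute deviations, the treatment of a deviation of exactly 1 SD as passing, and the function names are assumptions made for the example.

```python
import statistics
from typing import Sequence


def deviation_in_sd(score: float, all_scores: Sequence[float]) -> float:
    """Distance of one ALnLH from the dataset average, in SD units."""
    average = statistics.mean(all_scores)   # the Av-ALnLH
    spread = statistics.pstdev(all_scores)  # population SD (an assumption)
    if spread == 0:
        raise ValueError("all scores are identical; SD is zero")
    return (score - average) / spread


def likely_selected(soi_sd: float, alt_sd: float,
                    soi_threshold: float = 2.6,
                    alt_threshold: float = 1.0) -> bool:
    """True when the SOI allele is far from the average but the alternate
    allele is not, which argues against a merely slow local recombination
    rate. Boundary handling (<=) is an assumption of this sketch."""
    return abs(soi_sd) > soi_threshold and abs(alt_sd) <= alt_threshold


def combined_metric(soi_sd: float, alt_sd: float) -> float:
    """One possible single metric: the difference of the two deviations."""
    return abs(soi_sd) - abs(alt_sd)


if __name__ == "__main__":
    print(likely_selected(3.1, 0.4))   # SOI allele deviates, alternate does not
    print(likely_selected(3.1, 1.8))   # both deviate: slow recombination suspected
    print(combined_metric(3.1, 0.4))
```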

[0050] While it is important to set a threshold high enough to exclude spurious positive results, it should not be so stringent that minor deviations from the PSML are excluded. Accordingly, in some embodiments the statistical threshold of the deviation of the SOI allele ALnLH from the Av-ALnLH is set lower, e.g., at 2.5 SD, 2.4 SD, 2.2 SD, 1.8 SD, 1.5 SD, 1.3 SD, 1.2 SD, or 1 SD away from the Av-ALnLH.

[0051] In some embodiments, the statistical threshold for deviation of the SOI allele ALnLH from the Av-ALnLH is set without reference to a particular test case allele (e.g., the G6PD V202M allele). Any desired threshold may be used, based on, e.g., the degree of selectivity desired. By way of example only, the SOI allele ALnLH SD threshold can be set at a pre-determined value of, e.g., about 6, 5, 4, 3.5, 3.4, 3.3, 3.2, 3.1, 3.0, 2.9, 2.8, 2.7, 2.6, 2.5, 2.4, 2.3, 2.2, 2.1, 2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, or less than 0.5 SD away from the Av-ALnLH value.

[0052] In some embodiments, the methods described herein include characterizing, using a computer-automated system, all of the SNPs in an available dataset (e.g., the Perlegen genome-wide SNP set) or a portion thereof (e.g., any portion less than 100%) and storing the results in a computer database, using the methods described herein. The SNP loci to be analyzed can be selected based on one or more criteria, e.g., by minor allele frequency, map position, physical position, or specific population.

[0053] In one embodiment, the selected SNP alleles are assigned to various categories based on statistical thresholds for the deviation of an SOI allele ALnLH from the Av-ALnLH and deviation of the SOI alternate allele ALnLH from the Av-ALnLH as described herein. For example, the SNP alleles to be characterized can be analyzed and selected or not selected for inclusion in a set of SNP alleles based on a large deviation of the allele ALnLH from the Av-ALnLH, e.g., an ALnLH of the SOI allele that is > 2.6 SD from the Av-ALnLH. Alternatively, the SNP alleles to be characterized can be subdivided into more than two categories based on ranges of ALnLH deviation from the Av-ALnLH.

[0054] The use of a threshold value for including or not including SNPs in a subset of SNPs is merely one way of characterizing the SNPs in terms of their likelihood of selective pressure. In another embodiment, some or all of the SNPs selected for analysis are given a weight based on a continuum from the highest deviation of ALnLH from Av-ALnLH (highest probability of selective pressure) to the lowest deviation of ALnLH from Av-ALnLH (lowest probability of selective pressure). For example, each SNP can be given a weight from 0 (no deviation from Av-ALnLH) to 1 (highest deviation from Av-ALnLH).

[0055] In some embodiments, SNPs selected as described herein can be used to query a SNP database, e.g., the National Center for Biotechnology Information SNP Database, which can be accessed on the world wide web at ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp, to identify genes found within a predetermined distance of each SNP, e.g., within about 20 kb (e.g., within about 30, 40, 60, 80, 100, 120, 150, 200, 300, or 500 kb) of any of the selected SNPs ("proximal genes").

[0056] For example, the SNP allele selection criterion can be set such that the selection threshold for an SNP allele ALnLH deviation is 2.6 SD from Av-ALnLH, where the Av-ALnLH is based on the entire Perlegen SNP dataset or HapMap SNP dataset (or a subset of the HapMap SNP dataset of data from individuals of European, Han Chinese, Japanese, or African ancestry). These selection criteria identify 25,386 SNPs from the Perlegen SNP set for which one allele had an ALnLH score greater than 2.6 SD away from the Av-LnLH for the entire Perlegen dataset (1.6 million SNPs total) and the alternate allele had a score less than or equal to 1 SD from the Av-LnLH (Table 1; in some cases the SNP does not have a chromosome annotation). This set may be further analyzed to produce a subset of 11,330 SNPs, based on proximity to a gene within a 100 Kb radius (Table 2). For many SNPs in the list, a known (annotated) gene is listed along with its Gene Accession No. For others, where no Gene Symbol is listed, the implication is that a locus within the 100 Kb-proximal region has been "detected" as a likely gene (e.g., with a gene/sequence search algorithm), but either has not been annotated or has not been confirmed as a gene in the current databases. Similarly, Table 3 is a list of 8,903 SNPs specific to the HapMap European Ancestry individuals (CEU), identified using the same criteria as for Table 1, and Table 4 is a subset of 8,317 SNPs from Table 3 that fall within 100 Kb of a gene; Tables 5 and 6 list 8,386 SNPs specific to the HapMap Han Chinese individuals (CHB) and 8,249 in the ± 100 Kb set, respectively; Tables 7 and 8 list 10,000 SNPs specific to the HapMap Japanese individuals (JPT) and 9,846 in the ± 100 Kb set, respectively; and Tables 9 and 10 list 9,165 SNPs specific to the HapMap African (Yoruba) ancestry individuals and 8,983 in the ± 100 Kb set, respectively (e.g., the Reticulon gene; see Fig. 3).

[0057] In addition, Table 11 is an illustrative set of 500 SNPs from the latest HapMap release (4 million SNP set) that has been selected by the 2.6 SD criteria described above. It shows ALnLH scores for the major and minor allele of each SNP. Also listed for each of these is an identified gene in the ± 100 Kb region. Table 12 is a list of 1,092 SNPs found in both the Perlegen and HapMap SNP sets. All of these fall within ± 100 Kb of a gene. Of those, 123 are known/annotated genes (listed in one of the columns). The invention encompasses methods and compositions utilizing the sets of SNPs and genes listed in any or all of Tables 1-12.
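The identification of "proximal genes" within a predetermined distance of a selected SNP can be sketched as a simple interval test. The following Python sketch assumes gene coordinates are already available locally as dictionaries; the field names (`chrom`, `start`, `end`, `symbol`) and the function name are hypothetical, and no actual database query is shown.

```python
def proximal_genes(snp, genes, window_kb=100):
    """Return symbols of genes lying within window_kb of the SNP position.

    snp:   dict with hypothetical keys "chrom" and "pos" (base pairs)
    genes: iterable of dicts with hypothetical keys "chrom", "start", "end", "symbol"
    """
    window = window_kb * 1000
    hits = []
    for gene in genes:
        if gene["chrom"] != snp["chrom"]:
            continue
        # The SNP is proximal if it falls inside the gene interval
        # extended by the window on both sides.
        if gene["start"] - window <= snp["pos"] <= gene["end"] + window:
            hits.append(gene["symbol"])
    return hits
```

Changing `window_kb` to, e.g., 20 or 500 gives the other distances mentioned in the text.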

[0058] In some embodiments, genes near which many SNPs are clustered are analyzed for overrepresentation using an algorithm such as the one employed in, e.g., the EXPRESSION ANALYSIS SYSTEMATIC EXPLORER (EASE) package. See Hosack et al. (2003), Genome Biol. 4(10):R70. EASE uses a robust version of the Fisher Exact Probability Test. It provides a probability for obtaining x number of genes in category y in a list of identified SNP-proximal genes as compared to obtaining randomly the same number of genes in category y from the whole human genome. The genes represented in the gene sets thus selected, e.g., the Perlegen gene set and the HapMap set (approximately 1800 genes/set), each represent about 8% of the total human genes. Thus, if the gene set is representative of genes that are particularly likely to be implicated in various disease or pathological states, then genes found in association studies of disease or pathological states should be overrepresented in these sets, i.e., represented at a rate greater than by chance. In two recent association studies, several SNPs identified by genotyping on the Affymetrix 100K "SNP Chip" showed strong association with macular degeneration and obesity, respectively. The two identified candidate genes, complement factor H (CFH) and insulin-induced gene 2 (INSIG2), were also in strong LD with several of the SNP markers identified by the methods described herein. These markers include, e.g., rs10801551 identified within the Perlegen set (see Fig. 9), and rs11684454 identified in the CEU HapMap set (see Fig. 10). Given that each set is a selection of 8% of human genes, the odds of identifying both CFH and INSIG2 by chance are approximately 8% x 8% = 0.64%, i.e., less than 1%. Thus, the subsets of SNPs identified by the methods of the invention are enriched for those proximal to genes implicated in disease processes.
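The overrepresentation reasoning above can be made concrete with a short Python sketch. It computes a plain one-sided hypergeometric tail probability (the quantity underlying a one-sided Fisher exact test) and the 8% x 8% chance estimate from the text. This is not the adjusted statistic used by the EASE package; the function names are illustrative assumptions.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of at least k category genes in a list of n genes
    drawn without replacement from N genes, K of which are in the category."""
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(n, K) + 1))
    return favorable / total

def chance_of_both(fraction_a, fraction_b):
    """Chance that two independent candidate genes both fall in their sets."""
    return fraction_a * fraction_b

# Each gene set covers about 8% of human genes, so the chance of two
# independently chosen genes both landing in their sets is 0.08 * 0.08.
both = chance_of_both(0.08, 0.08)
```

A small tail probability indicates that a category is represented in the SNP-proximal gene list more often than random sampling from the genome would predict.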

II. Sets of SNPs

[0059] It will be appreciated that the methods of characterizing SNPs provided herein may be used to provide sets of SNPs that are useful in a variety of association, diagnostic, prognostic, treatment, and other uses. Such sets of SNPs may be selected with the aid of a computer, and/or stored on a computer database. Compositions, e.g., probe arrays, containing all or a portion of such sets and subsets of SNPs are also provided by the invention, and find use in research, clinical, diagnostic, prognostic, treatment, and other uses. Kits containing such compositions are further provided by the invention.

[0060] For example, in some embodiments, the invention provides a set of SNPs where one or more of the SNP alleles are weighted by a numerical value that indicates a probability of selective pressure on the one or more SNP alleles; the set may be stored on a computer database. The numerical value may be determined, e.g., by any of the methods described herein. In some embodiments, substantially all the SNPs in the set are assigned a numerical value. For example, some or all of the SNPs in the set may be weighted by a numerical value determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, where the analyzing is performed on a genotype that is homozygous for the major or minor allele. Such a set may be stored as a database, for example, as an electronic database stored on a computer.
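One way to attach a numerical weight to each SNP allele in such a set is sketched below in Python. The specification describes a weight running from 0 (no deviation from Av-ALnLH) to 1 (highest deviation); the linear scaling by the largest absolute deviation used here is an assumption made for illustration, as is the function name.

```python
def selection_weights(deviation_by_allele):
    """Scale absolute ALnLH deviations to weights between 0 and 1.

    deviation_by_allele maps an allele identifier to its deviation from
    the dataset average (e.g., in SD units). Linear scaling is an assumption.
    """
    largest = max((abs(d) for d in deviation_by_allele.values()), default=0.0)
    if largest == 0:
        return {allele: 0.0 for allele in deviation_by_allele}
    return {allele: abs(d) / largest
            for allele, d in deviation_by_allele.items()}
```

The resulting dictionary can be stored alongside the SNP identifiers in a database record, so that every allele in the set carries its own weight.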

[0061] In some embodiments, the invention provides a subset of SNP alleles selected from a larger set of SNP alleles, wherein the subset of SNP alleles is selected from the larger set based on numerical values assigned to the SNP alleles in the larger set, wherein said numerical values are related to the degree of selective pressure on each of the SNP alleles. In some embodiments, said assigning is performed at least in part using a computer. The numerical value may be determined, e.g., by any of the methods described herein. In some embodiments, the numerical values are determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each SNP allele in the subset, wherein the analyzing is performed on a genotype that is homozygous for the major or minor allele. For example, the subset may comprise an allele having an ALnLH value that is greater than 2.6 standard deviations away from the Av-ALnLH value of the entire Perlegen SNP allele dataset or the Av-ALnLH of the entire HapMap SNP allele dataset. An exemplary subset of 25,386 SNPs from the Perlegen SNP set, for which one allele had an ALnLH score greater than 2.6 SD away from the Av-LnLH for the entire Perlegen dataset (1.6 million SNPs total) and the alternate allele had a score less than or equal to 1 SD from the Av-LnLH, is provided in Table 1 (provided as a file encoded on a CD attached herewith), which lists RefSNP (rs) ID numbers and the chromosome number location for each of the subset of SNPs. The sequence and other information for any rs-identified SNP can be accessed on the world wide web through the SNP database of the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/; pulldown menu = "SNP").

[0062] In some embodiments, the subset contains an SNP allele that is found within at least 100 kb of a gene selected from the group consisting of IL1RAPL2, FOLH1, KIAA1463, PRRG1, S66645/TYR, TTBK2, AK096379, BC034574, AUTS2, BX537851, COH1, COL4A6, CSEN, GALK2, GLRA2, HCN1, OPHN1, OR4A5, OR4C13, OR4X1, OR5AS1, OR8I2, OR8K1, PROM2, RAPSN, RTN1, SNTG1, PSMC3, SERPINC1, ADK, AF035029, AF035035, AP3M1, CSMD3, D63480, F8, FLJ42925, LRBA, SCAP1, SPI1, SUPT3H, NCOA6, AK091585, AK131264, AK131417, AK097440, AK131413, ECM2, POTE8, RNF18, STS, SLC39A13, ZBTB37, ZNF192, ZNF193, BRODL, CSE1L, FANCC, PARP4, PNUTL2, TP53INP2, AF318371, BC046415, CPEB3, DJ467N11.1, HGNT-IV-H, HKR1, KHDRBS2, MGC46496, NT5C2, RAD51C, ZCWCC2, ZNF2, ZNF322A, ZNF37A, ZNF514, ZNF569, ZNF570, C15orf16, C6.1A, MAST2, PPM1E, UBR1, AB037807/CCM1, FLJ14442, HDHD1A, IMMP2L, LOC220594, MTMR4, TEX14, AF318346, AKAP9, CDC91L1, FUNDC2, MAGED1, SFI1, AB058732, AF116680, AK090675, AK125992, ASCC3, BC027488, C10orf68, CPNE8, FAM46D, FAM47A, FLJ32191, FLJ33979, FLJ46156, GBA, GPR23, HPSE2, KIAA0377, LOC492307, LOC51057, MAGEB6, MGC12197, MGC35232, MTCH2, OCIAD1, RABGAP1L, SH3BGRL, VCL.

[0063] In some of these embodiments, the subset contains at least about 10 SNP alleles found within 100 kb of the gene. In some of these embodiments, the subset contains at least about 500 SNP alleles found within 100 kb of the gene. In some of these embodiments, the subset contains at least about 1500 SNP alleles found within 100 kb of the gene. In some of these embodiments, at least about 10% of the alleles in the subset are found within 100 kb of the gene. In some of these embodiments, at least about 50% of the alleles in the subset are found within 100 kb of the gene. In some of these embodiments, substantially all of the alleles in the subset are found within 100 kb of the gene.

[0064] Further subsets of genes include those listed in Tables 1-12 as described above. These subsets of SNPs and genes associated with SNPs are exemplary only, and represent subsets found by setting the selection criteria at certain values, e.g., 2.6 SD from Av-ALnLH and/or 100 kb from a SNP selected by this criterion. The invention includes sets and subsets of SNPs and genes selected by setting the selection criteria at any suitable value, as determined by the skilled practitioner, to produce a desired degree of selectivity and/or certainty regarding the subset. Exemplary values of selection criteria are described herein, and it will be appreciated that combinations of values can produce different subsets, depending on the desired selectivity. Thus, for example, a subset of SNPs may be selected from a set of SNPs by setting the selection threshold at, e.g., about 6, 5, 4, 3.5, 3.4, 3.3, 3.2, 3.1, 3.0, 2.9, 2.8, 2.7, 2.6, 2.5, 2.4, 2.3, 2.2, 2.1, 2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, or less than 0.5 SD away from the Av-ALnLH value. To determine a set of genes, the subset of SNPs may then be used to examine a genome to identify genes found within a predetermined distance of each SNP, e.g., within about 20 kb (e.g., within about 30, 40, 60, 80, 100, 120, 150, 200, 300, or 500 kb) of any of the selected SNPs, and these genes may be included in a set of genes.

III. Association Methods

[0065] The polymorphic profile for SNPs selected by the methods described herein can be used in association studies. That is, by determining polymorphic individuals in a population of individuals, each of whom has been characterized for the presence or absence of one or more phenotypic traits, one can determine which polymorphic forms, alone or in combination, are correlated with the trait. Alternatively, once a correlation of traits with polymorphic forms has been performed, determination of a polymorphic profile in an individual can be used to predict the presence of a phenotype or the likelihood that a phenotype will develop.

[0066] In some embodiments, the methods described herein include determining the predictive or diagnostic value of an SNP allele, characterized or selected as described herein, for a biological/phenotypic state of interest. The predictive value of the presence of a particular SNP allele for a phenotypic state is commonly referred to as its "association" with the phenotypic state.

[0067] The methods of characterizing SNPs of the invention allow selection of subsets of SNPs that are more likely to be biologically significant in association studies. Alternatively, the methods allow weighting of SNPs according to their likely biological significance. In either case, the characterization of SNPs by the methods of the invention allows a greater degree of selectivity in association studies by de-emphasizing SNPs that are unlikely to have functional significance and emphasizing SNPs that are likely to have functional significance. In this way, differences between cases and controls in allele frequency can be more selectively screened, and false positives can be reduced, so that a small but nonetheless significant difference in allele frequency between cases and controls can be identified that would otherwise be lost in the "noise" of false positive results. Additionally, sets of SNP alleles selected or characterized by the methods of the invention may be examined en masse and differences in patterns discerned between cases and controls. The process may be refined in repeated iterations as alleles are identified as highly significant in different studies.

[0068] An association study involves determining the frequency of the SNP allele in a number of subjects (e.g., at least about 10, 20, 30, 40, 50, 75, 100, 200, 300, 400, 500, 1000, 2000, 3000, 5000, or more than 5000 subjects) with the phenotypic state of interest ("cases"), and preferably an equal number of controls of similar age and race ("controls"). Association studies may be performed in populations that are homogeneous or largely homogeneous for one or more characteristics, e.g., for location of ancestral geographical origin. The latter is especially useful in that groups from various ancestral locations, e.g., European, African, and Asian, are identified in some SNP databases, e.g., the HapMap database, and such groups also have different allele frequency profiles for some alleles. Multiple association studies may be conducted on such different homogeneous or largely homogeneous populations, and the results compared to determine differences among the groups.

[0069] Significant associations between particular SNPs and one or more phenotypes can be determined by standard statistical methods. See, e.g., Balding (2006), Nat. Rev. Genet., 7(10):781-791; Carleton et al. (2006), Hum. Genomics, 2(6):391-402; Fornage et al. (2005), Methods Mol. Med., 108:159-72.

[0070] In some embodiments, association studies are performed to determine a subset of SNPs with predictive value for one or more phenotypes in a set of SNPs by (i) determining on case and control samples the relative frequency of one or more alleles of one or more SNPs, where the SNPs are SNPs determined to have a major or minor allele that has a high probability of having been subjected to selective pressure; (ii) for each SNP for which a frequency is determined in step (i), comparing the frequency of the occurrence of the SNP in the case population and in the control population; and (iii) selecting for inclusion in the subset of SNPs those SNPs for which a major or minor allele frequency estimate in the case samples differs from the major or minor allele frequency in the control samples by a predetermined threshold. In some embodiments, the threshold is some multiple of the standard deviation of the mean for the samples, e.g., at least about 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, or more than 3.0 standard deviations different. In some embodiments, the threshold is a percentage difference between the mean of the control and the mean of the case sample, e.g., at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 80, or 90% different. It will be appreciated that the more stringently the SNPs used in the association study have been selected, the more likely it is that a lower value for a difference between case and control samples will select for those SNPs that have a predictive value; e.g., the use of SNPs that exhibit a high probability of selective pressure may allow a lower threshold to be selected for the difference between case and control samples that yields a subset of SNPs with a high degree of predictiveness for the phenotype(s) of interest. In some embodiments, said method is carried out at least in part using a computer. In some embodiments, the SNPs are SNPs determined to have a major or minor allele that has been selected by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, where the analyzing is performed on a genotype that is selected for analysis on the basis of homozygosity for the major or minor allele.
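Steps (i) through (iii) above can be sketched in Python as follows. The sketch assumes that allele counts have already been tallied for case and control samples; the dictionary keys, the function names, and the use of an absolute difference in allele frequency as the threshold are illustrative assumptions, and no significance test is included.

```python
def allele_frequency(allele_count, total_alleles):
    """Relative frequency of one allele among all alleles genotyped."""
    return allele_count / total_alleles

def select_by_frequency_difference(counts, min_difference=0.10):
    """Return SNP identifiers whose case and control allele frequencies
    differ by at least min_difference (expressed as a fraction).

    counts maps a SNP identifier to a dict with hypothetical keys
    "case_allele", "case_total", "control_allele", "control_total".
    """
    selected = []
    for snp_id, c in counts.items():
        f_case = allele_frequency(c["case_allele"], c["case_total"])        # step (i)
        f_control = allele_frequency(c["control_allele"], c["control_total"])
        if abs(f_case - f_control) >= min_difference:                       # steps (ii)-(iii)
            selected.append(snp_id)
    return selected
```

Lowering `min_difference` corresponds to the lower thresholds that, per the text, may be usable when the input SNPs have been stringently preselected.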

[0071] In other embodiments, the invention provides methods of determining a SNP's predictive value for a phenotype, where the SNP is part of a set of SNPs, each of which has been assigned a numerical value based on the probability of selective pressure on that SNP, by (i) determining on case and control samples the relative frequency of the major and minor allele of the SNP and comparing the frequency of the occurrence of the major and minor allele of the SNP in the case population and in the control population; and (ii) combining the results of step (i) with the numerical value for the SNP to assign to the SNP a predictive value for the phenotype. The numerical value may be assigned by any of the methods described herein, e.g., by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, where the analyzing is performed on a genotype that is selected for analysis on the basis of homozygosity for the major or minor allele.

[0072] In the latter embodiments, the predictive value of the SNP for the phenotype(s) will be highest when both the numerical value based on the probability of selective pressure on that SNP and the difference in allele frequency between cases and controls are highest. For example, where a SNP has a numerical value based on the probability of selective pressure on that SNP of 0.99 (with 1 being highest probability), and where case and control samples differ by 90% in frequency of an allele for that SNP, the SNP will be expected to have an extremely high predictive value for the phenotype. The opposite is true for SNPs with both a low probability of selective pressure on that SNP and a small difference in allele frequency between cases and controls; such a SNP would be expected to have a low predictive value for the phenotype(s). The method can be used to select SNPs for inclusion in a subset of predictive SNPs, based on their expected predictive value (e.g., SNPs of greatest predictive value are included in the set and those of least predictive value are excluded). Alternatively, all or substantially all of the SNPs examined by the method may be ranked based on their expected predictive value. It will be appreciated that such rankings may be compared to the actual predictive value in populations on whom the predictions are tested, and the method refined based on the outcome of such studies.
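The combination of a selective-pressure value with a case/control frequency difference, followed by ranking, can be sketched in Python. The specification does not state how the two quantities are combined; the simple product used below is an assumption chosen only because it is highest when both inputs are high and low when both are low. All names are hypothetical.

```python
def expected_predictive_value(selection_value, frequency_difference):
    """Combine a 0-1 selective-pressure value with a 0-1 frequency difference.

    Using the product is an illustrative assumption, not the stated method.
    """
    return selection_value * frequency_difference

def rank_snps(snps):
    """Rank SNP identifiers from highest to lowest expected predictive value.

    snps maps a SNP identifier to a (selection_value, frequency_difference) pair.
    """
    return sorted(snps,
                  key=lambda snp_id: expected_predictive_value(*snps[snp_id]),
                  reverse=True)
```

A subset of predictive SNPs can then be taken from the top of the ranking, and the ranking compared against observed predictive value in test populations.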

[0073] In some embodiments, patterns of SNP alleles may be studied for their association with a phenotypic state. Bioinformatics systems are utilized to identify the differences in the SNP allele patterns in the case and control samples. Such techniques may be preceded by various data cleanup steps. Patterns are composed of the allele identity for a plurality of SNPs characterized by the methods of the invention, the collective profile of which is more important than the presence or absence of any specific allele. A simplified exemplary pattern might be as follows: for SNPs 1, 2, and 3, cases tend to have allele A for SNP 1 at higher frequency than controls. Conversely, less or no difference is observed between cases and controls for allele B of SNP 2. By contrast, for SNP 3, allele C tends to be at lower frequency in cases than in controls. In this particular example, allele B of SNP 2 is not informative, while allele A of SNP 1 tends to occur in cases, and allele C of SNP 3 tends to occur in controls. Of course, the patterns described here are greatly simplified, and there will be much more complex patterns in actual practice, such as tens, hundreds, or thousands of such differences between cases and controls for various SNPs characterized as described herein. Automated systems will generally be applied in the identification of the patterns that distinguish cases and controls. The "preselection" of SNPs, or the weighting of SNPs, by the methods described herein, coupled with measurement of patterns of multiple allele frequencies at multiple SNPs, enables the identification of subtle differences in biological state and makes the identification of that state more robust and less subject to biological noise.

Examples of Phenotypic States for Use in Association Studies

[0074] Case samples are obtained from individuals with a particular phenotypic state of interest, or a set of such states (also referred to as a "biological state" herein). Examples of phenotypic states include phenotypes resulting from pathology, disease, aging, injury, an altered environment, drug treatment, genetic manipulations or mutations, change in diet, or any other characteristic(s) of a single organism or a class or subclass of organisms.

[0075] In some embodiments, a phenotypic state of interest is a subject's response to a particular pharmaceutical agent. Clinical trials have shown that patient response to treatment with pharmaceuticals is often heterogeneous. Thus the SNPs selected or characterized as described herein can be used to help identify patients most suited to therapy with particular pharmaceutical agents. Pharmacogenomics can also be used in pharmaceutical research to assist the drug selection process (Linder et al. (1997), Clinical Chemistry, 43, 254; Marshall (1997), Nature Biotechnology, 15, 1249; International Patent Application WO 97/40462, Spectra Biomedical; and Schafer et al. (1998), Nature Biotechnology, 16(1):33-39).

[0076] In some embodiments, a phenotypic state of interest is a clinically recognized disease state. Such disease states include, for example, neurodegenerative disease (e.g., Alzheimer's disease), psychiatric disease (e.g., schizophrenia), cancer, cardiovascular disease, metabolic disease, inflammatory disease, and infectious disease. Control samples are obtained from individuals who do not exhibit the phenotypic state of interest or disease state (e.g., an individual who is not affected by a disease or who does not experience negative side effects in response to a given drug).

[0077] Examples of neurodegenerative disease phenotypes include, but are not limited to, Alzheimer's disease, Huntington's disease, amyotrophic lateral sclerosis, HIV-associated dementia, multiple sclerosis, Parkinson disease, Pick's disease, corticobasal degeneration, Lewy body dementia, spinocerebellar ataxia, and spinal muscular atrophy.

[0078] Examples of psychiatric disease phenotypes include, but are not limited to, attention deficit disorder, clinical depression, bipolar disorder, schizophrenia, obsessive-compulsive disorder, anxiety, and insomnia.

[0079] Examples of cancer phenotypes include, but are not limited to, breast cancer, skin cancer, bone cancer, prostate cancer, liver cancer, lung cancer, brain cancer, cancer of the larynx, gallbladder, pancreas, rectum, parathyroid, thyroid, adrenal, neural tissue, head and neck, colon, stomach, bronchi, kidneys, basal cell carcinoma, squamous cell carcinoma of both ulcerating and papillary type, metastatic skin carcinoma, osteosarcoma, Ewing's sarcoma, reticulum cell sarcoma, myeloma, giant cell tumor, small-cell lung tumor, gallstones, islet cell tumor, primary brain tumor, acute and chronic lymphocytic and granulocytic tumors, hairy-cell tumor, adenoma, hyperplasia, medullary carcinoma, pheochromocytoma, intestinal ganglioneuromas, hyperplastic corneal nerve tumor, marfanoid habitus tumor, Wilms' tumor, seminoma, ovarian tumor, leiomyomater tumor, cervical dysplasia and in situ carcinoma, neuroblastoma, retinoblastoma, soft tissue sarcoma, malignant carcinoid, topical skin lesion, mycosis fungoides, rhabdomyosarcoma, Kaposi's sarcoma, osteogenic and other sarcoma, malignant hypercalcemia, renal cell tumor, polycythemia vera, adenocarcinoma, glioblastoma multiforme, leukemias, lymphomas, malignant melanomas, epidermoid carcinomas, and other carcinomas and sarcomas.

[0080] Examples of cardiovascular disease phenotypes include, but are not limited to, congestive heart failure, high blood pressure, arrhythmias, cholesterol, Wolff-Parkinson-White Syndrome, long QT syndrome, angina pectoris, tachycardia, bradycardia, atrial fibrillation, ventricular fibrillation, myocardial ischemia, myocardial infarction, cardiac tamponade, myocarditis, pericarditis, arrhythmogenic right ventricular dysplasia, hypertrophic cardiomyopathy, Williams syndrome, heart valve diseases, bacterial endocarditis, pulmonary atresia, aortic valve stenosis, Raynaud's disease, cholesterol embolism, Wallenberg syndrome, Hippel-Lindau disease, and telangiectasis.

[0081] Examples of metabolic disease include, but are not limited to, obesity, appetite disorders, overweight, cellulite, Type I and Type II diabetes, hyperglycemia, dyslipidemia, steatohepatitis, liver steatosis, non-alcoholic steatohepatitis, Syndrome X, insulin resistance, diabetic dyslipidemia, anorexia, bulimia, anorexia nervosa, hyperlipidemia, hypertriglyceridemia, atherosclerosis, or arteriosclerosis.

[0082] Examples of inflammatory disease include, but are not limited to, rheumatoid arthritis, nonspecific arthritis, inflammatory disease of the larynx, inflammatory bowel disorder, pelvic inflammatory disease, inflammatory disease of the central nervous system, temporal arteritis, polymyalgia rheumatica, ankylosing spondylitis, polyarteritis nodosa, Reiter's syndrome, scleroderma, and systemic lupus erythematosus.

[0083] Examples of infectious disease include, but are not limited to, AIDS, hepatitis C, SARS, tuberculosis, sexually transmitted diseases, leprosy, Lyme disease, malaria, measles, meningitis, mononucleosis, whooping cough, yellow fever, tetanus, arboviral encephalitis, and other bacterial, viral, fungal or helminthic diseases.

Databases

[0084] Also included herein are databases containing information concerning one or more SNP alleles, including one or more numerical values that characterize the SNP according to any of the methods described herein. Such numerical values include, e.g., inferred FRC versus distance from the SNP locus (i.e., LDD), ALnLH for the SNP, its deviation from the Av-ALnLH, or any other statistical measure of LDD deviation of the SNP allele from an SNP dataset average of LDD (e.g., a genome-wide average of LDD). In some embodiments, the databases can include patterns of SNP allele occurrence (i.e., an SNP profile) associated with one or more phenotypic states. Databases may also contain information associated with a given variation, such as descriptive information about the general genomic region in which the variation occurs, e.g., whether the variation is located in or in proximity to a known gene (e.g., within at least 100 kb of a known gene).

[0085] Other information that may be included in the databases of the present invention includes, but is not limited to, SNP sequence information, descriptive information concerning the clinical status of a tissue sample analyzed for SNP profiles, or the clinical status of the patient from which the sample was derived. The database may be designed to include different parts, for instance a variation database, an SNP database, an SNP LDD pattern database and an informative SNP database, e.g., a database associating with each SNP record the probability that an allele of the SNP was subjected to selective pressure, and including, where determined, the predictive or diagnostic value of one or more SNP alleles for a particular phenotypic state. Methods for the configuration and construction of databases are widely available; for instance, see Akerblom et al., (1999) U.S. Pat. No. 5,953,727.
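A database of the kind described in paragraphs [0084] and [0085] might be laid out as in the following Python sketch, which uses the standard-library `sqlite3` module. The table and column names are hypothetical, and the three-table split (SNP records, per-allele scores, phenotype associations) is only one possible arrangement of the "different parts" mentioned above.

```python
import sqlite3

def build_snp_database(path=":memory:"):
    """Create an illustrative SNP database; all table/column names are assumptions."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE snp (
            rs_id    TEXT PRIMARY KEY,
            chrom    TEXT,
            position INTEGER
        );
        CREATE TABLE allele_score (
            rs_id            TEXT REFERENCES snp(rs_id),
            allele           TEXT,
            alnlh            REAL,  -- ALnLH for the allele
            sd_from_average  REAL,  -- deviation from Av-ALnLH, in SD
            selection_weight REAL,  -- probability-like weight, 0 to 1
            PRIMARY KEY (rs_id, allele)
        );
        CREATE TABLE association (
            rs_id            TEXT REFERENCES snp(rs_id),
            allele           TEXT,
            phenotype        TEXT,
            predictive_value REAL
        );
    """)
    return conn
```

With such a layout, alleles exceeding a chosen threshold can be retrieved with a query such as `SELECT rs_id, allele FROM allele_score WHERE ABS(sd_from_average) > 2.6`.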

[0086] The databases of the invention may be linked to an outside or external database. In preferred embodiments, the database may communicate with outside data sources, such as The SNP Consortium (TSC) or the National Center for Biotechnology Information, through the internet.

[0087] Any appropriate computer platform may be used to perform the methods for characterizing one or more SOIs by the methods described herein, as well as computing association statistics tests between characterized SOI alleles and phenotypic states of interest. In some embodiments, the computer platform can receive direct input from a database, e.g., one of the databases described herein. For example, a large number of computer workstations are available from a variety of manufacturers, such as those available from Silicon Graphics. Client-server environments, database servers and networks are also widely available and are appropriate platforms for the databases of the invention.

[0088] The databases described herein may also be used to present information identifying a SNP allele profile in an individual and such a presentation may be used to predict or diagnose one or more phenotypic states for that individual. Such methods may be used to predict phenotypic states for an individual. Such phenotypic states include, but are not limited to, phenotypes resulting from an altered environment, drug treatment, genetic manipulations or mutations, injury, change in diet, aging, or any other characteristic(s) of a single organism or a class or subclass of organisms. Further, the databases described herein may comprise information relating to the expression level of one or more of the genes associated with a phenotypic state of interest. In one embodiment, the database includes information relating to the expression level, in an individual, of one or more genes located within 100 kb of an SOI allele characterized by the methods described herein as having an ALnLH greater than 2.6 SD away from a genome-wide ALnLH. 
Accordingly, expression level information for one or more of the following human ortholog genes is included in the database: IL1RAPL2, FOLH1, KIAA1463, PRRG1, S66645/TYR, TTBK2, AK096379, BC034574, AUTS2, BX537851, COH1, COL4A6, CSEN, GALK2, GLRA2, HCN1, OPHN1, OR4A5, OR4C13, OR4X1, OR5AS1, OR8I2, OR8K1, PROM2, RAPSN, RTN1, SNTG1, PSMC3, SERPINC1, ADK, AF035029, AF035035, AP3M1, CSMD3, D63480, F8, FLJ42925, LRBA, SCAP1, SPI1, SUPT3H, NCOA6, AK091585, AK131264, AK131417, AK097440, AK131413, ECM2, POTE8, RNF18, STS, SLC39A13, ZBTB37, ZNF192, ZNF193, BRODL, CSE1L, FANCC, PARP4, PNUTL2, TP53INP2, AF318371, BC046415, CPEB3, DJ467N11.1, HGNT-IV-H, HKR1, KHDRBS2, MGC46496, NT5C2, RAD51C, ZCWCC2, ZNF2, ZNF322A, ZNF37A, ZNF514, ZNF569, ZNF570, C15orf16, C6.1A, MAST2, PPM1E, UBR1, AB037807/CCM1, FLJ14442, HDHD1A, IMMP2L, LOC220594, MTMR4, TEX14, AF318346, AKAP9, CDC91L1, FUNDC2, MAGED1, SFI1, AB058732, AF116680, AK090675, AK125992, ASCC3, BC027488, C10orf68, CPNE8, FAM46D, FAM47A, FLJ32191, FLJ33979, FLJ46156, GBA, GPR23, HPSE2, KIAA0377, LOC492307, LOC51057, MAGEB6, MGC12197, MGC35232, MTCH2, OCIAD1, RABGAP1L, SH3BGRL, VCL.

[0089] The full gene name and sequence information for any of the above genes can be found on the world wide web, e.g., at ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene.

IV. Predictive Methods

[0090] Association studies provide a means to predict whether an individual who possesses a particular allele, or a particular pattern of alleles, is more or less likely to exhibit a particular phenotype. Such predictive value is useful in, e.g., diagnosis, prognosis, prediction of response to an environmental agent (e.g., a drug or treatment), and/or selection of treatment. Accordingly, the invention provides methods of diagnosis, prognosis, treatment selection, and stratification, based on predicted phenotype for an individual inferred from the allele state of one or more SNPs (e.g., patterns of SNPs) as selected or characterized herein. Typically, such a SNP or pattern of SNPs will have been further selected by association studies or other validation processes as described herein or as known in the art. These alleles of SNPs and patterns of alleles of SNPs that reflect and differentiate biological states are utilized in clinically useful formats and in research contexts. Clinical applications include detection of disease; distinguishing disease states to inform prognosis, selection of therapy, and the prediction of therapeutic response; disease staging; identification of disease processes; prediction of efficacy; and prediction of adverse response. Such applications are known in the art; see, e.g., U.S. Patent Nos. 7,135,286 and 6,955,883, and U.S. Patent Application Nos. 20030186244; 20040241657; 20060177847; and 20060228728.

[0091] Thus, in some embodiments, the invention provides a method of determining a diagnosis, prognosis, or status of treatment for an individual, by (i) determining the identity of the alleles for a SNP from a sample obtained from the individual, where each allele of the SNP has been assigned a value based on a quantitative measure of the probability of selective pressure on the SNP; and (ii) determining a diagnosis for the individual based on the identity of the allele or alleles for the SNP. In some embodiments, the identity of the allele or alleles for a plurality of SNPs is determined and the diagnosis, prognosis, or treatment is determined based on the identity of the plurality of alleles. In some embodiments, the SNP is assigned a weighted value based on the quantitative measure of selective pressure on the SNP, and the diagnosis, prognosis, or treatment is based on an analysis of a combination of the weighted value and the identity of the allele or alleles for the SNP. In some embodiments, the identities of the alleles for a plurality of SNPs are determined, where each allele of each of the SNPs has been assigned a value based on a quantitative measure of selective pressure on the SNP, and a diagnosis, prognosis, or treatment is determined based on the identity of the alleles for the plurality of SNPs.
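One way to picture the combination of a weighted value with allele identity is a simple additive score. The following is a minimal sketch only: the additive form, the SnpWeight record, the notion of a single "scored allele" per SNP, and the SNP identifiers are all illustrative assumptions and are not taken from this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpWeight:
    """Hypothetical record pairing an allele with a selection-derived weight."""
    scored_allele: str  # allele assumed, for illustration, to carry the weight
    weight: float       # assumed weight derived from the selective-pressure measure


def weighted_score(genotypes, weights):
    """Sum weight * (copies of the scored allele) over all weighted SNPs.

    genotypes maps a SNP id to a pair of alleles, e.g. ("A", "G").
    SNPs missing from the genotype data are skipped rather than guessed.
    """
    score = 0.0
    for snp_id, w in weights.items():
        alleles = genotypes.get(snp_id)
        if alleles is None:
            continue
        score += w.weight * alleles.count(w.scored_allele)
    return score


if __name__ == "__main__":
    weights = {"snp_a": SnpWeight("A", 2.0), "snp_b": SnpWeight("T", 0.5)}
    genotypes = {"snp_a": ("A", "A"), "snp_b": ("C", "T"), "snp_c": ("G", "G")}
    print(weighted_score(genotypes, weights))
```

A score of this kind would still need to be calibrated against association-study data before it could inform any diagnosis or prognosis.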

[0092] In some embodiments, the invention provides a method of stratifying a population of individuals, where the stratification is based on the likelihood of each individual exhibiting a phenotype, by (i) determining the identity of the alleles for a SNP for each of the individuals, where each allele of the SNP has been assigned a value based on a quantitative measure of selective pressure on the SNP; and (ii) determining the position for each individual in the stratification based on the identity of the allele or alleles for the SNP. In some embodiments, the identities of the alleles for a plurality of SNPs are determined for the individual, where each allele of each of the SNPs has been assigned a value based on a quantitative measure of selective pressure on the SNP, and the position for the individual in the stratification is determined based on the identity of the alleles for the plurality of SNPs. In some embodiments, the phenotype is response to a treatment, for example, administration of a drug. In some embodiments, the response comprises a therapeutic response to administration of the drug. In some embodiments, the response comprises a side effect of the drug, e.g., a negative side effect.
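As an illustration of the stratification step, the sketch below assumes that each individual has already been reduced to a single numeric score derived from allele identities, and places individuals into rank-based strata of near-equal size. The equal-size rule, the function name, and the individual identifiers are assumptions made for the example; the specification does not prescribe how positions in the stratification are computed.

```python
def stratify(scores, n_strata=3):
    """Rank individuals by score and split them into n_strata near-equal groups.

    scores maps an individual id to a numeric score; the result maps each id
    to a stratum index from 0 (lowest scores) to n_strata - 1 (highest).
    Ties are broken by id so the output is deterministic.
    """
    if n_strata < 1:
        raise ValueError("n_strata must be at least 1")
    ranked = sorted(scores, key=lambda ind: (scores[ind], ind))
    n = len(ranked)
    return {ind: (rank * n_strata) // n for rank, ind in enumerate(ranked)}


if __name__ == "__main__":
    example = {"p1": 0.1, "p2": 4.5, "p3": 2.0, "p4": 3.0, "p5": 0.5, "p6": 1.0}
    print(stratify(example, n_strata=3))
```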

[0093] In addition, the invention provides methods of diagnosis, prognosis, and treatment based on detection of protein expression products of one or more genes associated with an allele that has been characterized by assigning a value based on a quantitative measure of selective pressure on the SNP, utilizing any of the methods described herein. Typically such a gene will have been further characterized as associated with a particular phenotype, e.g., disease susceptibility or resistance, or drug response, by one or more association studies as described herein. Proteins are encoded by nucleic acids, including those comprising markers that are correlated to the phenotypes as described herein. See, e.g., Alberts et al. (2002) Molecular Biology of the Cell, 4th Edition Taylor and Francis, Inc., ISBN: 0815332181 ("Alberts"), and Lodish et al. (1999) Molecular Cell Biology 4th Edition W H Freeman & Co, ISBN: 071673706X ("Lodish"). Accordingly, proteins corresponding to genes can be detected as markers, e.g., by detecting different protein isotypes between individuals or populations, or by detecting a differential presence, absence or expression level of such a protein of interest (e.g., expression level of a protein encoded by a gene associated with an allele that has been characterized by assigning a value based on a quantitative measure of selective pressure on the SNP, utilizing any of the methods described herein).

[0094] A variety of protein detection methods are known and can be used to distinguish markers. Protein manipulation and detection methods are well known in the art, including, e.g., those set forth in R. Scopes, Protein Purification, Springer-Verlag, N.Y. (1982); Deutscher, Methods in Enzymology Vol. 182: Guide to Protein Purification, Academic Press, Inc. N.Y. (1990); Sandana (1997) Bioseparation of Proteins, Academic Press, Inc.; Bollag et al. (1996) Protein Methods, 2nd Edition Wiley-Liss, NY; Walker (1996) The Protein Protocols Handbook Humana Press, NJ; Harris and Angal (1990) Protein Purification Applications: A Practical Approach IRL Press at Oxford, Oxford, England; Harris and Angal Protein Purification Methods: A Practical Approach IRL Press at Oxford, Oxford, England; Scopes (1993) Protein Purification: Principles and Practice 3rd Edition Springer Verlag, NY; Janson and Ryden (1998) Protein Purification: Principles, High Resolution Methods and Applications, Second Edition Wiley-VCH, NY; and Walker (1998) Protein Protocols on CD-ROM Humana Press, NJ; and the references cited therein. Additional details regarding protein purification and detection methods can be found in Satinder Ahuja ed., Handbook of Bioseparations, Academic Press (2000).

[0095] "Proteomic" detection methods, which detect many proteins simultaneously, have been described. These can include various multidimensional electrophoresis methods (e.g., 2-D gel electrophoresis), mass spectrometry based methods (e.g., SELDI, MALDI, electrospray, etc.), or surface plasmon resonance methods. For example, in MALDI, a sample is usually mixed with an appropriate matrix, placed on the surface of a probe and examined by laser desorption/ionization. The technique of MALDI is well known in the art. See, e.g., U.S. Pat. No. 5,045,694 (Beavis et al.), U.S. Pat. No. 5,202,561 (Gleissmann et al.), and U.S. Pat. No. 6,111,251 (Hillenkamp). Similarly, for SELDI, a first aliquot is contacted with a solid support-bound (e.g., substrate-bound) adsorbent. A substrate is typically a probe (e.g., a biochip) that can be positioned in an interrogatable relationship with a gas phase ion spectrometer. SELDI is also a well known technique, and has been applied to diagnostic proteomics. See, e.g., Issaq et al. (2003) "SELDI-TOF MS for Diagnostic Proteomics" Analytical Chemistry 75: 149A-155A.

[0096] In general, the above methods can be used to detect different forms (alleles) of proteins and/or can be used to detect different expression levels of the proteins (which can be due to allelic differences) between individuals, families, lines, populations, etc. Differences in expression levels, when controlled for environmental factors, can be indicative of different alleles at a QTL for the gene of interest, even if the encoded differentially expressed proteins are themselves identical. This occurs, for example, where there are multiple allelic forms of a gene in non-coding regions, e.g., regions such as promoters or enhancers that control gene expression. Thus, detection of differential expression levels can be used as a method of detecting allelic differences.

Sample Collection and Processing

[0097] Samples for use with the methods described herein, as well as for use with reagents and compositions described herein, may be collected from a variety of sources in a given patient. Samples collected are preferably bodily fluids such as blood, serum, sputum, saliva, plasma, nipple aspirants, synovial fluids, cerebrospinal fluids, sweat, urine, fecal matter, pancreatic fluid, trabecular fluid, tears, bronchial lavage, swabbings, bronchial aspirants, semen, precervicular fluid, vaginal fluids, pre-ejaculate, etc. In some embodiments, a sample collected is blood, e.g., approximately 1 to 5 ml of blood.

[0098] In some instances, samples may be collected from individuals over a longitudinal period of time (e.g., once a day, once a week, once a month, biannually or annually). Obtaining numerous samples from an individual over a period of time can be used to verify results from earlier detections. Samples can be obtained from humans or non-humans. In a preferred embodiment, samples are obtained from humans.

[0099] The target nucleic acid samples obtained from a patient or subject sample (e.g., a blood sample) can be genomic DNA, RNA or cDNA. Genomic DNA samples are usually subject to amplification before application to an array. An individual genomic DNA segment from the same genomic location as a designated reference sequence can be amplified by using primers flanking the reference sequence. Multiple genomic segments corresponding to multiple reference sequences can be prepared by multiplex amplification including primer pairs flanking each reference sequence in the amplification mix. Alternatively, the entire genome can be amplified using random primers (typically hexamers) (see Barrett et al., Nucleic Acids Research 23, 3488-3492 (1995)) or by fragmentation and reassembly (see, e.g., Stemmer et al., Gene 164, 49-53 (1995)). Genomic DNA can be obtained from virtually any tissue source (other than pure red blood cells). For example, convenient tissue samples include whole blood, semen, saliva, tears, urine, fecal material, sweat, buccal, skin and hair.

[00100] Where an SNP allele, selected as described herein, falls within a genomic region that is transcribed, the presence of the allele can be detected in a target RNA sample. In this case, amplification is typically preceded by reverse transcription. Amplification of all expressed mRNA can be performed as described by WO 96/14839 and WO 97/01603.

[00101] Methods of protein extraction are as described elsewhere herein.

Methods of Amplification

[00102] The PCR method of amplification is described in PCR Technology: Principles and Applications for DNA Amplification (ed. H. A. Erlich, Freeman Press, NY, N.Y., 1992); PCR Protocols: A Guide to Methods and Applications (eds. Innis, et al., Academic Press, San Diego, Calif., 1990); Mattila et al., Nucleic Acids Res. 19, 4967 (1991); Eckert et al., PCR Methods and Applications 1, 17 (1991); PCR (eds. McPherson et al., IRL Press, Oxford); and U.S. Pat. No. 4,683,202. Nucleic acids in a target sample are usually labeled in the course of amplification by inclusion of one or more labeled nucleotides in the amplification mix. Labels can also be attached to amplification products after amplification, e.g., by end-labeling. The amplification product can be RNA or DNA depending on the enzyme and substrates used in the amplification reaction.

[00103] Other suitable amplification methods include the ligase chain reaction (LCR) (see Wu and Wallace, Genomics 4, 560 (1989); Landegren et al., Science 241, 1077 (1988)), transcription amplification (Kwoh et al., Proc. Natl. Acad. Sci. USA 86, 1173 (1989)), self-sustained sequence replication (Guatelli et al., Proc. Natl. Acad. Sci. USA 87, 1874 (1990)), and nucleic acid based sequence amplification (NASBA). The latter two amplification methods involve isothermal reactions based on isothermal transcription, which produce both single stranded RNA (ssRNA) and double stranded DNA (dsDNA) as the amplification products in a ratio of about 30 or 100 to 1, respectively.

V. Compositions

[00104] The compositions described herein include reagents that can specifically detect the presence, in a nucleic acid sample, of the major allele, the minor allele, or both alleles of SNPs characterized by the methods described herein as having a high probability of selective pressure on one of their alleles. In some embodiments, the reagent or composition may detect one or more SNPs from a set or subset of SNPs as described herein.

[00105] In some embodiments, the detection reagents and methods detect one or more SNPs that have a major or minor allele ("A"), where ALnLH(A) is at least about 6, 5, 4, 3.5, 3.4, 3.3, 3.2, 3.1, 3.0, 2.9, 2.8, 2.7, 2.6, 2.5, 2.4, 2.3, 2.2, 2.1, 2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, or less than 0.5 SD away from the Av-ALnLH value; optionally, the alternate allele of A is less than about 3, 2, 1.5, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1 or fewer SD away from the Av-ALnLH value. In one embodiment, the detection reagents are capable of detecting the presence, in a nucleic acid sample, of all SNPs from a genome-wide SNP set (e.g., the Perlegen set or the HapMap set) that meet one of the just-mentioned ALnLH standard deviation thresholds (e.g., ALnLH(A) 2.6 SD away from Av-ALnLH), where the Av-ALnLH is calculated for all SNPs in the genome-wide set. In other embodiments, the detection reagents are capable of detecting the presence, in a nucleic acid sample, of all SNPs from a subset of a genome-wide SNP set, as described herein.
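The standard deviation threshold can be illustrated with a short sketch. It assumes that an ALnLH value has already been computed for each SNP allele by the methods described elsewhere herein, and it takes Av-ALnLH and the SD over whatever set is passed in. Treating "away from" as a two-sided distance and using the population standard deviation are assumptions of this example, as are the function name and the toy identifiers and values.

```python
from statistics import mean, pstdev


def select_by_sd(alnlh, min_sd=2.6):
    """Return ids whose ALnLH lies at least min_sd SDs from the set average.

    alnlh maps a SNP-allele id to its precomputed ALnLH value. The average
    (Av-ALnLH) and SD are computed over all values supplied. Distance is
    measured in either direction here; a one-sided rule would drop abs().
    """
    values = list(alnlh.values())
    if not values:
        return []
    avg = mean(values)
    sd = pstdev(values)  # population SD; sample SD is an equally valid choice
    if sd == 0:
        return []        # no spread, so nothing can exceed the threshold
    return sorted(k for k, v in alnlh.items() if abs(v - avg) / sd >= min_sd)


if __name__ == "__main__":
    toy = {f"snp{i}": 0.0 for i in range(9)}
    toy["snp9"] = 10.0
    print(select_by_sd(toy, min_sd=2.6))
```

In the toy data the average is 1.0 and the SD is 3.0, so only the last entry, at 3.0 SD from the average, passes a 2.6 SD threshold.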

[00106] In some embodiments, the detection reagent is a nucleic acid array containing nucleic acid probes (e.g., oligonucleotides) that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNPs meeting a set of ALnLH deviation criteria as described herein.

[00107] Thus, in some embodiments, the invention provides a nucleic acid array comprising SNP probes, wherein one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNP set selected by the methods described herein. In some embodiments, the one or more probes include at least about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 95% or substantially all of non-redundant probe sequences in the array.

[00108] In some embodiments, the one or more SNP probes include probes that selectively hybridize at high stringency to at least about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 95% or substantially all alleles of the SNPs listed in one or more of Tables 1, 3, 5, 7, 9, or 10. In some embodiments, the invention provides a method of performing an array-based SNP assay, by conducting a nucleic acid array-based assay on a nucleic acid sample from a subject, where the nucleic acid array comprises SNP probes, where one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNP set selected by the methods described herein.

[00109] The invention also provides kits. In one embodiment, the invention provides a kit for use in screening a nucleic acid sample for the presence of the major or minor allele of one or more SNPs, said kit comprising a SNP detection reagent and a control nucleic acid sample, wherein the one or more SNPs are SNPs selected by any of the methods described herein. The invention also provides a collection of a plurality of different SNP profiles each paired with a specific nucleic acid source, where (i) said collection is recorded on a substrate and (ii) the SNP profile includes the genotypes of the SNPs selected according to any of the methods described herein. In some embodiments, the substrate is a computer readable medium.

[00110] The design of suitable probe arrays for analysis of predetermined polymorphisms and interpretation of the hybridization patterns is described in detail in WO 95/11995; EP 717,113; and WO 97/29212. Such arrays typically contain first and second groups of probes which are designed to be complementary to different allelic forms of the polymorphism. Each group contains a first set of probes, which is subdivided into subsets, one subset for each polymorphism. Each subset contains probes that span a polymorphism and proximate bases and are complementary to one allelic form of the polymorphism. Thus, within the first and second probe groups there are corresponding subsets of probes for each polymorphism. The hybridization patterns of these probes to target samples can be analyzed by footprinting or cluster analysis, as described above. For example, if the first and second probe groups contain subsets of probes respectively complementary to first and second allelic forms of a polymorphic site spanned by the probes, then on hybridization of the array to a sample that is homozygous for the first allelic form all probes in the subset from the first group show specific hybridization, whereas probes in the subset from the second group that span the polymorphism show only mismatch hybridization. The mismatch hybridization is manifested as a footprint of probe intensities in a plot of normalized probe intensity (i.e., target/reference intensity ratio) for the subset of probes in the second group. Conversely, if the target sample is homozygous for the second allelic form, a footprint is observed in the normalized hybridization intensities of probes in the subset from the first probe group. If the target sample is heterozygous for both allelic forms, then a footprint is seen in normalized probe intensities from subsets in both probe groups, although the depression of intensity ratio within the footprint is less marked than in footprints observed with homozygous alleles.
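The footprint logic just described can be sketched as a small decision rule. The sketch assumes that normalized intensities (target/reference ratios) are already available for the probe subset of each group at one polymorphic site, and that a footprint can be recognized as any ratio falling below a fixed cutoff. The 0.8 cutoff, the function names, and the example ratios are hypothetical; practical footprint or cluster analysis would be considerably more involved.

```python
def has_footprint(ratios, cutoff=0.8):
    """Report whether any normalized intensity in the subset dips below cutoff."""
    return min(ratios) < cutoff


def call_genotype(group1_ratios, group2_ratios, cutoff=0.8):
    """Call a genotype at one site from the two allele-specific probe subsets.

    group1_ratios: ratios for probes complementary to the first allelic form.
    group2_ratios: ratios for probes complementary to the second allelic form.
    A footprint only in group 2 means the sample matches the first form, and
    vice versa; footprints in both subsets indicate a heterozygote.
    """
    fp1 = has_footprint(group1_ratios, cutoff)
    fp2 = has_footprint(group2_ratios, cutoff)
    if fp1 and fp2:
        return "heterozygous"
    if fp2:
        return "homozygous_first"
    if fp1:
        return "homozygous_second"
    return "no_call"


if __name__ == "__main__":
    print(call_genotype([1.0, 0.95, 1.05], [1.0, 0.2, 0.9]))   # first form
    print(call_genotype([1.0, 0.6, 0.95], [0.95, 0.55, 1.0]))  # both forms
```

A single cutoff is used for both cases here even though the heterozygous footprint is shallower; a real analysis might use the depth of the depression to separate the two.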

[00111] Alternatively, the first and second groups of probes can contain first, second, third and fourth probe sets. Each of the probe sets can be subdivided into subsets, one for each polymorphism to be analyzed by the array. The first set of probes in each group spans a polymorphic site and proximate bases and is complementary to one allelic form of the site. The second, third and fourth sets each have a corresponding probe for each probe in the first probe set, which is identical to a corresponding probe from the first probe set except at the interrogation position, which occurs in the same position in each of the four corresponding probes from the four probe sets, and is occupied by a different nucleotide in the four probe sets.
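The relationship among four corresponding probes can be made concrete with a short sketch that takes one probe from the first set and derives the other three by substituting each remaining nucleotide at the interrogation position. The function name, the 0-based index convention, and the example sequence are illustrative assumptions.

```python
def four_probe_set(probe, interrogation_index):
    """Return four probes that differ only at the interrogation position.

    The first element is the input probe (first probe set); the remaining
    three carry each of the other nucleotides at interrogation_index
    (0-based), corresponding to the second, third and fourth probe sets.
    """
    probe = probe.upper()
    if any(base not in "ACGT" for base in probe):
        raise ValueError("probe must contain only A, C, G, T")
    if not 0 <= interrogation_index < len(probe):
        raise ValueError("interrogation_index is outside the probe")
    original = probe[interrogation_index]
    variants = [
        probe[:interrogation_index] + base + probe[interrogation_index + 1:]
        for base in "ACGT"
        if base != original
    ]
    return [probe] + variants


if __name__ == "__main__":
    print(four_probe_set("ACGTA", 2))
```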

[00112] Analysis of the hybridization pattern of a nucleic acid array to a nucleic acid sample indicates which allelic form is present at some or all of the SNP sequences represented on the array. Thus, an individual can be characterized with a polymorphic profile representing allelic variants of the SNPs selected by the methods described herein.

[00113] Arrays of probes immobilized on supports can be synthesized by various methods. A preferred method is VLSIPS™ (see Fodor et al., 1991; Fodor et al., 1993, Nature 364, 555-556; McGall et al., U.S. Ser. No. 08/445,332; U.S. Pat. No. 5,143,854; EP 476,014, which are incorporated by reference in their entirety herein), which entails the use of light to direct the synthesis of oligonucleotide probes in high-density, miniaturized arrays. Algorithms for design of masks to reduce the number of synthesis cycles are described by Hubbel et al., U.S. Pat. No. 5,571,639 and U.S. Pat. No. 5,593,839, which are incorporated by reference in their entirety herein. Arrays can also be synthesized in a combinatorial fashion by delivering monomers to cells of a support by mechanically constrained flowpaths. See Winkler et al., EP 624,059, incorporated by reference in its entirety herein. Arrays can also be synthesized by spotting monomer reagents onto a support using an ink jet printer. See, e.g., EP 728,520, incorporated by reference in its entirety herein.

[00114] After hybridization of control and target samples to an array containing one or more probe sets as described above and optional washing to remove unbound and nonspecifically bound probe, the hybridization intensity for the respective samples is determined for each probe in the array. For fluorescent labels, hybridization intensity can be determined by, for example, a scanning confocal microscope in photon counting mode. Appropriate scanning devices are described in, e.g., U.S. Pat. No. 5,578,832; and U.S. Pat. No. 5,631,734, incorporated by reference in their entirety herein.

[00115] In other embodiments, the presence of an SNP allele, selected according to the methods described herein, can be detected using a primer extension reaction or amplification reaction. For example, a nucleic acid sample containing (or suspected of containing) a target nucleic acid molecule can be contacted with an oligonucleotide primer that, upon further contact with a polymerase, can be extended up to and, if desired, beyond the position of the SNP. In addition, the nucleic acid sample can be contacted with an amplification primer pair, comprising a first primer and a second primer, which selectively hybridize to complementary strands of a target nucleic acid molecule and, in the presence of polymerase, allow for generation of an amplification product. For convenience, the primers of an amplification primer pair are referred to as a "first primer" and a "second primer"; however, reference herein to a "first primer" or a "second primer" is not intended to indicate any importance, order of addition, or the like. It will be further recognized that an amplification primer pair requires that the first and second primer comprise what are commonly referred to as a forward primer and a reverse primer.

[00116] A primer extension or PCR amplification reaction can be designed such that the presence of a particular nucleotide at an SNP position can be determined by the presence or size of the extension and/or amplification product, in which case the SNP can be determined using a method such as gel electrophoresis, capillary gel electrophoresis, or mass spectrometry; or the amplification product can be sequenced to determine the nucleotide at the SNP position. In addition, the SNP can be detected indirectly, for example, by further contacting the sample with a detector oligonucleotide, which can selectively hybridize to a nucleotide sequence of the first amplification product comprising the SNP position; and detecting selective hybridization of the detector oligonucleotide, as above.

[00117] Various endpoint detection formats are known in the art and can be applied to the present methods. For example, PCR can be performed using TaqMan™ reagents, followed by reading the plates at the endpoint. Molecular beacons, Amplifluor™ or TriStar™ reagents and methods similarly can be used (Stratagene; Intergen). Amplification products also can be detected using an ELISA format, for example, using a design in which one primer is biotinylated and the other contains digoxigenin. The amplification products are then bound to a streptavidin plate, washed, reacted with an enzyme-conjugated antibody to digoxigenin, and developed with a chromogenic, fluorogenic, or chemiluminescent substrate for the enzyme. Alternatively, a radioactive method can be used to detect generated amplification products, for example, by including a radiolabeled deoxynucleoside triphosphate in the amplification reaction, then blotting the amplification products onto DEAE paper for detection. In addition, if one primer is biotinylated, then streptavidin-coated scintillation proximity assay plates can be used to measure the PCR products. Additional methods of detection can use a chemiluminescent label, for example, a lanthanide chelate such as used in the DELFIA™ assay (Pall Corp.), an electrochemiluminescent label such as ruthenium tris-bipyridyl (ORI-GEN), or a fluorescent label, for example, using fluorescence correlation spectroscopy.

[00118] An assay system that is commercially available and can be used to identify a nucleotide occurrence of one or more SNPs is the SNP-IT™ assay system (Orchid BioSciences, Inc.; Princeton, NJ). In general, the SNP-IT™ method is a three step primer extension reaction. In the first step, a target nucleic acid molecule is isolated from a sample by hybridization to a capture primer, which provides a first level of specificity. In a second step, the capture primer is extended from a terminating nucleotide triphosphate at the target SNP site, which provides a second level of specificity. In a third step, the extended nucleotide triphosphate can be detected using a variety of known formats, including, for example, by direct fluorescence, indirect fluorescence, an indirect colorimetric assay, mass spectrometry, or fluorescence polarization. Reactions conveniently can be processed in 384 well format in an automated format using a SNPstream™ instrument (Orchid BioSciences, Inc.).

[00119] Various methods for genotyping SNP alleles, selected as described herein, are readily adaptable to high throughput assays. For example, an amplification reaction such as PCR can be performed using inexpensive robotic thermocyclers for a specified number of cycles, then the amplification product generated can be determined at the endpoint of the reaction. Furthermore, the methods can be performed in a multiplex format, for example, using differentially labeled oligonucleotide probes, or performing oligonucleotide ligation assays that result in different sized ligation products, or amplification reactions that result in different sized amplification products. In another example, high-throughput mass spectrometry is used to detect SNP alleles in a target nucleic acid sample. Mass spectrometric methods for SNP genotyping are described in, e.g., U.S. Patent Nos. 7,132,519 and 6,994,998; and U.S. Patent Application No. 20060275789.

[00120] Where hybridization-based methods are used, high stringency conditions are those that result in perfect matches remaining in hybridization complexes, while imperfect matches melt off. Similarly, low stringency conditions are those that allow the formation of hybridization complexes with both perfect and imperfect matches. High stringency conditions are known in the art; see for example Maniatis et al. (1989), Molecular Cloning: A Laboratory Manual, 2d Edition; and Short Protocols in Molecular Biology, ed. Ausubel, et al. Stringent conditions are sequence-dependent and will be different in different circumstances. Longer sequences hybridize specifically at higher temperatures. An extensive guide to the hybridization of nucleic acids is found in Tijssen (1993), Techniques in Biochemistry and Molecular Biology: Hybridization with Nucleic Acid Probes, "Overview of principles of hybridization and the strategy of nucleic acid assays." Generally, stringent conditions are selected to be about 5-10°C lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength, pH and nucleic acid concentration) at which 50% of the probes complementary to the target hybridize to the target sequence at equilibrium (as the target sequences are present in excess, at Tm, 50% of the probes are occupied at equilibrium). Stringent conditions will be those in which the salt concentration is less than about 1.0 M sodium ion, typically about 0.01 to 1.0 M sodium ion concentration (or other salts) at pH 7.0 to 8.3 and the temperature is at least about 30°C for short probes (e.g., 10 to 50 nucleotides) and at least about 60°C for long probes (e.g., greater than 50 nucleotides). Stringent conditions may also be achieved with the addition of destabilizing agents such as formamide. In another embodiment, less stringent hybridization conditions are used; for example, moderate or low stringency conditions may be used, as are known in the art. See, e.g., Maniatis and Ausubel, supra, and Tijssen, supra.
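The "5-10°C below Tm" guideline can be illustrated numerically. The sketch below estimates Tm with the Wallace rule of thumb (2°C per A or T, 4°C per G or C), which is not taken from this specification and is only a rough approximation for short oligonucleotides at high salt; it then reports the temperature window lying 5 to 10°C below that estimate. The function names and the example probe are hypothetical.

```python
def wallace_tm(seq):
    """Rough Tm estimate in degrees C for a short oligo (Wallace rule).

    This is an assumed stand-in for a measured or properly modeled Tm;
    it ignores salt, probe concentration and sequence context.
    """
    seq = seq.upper()
    if not seq or any(base not in "ACGT" for base in seq):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def stringent_window(tm):
    """Return (low, high) temperatures lying 10 and 5 degrees C below Tm."""
    return (tm - 10.0, tm - 5.0)


if __name__ == "__main__":
    probe = "ACGTACGTACGTACGT"  # hypothetical 16-mer
    tm = wallace_tm(probe)
    print(tm, stringent_window(tm))
```

For longer probes, or where salt and formamide concentrations matter, a nearest-neighbor or salt-adjusted Tm model would be a more appropriate input to the same window calculation.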

[00121] In a further aspect, the invention provides a computer-readable storage medium containing a set of instructions for a general purpose computer having a user interface comprising a display unit, the set of instructions comprising: (a) logic for inputting values from analysis of a sample by any of the methods described herein; and (b) a display routine for displaying the results of the input values with said display unit. In some embodiments, the set of instructions further comprises a comparison routine for comparing the inputted values with a database; and wherein the display routine further comprises logic for displaying the results of the comparison routine.

[00122] In still a further aspect, the invention provides an electronic signal or carrier wave that is propagated over the Internet between computers comprising a set of instructions for a general purpose computer having a user interface comprising a display unit, the set of instructions comprising a computer-readable storage medium containing a set of instructions for a general purpose computer having a user interface comprising a display unit, the set of instructions comprising: (a) logic for inputting values from analysis of a sample by any of the methods described herein; and (b) a display routine for displaying the results of the input values with said display unit. In some embodiments, the set of instructions further comprises a comparison routine for comparing the inputted values with a database; and wherein the display routine further comprises logic for displaying the results of the comparison routine.

VI. Business Methods

[00123] In some embodiments, the methods described herein utilize and apply a system that is able to associate one or more phenotypes (i.e., a biological state) with the presence, in a target sample nucleic acid, of one or more SNP alleles selected for detection by the methods described herein. In one embodiment, the system relies on an integrated, reproducible, sample preparation, separation, and genotyping system with informatics. The genotyping system can employ any of the SNP genotype assays known in the art (e.g., nucleic acid array hybridization) including those described herein. This system will serve as the foundation for the discovery of patterns of SNP profiles that reflect and differentiate biological states specific for various states of health and disease. These patterns of SNP profiles that reflect and differentiate biological states will then be utilized in clinically useful formats and in research contexts. Clinical applications will include detection of disease; distinguishing disease states to inform prognosis, selection of therapy, and the prediction of therapeutic response; disease staging; identification of disease processes; prediction of efficacy; prediction of adverse response; monitoring of therapy associated efficacy and toxicity; and detection of recurrence.

[00124] For example, a business entity (alone or with collaborators) collects a representative sample set of case samples and control samples. The case samples will be those wherein a patient exhibits a particular disease state or other phenotype. For example, the case samples may be those where a patient exhibits a response to a drug. Conversely, the control samples will be collected from patients that do not exhibit the phenotype under study, such as those that do not have the disease or response to a drug. Preferably more than 10 case and 10 control samples are collected for use. Preferably more than 20 case and 20 control samples, preferably more than 50 case and 50 control samples, preferably more than 100 case and 100 control samples, and most preferably more than 500 case and 500 control samples are collected.

[00125] The case and control samples are assayed to identify profiles of SNPs, selected as described herein, that are present in the case and control samples.

[00126] In some embodiments, the assay identifies the presence of more than about 10, 50, or 100 SNP alleles selected as described herein, or more than 200 SNP alleles, or more than 500 SNP alleles, or more than 1000 SNP alleles, or more than 5000 SNP alleles, or more than 10,000 SNP alleles, or more than 100,000 SNP alleles. The business takes advantage of the presence of (or absence of) a pattern of SNP profiles repeatedly found to be in the cases in a pattern distinct from the controls.

[00127] In some embodiments, an early assay, such as the first assay, is followed by a later assay. The early assay will normally be used in initial identification of the selected SNP profiles that identify or separate cases from controls. The later assay is adjusted according to parameters that can focus diagnostics or evaluation of SNP allele subsets of interest, such as those SNP alleles for which there are significant differences in frequencies between case samples and control samples. The parameters can be determined by, for example, an early assay which may identify the subset of SNP alleles, which may be on one technology platform, and a later assay on the same or a different platform.

[00128] In one embodiment, bioinformatics systems are utilized to identify the differences in the SNP profile patterns in the case and control samples. Such techniques may be preceded by various data cleanup steps. Patterns will be composed of the relative representation of SNP alleles, selected for detection as described herein, the collective profile of which will have higher prognostic or diagnostic utility than the presence or absence of any single SNP allele.
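As an illustration of comparing allele frequencies between case and control samples, the sketch below counts one allele at a single SNP in each group and computes a Pearson chi-square statistic on the resulting 2x2 allele-count table, without continuity correction. The choice of this test, the function names, and the toy genotypes are assumptions made for the example; the specification does not name a particular statistical method.

```python
def allele_counts(genotypes, allele):
    """Return (copies of allele, copies of any other allele) across genotypes.

    genotypes is a list of allele pairs, e.g. [("A", "G"), ("A", "A")].
    """
    hits = sum(g.count(allele) for g in genotypes)
    return hits, 2 * len(genotypes) - hits


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a zero margin means the table carries no contrast
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    cases = [("A", "A"), ("A", "G"), ("A", "A"), ("A", "G")]
    controls = [("G", "G"), ("A", "G"), ("G", "G"), ("G", "G")]
    a, b = allele_counts(cases, "A")
    c, d = allele_counts(controls, "A")
    print(a / (a + b), c / (c + d), chi_square_2x2(a, b, c, d))
```

With many SNP alleles tested at once, any per-SNP statistic of this kind would need correction for multiple comparisons, and the sample sizes in the toy data are far too small for a chi-square approximation to be reliable.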

[00129] Automated systems will generally be applied in the identification of the patterns of SNP profiles that distinguish cases and controls. The measurement of patterns of multiple signals will enable the identification of subtle differences in biological state and make the identification of that state more robust and less subject to biological noise.

[00130] In some embodiments, the business uses the differential pattern of SNP profiles between cases and controls to identify the disease state or to predict the development of the disease state based on the SNP profile determined from a patient sample in, for example, a diagnostic setting.

[00131] The marketing of associated products can take a number of forms. For example, it may be that the developer actually markets the instruments and assays into the diagnostic research market. In alternative embodiments, the developer of the patterns will partner with, for example, a large diagnostic company that will market those products made by the developer, alone or in combination with its own products. In alternative embodiments, the developer of the patterns licenses the intellectual property in the patterns to a third party and derives revenue from licensing income arising from the pattern information.

[00132] The business method herein can obtain revenue by various means, which may vary over time. Such sources may include direct sale revenue of products, upfront license fees, research payment fees, milestone payments (such as upon achievement of sales goals or regulatory filings), database subscription fees, and downstream royalties, and may come from various sources including government agencies, academic institutions and universities, biotechnology and pharmaceutical companies, insurance companies, and health care providers.

[00133] Often, diagnostic services hereunder will be offered by clinical reference laboratories or by way of the sale of diagnostic kits. Clinical reference laboratories generally process large numbers of patient samples on behalf of a number of care givers and/or pharmaceutical companies. Such reference laboratories in the United States are normally qualified under CLIA and/or CAP regulations. Of course, other methods may also be used for marketing and sales, such as direct sales of kits that are FDA-approved or equivalently approved products. In some cases the developer of the pattern content will license the intellectual property and/or sell kits and/or reagents to a reference laboratory that will combine them with other reagents and/or instruments in providing a service.

[00134] In the short term, the business methods disclosed generate revenue by, for example, providing application specific research or diagnostic services to third parties to discover and/or market the SNP profile patterns. Examples of third-parties include customers who purchase diagnostic or research products (or services for discovery of patterns), licensees who license rights to pattern recognition databases, and partners who provide samples in exchange for downstream royalty rights and/or up front payments from pattern recognition. Depending on the fee, diagnostic services may be provided on an exclusive or non-exclusive basis.

[00135] Revenue can also be generated by entering into exclusive and/or non-exclusive contracts to provide SNP allele profiling of patients and populations. For example, a company entering clinical trials may wish to stratify a patient population according to, for example, drug regimen, effective dosage, or otherwise. Stratifying a patient population may increase the efficacy of a clinical trial (by removing, for example, non-responders), thus allowing the company to enter the market sooner or allowing a drug to be marketed with a diagnostic test that identifies patients that may have an adverse response or be non-responsive. In addition, insurance companies may wish to obtain a SNP profile of a potential insured and/or to determine if, for example, a drug or treatment will be effective for a patient.

[00136] In the long term, revenue may be generated by alternative methods. For example, revenue can be generated by entering into exclusive and/or non-exclusive drug discovery contracts with drug companies (e.g., biotechnology companies and pharmaceutical companies). Such contracts can provide for downstream royalties on a drug based on the identification or verification of drug targets (e.g., a particular protein or set of polypeptides associated with a phenotypic state of interest), or on the identification of a subpopulation in which such drug should be utilized. Alternatively, revenue may come from a license fee on a diagnostic itself. The diagnostic services, patterns, and tools herein can further be provided to a pharmaceutical company in exchange for milestone payments or downstream royalties. Revenue may also be generated from the sale of disposable fluidics devices, disposable microfluidics devices, or other assay reagents or devices in, for example, the research market, diagnostic market, or in clinical reference laboratories. Revenue may also be generated from licensing of application-specific software or databases. Revenue may, still further, be generated based on royalties from technology platform providers who may license some or all of the proprietary technology. For example, a nucleic acid array platform provider may license the right to further distribute software and computer tools and/or SNP profile patterns.

[0090] The following specific examples are to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever. Without further elaboration, it is believed that one skilled in the art can, based on the description herein, utilize the present invention to its fullest extent. All publications cited herein are hereby incorporated by reference in their entirety. Where reference is made to a URL or other such identifier or address, it is understood that such identifiers can change and particular information on the internet can come and go, but equivalent information can be found by searching the internet. Reference thereto evidences the availability and public dissemination of such information.

EXAMPLES

Example 1 Genome-Wide Identification of SNP alleles having a high probability of selective pressure

[0091] We developed a probabilistic model, termed the LD decay (LDD) test, to distinguish large differences in linkage disequilibrium (LD) surrounding a given SNP. By examining individuals homozygous for a given SNP, the fraction of inferred recombinant chromosomes (FRC) at adjacent polymorphisms was directly computed without the need to infer haplotype. We used the expected increase with distance in FRC surrounding a selected allele to identify such alleles. Importantly, the method is insensitive to local recombination rate, because local rate will influence the extent of LD surrounding both major and minor alleles, whereas the method looks for LD differences between alleles.

[0092] As two well characterized examples, the patterns of FRCs surrounding the selected alleles DRD4 7R, a dopamine receptor, and G6PD V202M, a variant conferring malaria resistance in African populations, are strong indicators of selection (Fig. 2). The new allele attained a high population frequency yet still retained a strong local LD block in comparison with the alternative allele. More importantly, the progressive decay of this strong LD with distance from the selected allele is further evidence of selection acting on such sites. One observes this pattern because the number of possible meiotic recombinations not eliminating the advantageous allele increases as a function of distance from the selected site. The overall "rate" of LDD is influenced by the intraallelic coalescence time of the inferred selection and local recombination rate. For example, the G6PD V202M variant exhibits LDD similar to DRD4 7R, although the decay is 14 times slower (Fig. 2). This result is consistent, however, with the calculated 5- to 10-fold younger allele age of G6PD V202M and the 2- to 4-fold increase in recombination rate at the DRD4 locus.

[0093] Although the expected/predicted LDD due to ongoing recombination can be approximated by various linear or exponential curves (depending on the assumptions made regarding recombination), we used a standard sigmoidal curve (Fig. 2), consistent with prior work on allele age calculations and the acknowledgment that inferred recombination has a maximum value of 0.5.
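As a non-limiting sketch of such a sigmoidal expectation, the following logistic curve rises with distance from the selected site and saturates at the maximum inferred recombination value of 0.5. The logistic form and the midpoint and steepness values are hypothetical placeholders chosen for illustration; the disclosure does not specify the parameters of the curve actually used.

```python
import math


def expected_frc(distance_kb, midpoint_kb=250.0, steepness=0.02, max_frc=0.5):
    """Expected fraction of recombinant chromosomes at a given distance.

    A logistic curve in distance that approaches max_frc (0.5) far from the
    selected site. midpoint_kb and steepness are illustrative assumptions.
    """
    return max_frc / (1.0 + math.exp(-steepness * (distance_kb - midpoint_kb)))
```

With these placeholder values the curve passes through 0.25 at 250 kb and never exceeds 0.5, matching the constraint stated above.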

[0094] A simplified example of our computational approach is shown in Fig. 1. First, we sorted each SNP of interest (marked as "S" in Fig. 1) by genotype homozygosity for the major and minor alleles of the SNP of interest, while heterozygous genotypes were not further analyzed. This method allowed the direct measure of adjacent inferred FRC without the need to infer haplotype. All adjacent SNPs within ±500 kb were binned according to the separation distance from site S (Fig. 1A, arrowhead). For SNP alleles at each of the adjacent sites throughout the neighboring ±500 kb, we then computed the inferred FRC by dividing the number of heterozygous genotypes for each of the adjacent SNP sites by the total number of chromosomes in the dataset (i.e., 2 chromosomes/diploid genome), assuming the S variant arose on a single chromosome (haplotype). The distance away from S and each associated FRC was then recorded as a value pair into a list for the SNP of interest (Fig. 1B). From this list, average log likelihood (ALnLH) was computed based on the sum of the square of the differences between the input model and the actual data, with uniform prior and a deviation function to account for experimental and recombination variation. This process was then repeated for all SNP sites, using a "sliding-window" of 1 Mb (containing 150-300 SNPs).
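The sorting and FRC steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used to generate the tables: genotypes are assumed to be coded 0 (homozygous major), 1 (heterozygous), and 2 (homozygous minor), the data are assumed complete, and "total number of chromosomes" is interpreted as twice the number of individuals in the homozygous group under analysis. The function and parameter names are hypothetical.

```python
def frc_pairs(genotypes, positions, focal_index, focal_genotype,
              window_bp=500_000):
    """Return (distance_bp, FRC) value pairs around one SNP of interest.

    genotypes: one list per individual, one genotype code per SNP (0/1/2).
    positions: base-pair coordinate of each SNP, same order as the codes.
    focal_index: index of the SNP of interest ("S").
    focal_genotype: 0 to analyze major-allele homozygotes, 2 for minor.
    """
    # Keep only individuals homozygous for the chosen allele at S;
    # heterozygotes at S are not analyzed further.
    group = [g for g in genotypes if g[focal_index] == focal_genotype]
    if not group:
        return []
    n_chromosomes = 2 * len(group)  # 2 chromosomes per diploid genome
    focal_pos = positions[focal_index]
    pairs = []
    for j, pos in enumerate(positions):
        distance = abs(pos - focal_pos)
        if j == focal_index or distance > window_bp:
            continue
        # A heterozygous neighbor implies one inferred recombinant chromosome.
        n_het = sum(1 for g in group if g[j] == 1)
        pairs.append((distance, n_het / n_chromosomes))
    return pairs
```

Calling the function once with `focal_genotype=0` and once with `focal_genotype=2` yields the two lists that are compared between the major and minor alleles. The ALnLH scoring step is not sketched here because the deviation function is not specified in this section.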

[0095] On average, the distance between adjacent SNPs in the Perlegen data set is 2 kb. See Hinds et al. supra. These data were generated by determining genotypes for 71 unrelated individuals from 3 populations: 24 European Americans, 23 African Americans, and 24 Han Chinese from the Los Angeles area. The total Perlegen data set was initially analyzed. Approximately 68% of the 1,586,383 Perlegen SNP sites had a minor allele frequency of >10%. For approximately 49% of the total SNP sites, minor allele homozygosity occurred in >5% of the genotypes, which we took as our cut-off for analysis (0.22 allele frequency).

[0096] We broadly sampled the range of putatively selected allele LDD, defined as having an ALnLH greater than 1 SD away from the genomic average ALnLH (Av-ALnLH) (i.e., the nongray area in Fig. 2). This cut-off excluded some well documented selected alleles such as the telomeric gene DRD4 7R, which are in regions with too few neighboring SNPs currently typed in the database to stringently distinguish such alleles from average LDD. However, the LDD test can be reapplied to such regions following high density SNP-typing/resequencing.
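The correspondence between the 5% minor-allele homozygosity cut-off and the quoted 0.22 allele frequency can be checked with a short calculation. The sketch assumes Hardy-Weinberg proportions, under which the homozygote frequency is the square of the allele frequency. That assumption is ours and is not stated in the text, although the arithmetic agrees with the quoted figure.

```python
import math


def allele_freq_from_homozygosity(homozygote_fraction):
    """Allele frequency implied by a homozygote fraction.

    Assumes Hardy-Weinberg proportions: homozygote fraction = q ** 2.
    """
    return math.sqrt(homozygote_fraction)


# The 5% homozygosity cut-off corresponds to q of roughly 0.22.
cutoff_frequency = allele_freq_from_homozygosity(0.05)
```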

[0097] For the purpose of this analysis, we took "recently" selected alleles, which include a number of loci such as phenylthiocarbamide and lactase in addition to G6PD (Fig. 2), to be those that can be distinguished given the Perlegen resolution, coalescent time, and local recombination frequency, as well as their high (>0.22) allele frequency. Although "hotspots" for recombination likely occur in human DNA, the large-scale (megabase) variation in recombination frequency in most nontelomeric euchromatic regions is known not to vary beyond 2- to 4-fold. Selected alleles detected with our approach, therefore, should have estimated coalescent times up to 10,000 years in areas of high recombination to >40,000 years.

[0098] We set a detection threshold at an ALnLH of >2.6 SD (>99.5th percentile) from the genome average, or 0.61 for the Perlegen data set. The calculated genomewide Perlegen ALnLH scores exhibit an average of 0.043, but with a SD of 0.22. Hence, an ALnLH of 0.61 represents a highly unusual genetic architecture. In total, 25,386 (1.6%) of the 1.6 million Perlegen SNPs met these criteria.

[0099] A display of regions of inferred selection along all chromosomes for the Perlegen (PLG) data set is shown in Fig. 8.
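The threshold arithmetic can be reproduced as a simple check. The sketch below computes the cut-off as the genome average plus 2.6 SD, which gives about 0.615 for the Perlegen figures quoted above, in line with the stated 0.61. The function names and the dictionary layout of the scores are hypothetical.

```python
def detection_threshold(mean_score, sd_score, n_sd=2.6):
    """ALnLH cut-off placed n_sd standard deviations above the genome mean."""
    return mean_score + n_sd * sd_score


def snps_above_threshold(scores, threshold):
    """Return identifiers of SNPs whose ALnLH exceeds the threshold.

    scores maps a SNP identifier to its ALnLH value.
    """
    return sorted(snp for snp, value in scores.items() if value > threshold)


# Perlegen values quoted in the text: mean 0.043, SD 0.22.
perlegen_cutoff = detection_threshold(0.043, 0.22)

# Fraction of Perlegen SNPs meeting the criteria, as a percentage.
percent_selected = 100 * 25_386 / 1_586_383
```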

[00100] As an example of representative data, Fig. 3 shows the local genetic architecture centered at a 25-kb region defining the promoter of the Reticulon gene (RTN1) on chromosome 14 (Online Mendelian Inheritance in Man [OMIM] accession no. 600865). As shown in Fig. 3, the randomness for neighboring recombinant chromosomes for the major RTN1 allele at this site exemplifies the genome average, with little long-range LD. In contrast, the minor RTN1 allele at this site closely matches the model of LDD for a positively selected allele (PSML). The large LD block around Perlegen SNP rs9323357 and its disproportionately high allelic frequency (35%) suggest a possible recent selective event at the RTN1 locus. Interestingly, this gene encodes a neuroendocrine-specific protein thought to affect cellular amyloid-β and the formation of amyloid plaques in Alzheimer's disease. See He et al. (2004), Nat. Med., 10:959-965. Indeed, the reticulon gene family has recently been implicated in a number of neurodegenerative diseases, including amyotrophic lateral sclerosis. See Han et al. (2006), Cell Mol Life Sci., 63(7-8):877-89; and Fergani et al. (2005), Neurodegener Dis., 2(3-4):185-194.

[00101] Although the Perlegen data set has high SNP resolution, population depth is limited. The recently released HapMap data set (Phase I freeze), conversely, had fewer SNPs (1.0 million) but deeper population coverage: 90 European ancestry (CEU), 90 African (Yoruba) ancestry (YRI), 45 Han Chinese (CHB), and 45 Japanese (JPT) individuals. This data set allowed for an independent confirmation of our results. In addition, the greater depth of the HapMap data set allowed better definition of potential population-specific selective events, which accounted for only 22% of the Perlegen clusters.
[00102] Calculations of ALnLH were conducted separately on all four HapMap populations, again using a cut-off of >2.6 SD (>99.5th percentile) from the genome average (ranging from 0.51 to 0.71 for YRI and CEU populations, respectively). SNPs identified by these criteria for each of these populations are listed in Tables 3 and 4 (CEU), 5 and 6 (CHB), 7 and 8 (JPT), and 9 and 10 (YRI). Merging all four HapMap populations yielded a total of 20,786 SNPs with evidence of selective pressure, similar to the Perlegen data set. Inferred selection for the four HapMap populations is shown in Fig. 8. Because there is only partial overlap between SNPs used by the Perlegen and HapMap efforts, both data sets were aligned along the Human Genome (hg17) sequence (Kent et al. (2002), Genome Res., 12:996-1006), using a 10-kb window for assigning regions. Encouragingly, there was a 77% (YRI) to 96% [Han Chinese (CHB)] overlap between the inferred selected regions identified by the Perlegen and HapMap data sets. For example, the RTN1 promoter region originally identified in the Perlegen data set shows evidence for selection in all four HapMap populations. See Figs. 3C and 3D. Interestingly, the LDD at this locus is greater in the YRI population, as expected for an older population that has not undergone the severe recent bottlenecks inferred for Asian and European populations. In general, regions of inferred selection that are found in all populations exhibit this African-specific faster LDD. The genomic distribution of inferred selection using the LDD test was in general random, with no bias toward or against other unusual genomic regions such as segmental duplications or inversions. Although inversions can suppress recombination and produce large LD blocks, large (>100 kb) inversions are not common in human DNA, do not produce a gradual LDD as observed for selected alleles, and would not eliminate recombination at the high frequency of alleles reported in this work. For example, a recently reported large chromosome 17 inversion (Stefansson et al. (2005), Nat. Genet., 37:129-137) produced a distinct pattern of flat LD clearly distinguishable from the alleles identified in this work (data not shown). The few inferred inversions detected by our analysis are not excluded, because their high frequency implies that selection may be maintaining them in the population.

[00103] There is a slight underrepresentation of detected selection in high-recombination areas such as telomeric regions, as expected given the particular parameters used for this initial screen. One strikingly nonrandom distribution, however, is a 2-fold overrepresentation of such alleles on the X chromosome (Fig. 8; P ≪ 0.00001). Given that overall population recombination frequency on the X chromosome is not known to be significantly lower than the genome average, this result is consistent with the hypothesis that alleles on the "haploid" X chromosome will be under stronger selective pressure than those on diploid autosomes.

[00104] We asked if, in addition to selection, there are other mechanisms that could produce these unusual long-range genetic architectures. It is commonly assumed that one summary statistic is often insufficient to unambiguously distinguish recent selection from other population events. Many population-genetics tests, indeed, cannot distinguish selection from bottlenecks/admixture. This lack of discrimination is because of both a lack of acknowledgment of LD structure in these tests and the usual examination of small (≪1 Mb) genomic regions.

[00105] To determine whether other population parameters could influence the LDD test for selection, we used the HapMap European ancestry (CEU) population. We chose this population for analysis as a typical population in which recent bottlenecks/admixtures have occurred. The CEU chromosome 7 ALnLH values are shown in Fig. 5A. We generated a randomized chromosome 7 data set from this population by redistributing each SNP independently in a random uniform way and recomputed the ALnLH scores. The above process was repeated 1000 times. Typical scores from a single simulation are presented in Fig. 5B. Pairwise LD measurements in D' statistics were obtained by randomly sampling 150 SNP pairs at distances of 5, 20, 60, or 160 kb. To simulate an extreme bottleneck, original HapMap genotypes of 5 unrelated individuals, retaining all currently observed LD blocks on chromosome 7, were chosen (Fig. 5D).

[00106] In addition to selection, we asked whether there are other mechanisms that could produce unusual long-range genetic architectures. First, one can ask if the threshold is unambiguously different from random. Given that many individual SNPs have ALnLH values calculated, and that the Perlegen and HapMap SNPs were chosen because of high heterozygosity, there is a concern that high ALnLH values could be achieved by chance. Random permutation of the actual Perlegen or HapMap data, then, is an effective way to determine whether a positive (>2.6 SD) score could be obtained by chance. Using such a simulated data set, the probability of exceeding an ALnLH of 0.2 (1 SD) is P < 0.0007 and of exceeding 0.71 (>2.6 SD) is P = 0.0 (Fig. 5B). We conclude, as expected, that chance alone cannot produce an ALnLH value of >2.6 SD.
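The probabilities quoted above are of the kind obtained by counting how often scores from the randomized data exceed a given cut-off. A minimal sketch of that counting step follows; it assumes the ALnLH scores from the permuted data set have already been computed and collected into a list, and the function name is hypothetical.

```python
def exceedance_probability(null_scores, threshold):
    """Fraction of scores from randomized data that exceed the threshold."""
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    n_above = sum(1 for score in null_scores if score > threshold)
    return n_above / len(null_scores)
```

An estimate of exactly zero from such a count means only that no exceedance was seen among the scores examined, so its resolution is limited by the number of randomized scores.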

[00107] The randomized data set (Fig. 5B) is also a reasonable model for expected human population structure, the "null" model in many population simulations. High-heterozygosity alleles are assumed to be present prior to the major coalescence of humans 50,000-100,000 years ago, and, in the absence of selection, are known to exhibit little LD at distances of >5 kb. The computed D' for our randomized data set (≈0.20 at all distances of >5 kb) is identical to that predicted for a widely used coalescent simulation of human populations, where D' = 0.5 is reached at <5 kb. See, e.g., Reich et al. (2001), Nature, 411:199-204; and Kruglyak (1999), Nat. Genet., 22:139-144. In this model, the human population is assumed to have constant effective size N = 10,000 until 5,000 generations before the present, followed by exponential expansion to 5 billion. This coalescent model has been widely used to predict the inferred LD between common SNPs in the absence of selection or bottlenecks. Indeed, it was used as the basis (and rationale) for the Perlegen and HapMap projects, since the low LD predicted by this model was experimentally observed in African ancestry populations, while observed LD was greater in European ancestry populations (D' = 0.5 at 60 kb in Reich et al. supra and the current HapMap data set; see Fig. 5A). The more extensive LD observed in European ancestry populations was interpreted as the result of an extreme bottleneck, representing an inbreeding coefficient of at least F = 0.2, corresponding to an effective population size as small as 50 individuals for 20 generations. See Reich et al. supra.

[00108] Given that the random data set is a reasonable approximation of the high-heterozygosity SNP distribution expected for an ancient (50,000-100,000 years) population, one can use it to simulate various population structures that could lead to more extensive LD. In particular, one can estimate if the chosen threshold is high enough to eliminate detection of other potential sources of LD, such as population bottlenecks and/or admixture. Two different simulations were conducted to test these possibilities.

[00109] First, 162 haplotypes from the randomized HapMap CEU dataset were "infused" with 18 copies of a single haplotype, representing a contribution of 10% (Fig. 5C). This extreme admixture/bottleneck model represents a population of 90 individuals in which 20% of the chromosomes, however, come from a single individual (10% from each homolog). It simulates the effect of both small population size and disproportionate contribution of a particular haplotype on the calculated ALnLH statistic. "Recombinations" were randomly generated from these haplotypes (once per chromosome arm), genotypes were generated in each generation randomly (for 500 generations, or approximately 10,000 "years"), and ALnLH values were calculated. Even for this extreme model, values of ALnLH only exceeded 0.2 (>1 SD) every 1.5 Mb (on average) during the first 10 generations and decreased to once every 7 Mb (on average) during the remaining 500 simulated generations, corresponding to decay of the original infused haplotype to average segments <100 kb (Figs. 5C and 6). More importantly, although there were rare cases where ALnLH reached 0.6 (P < 0.0001) during the simulation, the genetic architectures of these "admixed" SNPs were different than the inferred selected alleles (Fig. 7). As expected, while extreme bottlenecks/admixture can produce occasional "blocks" of LD with high ALnLH values, the random nature of the "overlaps" produced by this mechanism exhibits little dependence of LD decay on distance. An ALnLH of >0.71 (>2.6 SD from the average) was never observed in this extreme simulation model of population bottlenecks/admixture. Therefore, bottlenecks/admixture of a less extreme (and more likely) size for human populations will also never produce an ALnLH >2.6 SD from the genomic average ALnLH.
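Several of the figures quoted for these models can be checked with elementary calculations. The sketch below verifies the 10% contribution of the infused haplotype, the conversion of 500 generations to roughly 10,000 years, and the order of magnitude of the inbreeding coefficient for 50 individuals over 20 generations. The recursion F_t = 1 - (1 - 1/(2N))^t is the standard expression for an idealized randomly mating population and is our assumption here. The text does not state which formula underlies the F = 0.2 figure, and the 20-year generation time is inferred from the "500 generations, or approximately 10,000 years" statement.

```python
def infused_fraction(background_haplotypes, infused_copies):
    """Share of the haplotype pool contributed by the infused haplotype."""
    return infused_copies / (background_haplotypes + infused_copies)


def inbreeding_coefficient(effective_size, generations):
    """Inbreeding accumulated by drift alone (idealized population).

    Uses F_t = 1 - (1 - 1/(2N))**t, an assumption for this sketch.
    """
    return 1.0 - (1.0 - 1.0 / (2.0 * effective_size)) ** generations


def generations_to_years(generations, years_per_generation=20):
    """Convert generations to years using an assumed generation time."""
    return generations * years_per_generation
```

Under these assumptions, 18 infused copies among 180 haplotypes give a fraction of 0.10, and 50 individuals for 20 generations give F of about 0.18, in the neighborhood of the F = 0.2 quoted from Reich et al.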

[00110] A second simulation was conducted of a bottleneck of only 5 individuals, all of whom contain the currently observed selected LD blocks. Again, this simulation was constructed as an extreme test, so that more likely bottleneck sizes could be evaluated. If evidence for obtaining an ALnLH >2.6 SD from the average cannot be obtained for this simulation, it cannot be obtained by simulating less extreme bottlenecks. For this simulation, haplotypes were assembled from actual European ancestry (CEU) HapMap genotypes rather than the randomized genotypes (Fig. 5D). Again, recombinations were randomly generated, and genotypes and ALnLH values were calculated for a total of 500 generations (10,000 years). While this model simulates a more extreme bottleneck than any likely to exist in European ancestry populations, it further tests the decay of the observed unusual genetic architectures in the absence of selection. As expected, in this simulation observed ALnLH values >0.71 (>2.6 SD) rapidly decay to a level 1/10th that initially observed in actual CEU HapMap data (Fig. 5A) by generation 40 (P = 0.0017; Fig. 5D). After 400 generations, essentially no ALnLH values >0.71 were found (P = 0.00006; Fig. 5D). We conclude that no plausible bottleneck in European ancestry populations could account for the observed genetic architectures. We further conclude that, as expected, without ongoing selection, the observed LD blocks rapidly decay in length.

[00111] These simulations were conducted because prior population genetics simulations and coalescence models of population structure cannot be compared directly with the highly biased Perlegen and HapMap SNP data sets, consisting largely of high-heterozygosity SNPs. These simulations indicated that the LDD test, at the megabase scale used, effectively distinguishes between effects due to selection and effects due to demographic history. Thus, we concluded that inferred recent Darwinian selection (selective pressure) is the most likely explanation for the unusual genomic architectures surrounding the SNP alleles meeting the LDD criteria we selected.

Example 2 Genes identified within a distance of 100 kb of high ALnLH SNP alleles

[00112] The inferred selected SNPs were queried against the National Center for Biotechnology Information SNP Database (dbSNP; www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp) Build 123 for associated genes/exons within a 100-kb radius. A 100-kb radius was chosen as a reasonable first-pass distance in which a SNP could influence a gene's expression/function, given current knowledge of gene organization and regulation. See The International Human Genome Sequencing Consortium (2004), Nature, 431:931-945; and Wasserman et al. (2004), Nat. Rev. Genet., 5:276-287. Approximately 35% of the inferred selected SNPs were not within a 100-kb radius of known genes and were not analyzed further. Whether this fraction represents selection at noncoding regions or the inability to identify all potential gene regions in the current HGP assembly is unclear. In the Perlegen data set, inferred selected SNPs clustered in 1,799 genes. See Fig. 8 and Table 2. Similar results were obtained with the HapMap data sets. See Fig. 8, Table 4 (CEU), Table 6 (CHB), Table 8 (JPT), and Table 10 (YRI). A total of 123 annotated genes showed evidence for selection in both the Perlegen data set and all four HapMap populations. See Tables 12 and 13.
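The 100-kb radius lookup can be sketched as a simple interval test. This illustration assumes gene annotations are available locally as (name, start, end) tuples on the same chromosome as the SNP. It does not reproduce the dbSNP query interface, and the function name and data layout are hypothetical.

```python
def genes_near_snp(snp_position, genes, radius_bp=100_000):
    """Return names of genes lying within radius_bp of a SNP.

    genes: iterable of (name, start_bp, end_bp) on the SNP's chromosome.
    A gene qualifies if the SNP falls inside the gene or within radius_bp
    of either gene boundary.
    """
    hits = []
    for name, start, end in genes:
        if start - radius_bp <= snp_position <= end + radius_bp:
            hits.append(name)
    return hits
```

SNPs for which the function returns an empty list correspond to those described above as not lying within a 100-kb radius of known genes.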

[00113] We examined whether there are predominant biological themes represented among these selected genes, using EASE for the analysis of overrepresentation. See Hosack et al. supra. Similar EASE results were obtained for all populations. As an example, Fig. 4 shows EASE values determined for the 407 HapMap CEU selected genes classifiable under Gene Ontology (GO) Biological Process categories. See Table 4. These 407 HapMap CEU selected genes were classified into 870 overrepresented GO categories (with more than 1 category/gene in some cases). Strikingly, the 870 overrepresented GO categories are less than 1% of the total currently annotated GO categories.

[00114] Overall, the observed genes in overrepresented GO categories are not random. For example, six functional categories constitute 82% of the HapMap CEU -log(EASE) scores of >0.65, represented by numbered flags in Fig. 4. We have defined these more general functional categories to include a number of individual GO categories associated with pathogen-host interaction, reproduction, DNA metabolism/cell cycle, protein metabolism, and neuronal function. We emphasize that many genes appear in multiple GO categories, and hence exact classification is not possible. Nevertheless, the clustering of most high-scoring GO categories into one of these generally defined functional categories is striking (Fig. 4). In the 123 genes with evidence for selection in all populations (Table 13), the proportion of genes in each of these categories is as follows: reproduction, 7%; host-pathogen interaction, 10%; cell cycle, 13%; protein metabolism, 15%; neuronal function, 17%; and DNA metabolism (including putative transcription factors), 21%.

[00115] Selection for alleles in some of these categories might be anticipated, such as host-pathogen interaction and reproduction, given prior selection studies in humans and other organisms. See, e.g., Vallender et al. (2004), Annu. Rev. Genomics Hum. Genet., 1:361-385. Pathogen defense has long been suspected to be under constant evolutionary pressure. The beginning of agriculture and animal domestication 10,000 years ago not only brought domesticated animals close to humans but also established permanent human settlements. Such shifts from a hunter-gatherer nomadic lifestyle to agrarian societies likely facilitated the wide spread of infectious agents. See, e.g., Williams et al. (1991), Q. Rev. Biol., 66:1-22. Our results suggested that human populations may have encountered many selective events associated with pathogen-host interaction. Examples of genes identified under host-pathogen interaction include, e.g., CSF2, CCNT2, DEFB118, STAB1, SP1, and Zap70, and under reproduction, BIRC6, CUGBP1, DLG3, HMGCR, STS, and XRN2.

[00116] The other overrepresented GO categories contained a number of unexpected genes. For example, it has been suggested that changes in organic compound metabolism may have been influenced by increases in meat consumption by early humans. See, e.g., Finch et al. (2004), Q. Rev. Biol., 79:3-50. Overrepresented genes in protein metabolism could be the result of this shift in dietary composition and/or the profound changes associated with a restricted agrarian diet. See, e.g., Larsen et al. (1995), Annu. Rev. Anthropol., 24:185-213. The large number of selected genes under DNA metabolism is also unexpected. We suggest that many of these selected alleles may be involved in the recent inferred increase in longevity of humans. See, e.g., Caspari et al. (2004), Proc. Natl. Acad. Sci. USA, 101:10895-10900. Modifications to our immune system, increases in tumor suppression, and enhanced DNA repair (Fig. 4) are likely molecular components of our unique primate longevity. Some examples of selected genes in protein metabolism include ADAMTS19-20, APEH, PLAU, HDAC8, UBR1, and USP26, and under DNA metabolism CKN1, FANCC, RAD51C, HDAC8, PDCD8, and SMC1L1.

[00117] One of the more intriguing categories overrepresented in inferred selective events is neuronal function. We define this category to include a diverse assortment of genes, including the serotonin transporter (SLC6A4), glutamate and glycine receptors (GRM3, GRM1, and GLRA2), olfactory receptors (OR4C13 and OR2B6), synapse-associated proteins (RAPSN), and a number of brain-expressed genes with largely unknown function (ASPM, RTN1; see Fig. 4).

Table 13. Loci in which inferred selection is identified in all populations (PLG, CEU, CHB, JPT, and YRI). HP: Host-Pathogen. RP: Reproduction. DM: DNA Metabolism. CC: Cell-Cycle. PM: Protein Metabolism. NF: Neuronal Function.

[00118] A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:

1. A method of characterizing a SNP comprising determining a quantitative measure of the probability of selective pressure on the major or minor allele of the SNP, wherein at least part of said determining is performed using a computer.

2. The method of claim 1, comprising selecting a genotype for analysis based on homozygosity of the genotype for the major or minor allele of the SNP.

3. The method of claim 2, wherein the analysis comprises scoring zygosity of SNP loci within a predetermined distance of the SNP to be characterized.

4. The method of claim 3, wherein a plurality of genotypes homozygous for the major allele or a plurality of genotypes homozygous for the minor allele are analyzed.

5. The method of claim 4, wherein the analysis further comprises determining an inferred fraction of recombinant chromosomes for one or more of the SNP loci based on the scoring.

6. The method of claim 1, wherein the SNP is not selected for characterizing based on an association of the SNP with a phenotype.

7. The method of claim 1, further comprising comparing the value of the quantitative measure for the major or minor allele to a predetermined value.

8. The method of claim 1, further comprising comparing the value of the quantitative measure for the major allele to the value of the quantitative measure for the minor allele.

9. The method of claim 1, further comprising selecting or not selecting the characterized SNP for inclusion in a set of SNPs on the basis of value of the quantitative measure.

10. The method of claim 2, wherein a plurality of SNPs are characterized.

11. The method of claim 1, wherein the measure of the probability of selective pressure is determined by analyzing SNPs within a predetermined distance of the SNP to be characterized.

12. The method of claim 11, wherein the predetermined distance is such that, on average, at least an additional 50 SNPs are found in the distance.

13. The method of claim 12, wherein the predetermined distance is such that, on average, at least an additional 100 SNPs are found in the distance.

14. The method of claim 13, wherein the predetermined distance is such that, on average, at least an additional 300 SNPs are found in the distance.

15. The method of claim 11, wherein the predetermined distance is at least about 10 kilobases.

16. The method of claim 15, wherein the predetermined distance is at least about 50 kilobases.

17. The method of claim 16, wherein the predetermined distance is about 200 kilobases.

18. The method of claim 17, wherein the predetermined distance is at least about 500 kilobases.

19. The method of claim 18, wherein the predetermined distance is at least about 1000 kilobases.

20. The method of claim 10, wherein the quantitative measure is determined by analyzing SNPs within a predetermined distance of the SNP to be characterized and further comprising determining the fraction of inferred recombinant chromosomes for a plurality of the SNPs found within the predetermined distance.

21. The method of claim 20, further comprising creating a list of value pairs for each of the SNPs in the plurality of SNPs found within the predetermined distance, wherein each value pair in the list comprises a value for the distance away from the site of the SNP to be characterized and the fraction of recombinant chromosomes.

22. The method of claim 21, further comprising computing an average log likelihood (ALnLH) for the major or minor allele based on the sum of the square of the differences between a model of the fraction of recombinant chromosomes within a predetermined distance from a positively selected SNP allele and actual data for inferred fraction of recombinant chromosomes within the predetermined distance from the minor or major alleles of the SNPs to be characterized.

23. The method of claim 22, further comprising comparing the ALnLH with a predetermined value.

24. The method of claim 23, wherein the predetermined value is an average ALnLH (Av-ALnLH) value.

25. The method of claim 24, wherein the ALnLH of the minor allele and the ALnLH of the major allele are compared to the Av-ALnLH value.

26. A nucleic acid array comprising SNP probes, wherein one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNPs set selected by the method of claim 9.

27. The nucleic acid array of claim 26, wherein the one or more probes include at least 30% of non-redundant probe sequences in the array.

28. The nucleic acid array of claim 27, wherein the one or more probes include at least 70% of non-redundant probe sequences in the array.

29. The nucleic acid array of claim 27, wherein the one or more probes include all of the non-redundant probe sequences in the array.

30. A method of performing an array-based SNP assay, comprising: conducting a nucleic acid array-based assay on a nucleic acid sample from a subject, wherein the nucleic acid array is the nucleic acid array of claim 26.

31. A kit for use in screening a nucleic acid sample for the presence of the major or minor allele of one or more SNPs, said kit comprising a SNP detection reagent and a control nucleic acid sample, wherein the one or more SNPs are SNPs selected by the method of claim 9.

32. A collection of a plurality of different SNP profiles each paired with a specific nucleic acid source, wherein (i) said collection is recorded on a substrate and (ii) the SNP profile includes the genotypes of the SNPs selected according to the method of claim 9.

33. The collection of claim 32, wherein the substrate is a computer readable medium.

34. A method of characterizing a SNP comprising determining a numerical quantity for the major or minor allele of the SNP, wherein the numerical quantity is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of the SNP for the major or minor allele, wherein the analyzing is performed on a genotype that is homozygous for the major or minor allele, and at least part of the determining is performed using a computer.

35. The method of claim 34, wherein the SNP is not selected on the basis of an association with a phenotype.

36. The method of claim 34, further comprising comparing the numerical quantity determined for the major or minor allele to a predetermined value.

37. The method of claim 34, further comprising comparing the numerical quantity determined for the major allele to the numerical quantity determined for the minor allele.

38. The method of claim 34, further comprising selecting or not selecting the SNP for inclusion in a set of SNPs on the basis of value of the quantitative measure.

39. The method of claim 34, wherein a plurality of SNPs are characterized.

40. The method of claim 34, wherein the predetermined distance is such that, on average, at least an additional 50 SNPs are found in the distance.

41. The method of claim 40, wherein the predetermined distance is such that, on average, at least an additional 100 SNPs are found in the distance.

42. The method of claim 41, wherein the predetermined distance is such that, on average, at least an additional 300 SNPs are found in the distance.

43. The method of claim 34, wherein the predetermined distance is at least about 10 kilobases.

44. The method of claim 43, wherein the predetermined distance is at least about 50 kilobases.

45. The method of claim 44, wherein the predetermined distance is about 200 kilobases.

46. The method of claim 45, wherein the predetermined distance is at least about 500 kilobases.

47. The method of claim 46, wherein the predetermined distance is at least about 1000 kilobases.

48. The method of claim 34, further comprising creating a list of value pairs for each of the SNPs in the plurality of SNPs found within the predetermined distance, wherein each value pair in the list comprises a value for the distance away from the site of the SNP to be characterized and the inferred fraction of recombinant chromosomes.

49. The method of claim 48, further comprising computing an average log likelihood (ALnLH) of the major or minor allele based on the sum of the square of the differences between a model of the inferred fraction of recombinant chromosomes within a predetermined distance from a positively selected SNP allele and actual data for inferred fraction of recombinant chromosomes within the predetermined distance from the minor or major alleles of the SNPs to be characterized.

50. The method of claim 49, further comprising comparing the ALnLH with a predetermined value.

51. The method of claim 50, wherein the predetermined value is an average ALnLH (Av-ALnLH) value.

52. The method of claim 51, wherein the ALnLH of the minor allele and the ALnLH of the major allele are compared to the Av-ALnLH value.

53. A nucleic acid array comprising SNP probes, wherein one or more of the probes are nucleic acid sequences that, under high stringency hybridization conditions, selectively hybridize with and discriminate between the nucleic acid sequences of the minor or major alleles of the SNPs set selected by the method of claim 38.

54. The nucleic acid array of claim 53, wherein the one or more probes include at least 30% of non-redundant probe sequences in the array.

55. The nucleic acid array of claim 54, wherein the one or more probes include at least 70% of non-redundant probe sequences in the array.

56. The nucleic acid array of claim 55, wherein the one or more probes include all of the non-redundant probe sequences in the array.

57. A method of performing an array-based SNP assay, comprising: conducting a nucleic acid array-based assay on a nucleic acid sample from a subject, wherein the nucleic acid array is the nucleic acid array of claim 53.

58. A kit for use in screening a nucleic acid sample for the presence of the major or minor allele of one or more SNPs, said kit comprising a SNP detection reagent and a control nucleic acid sample, wherein the one or more SNPs are SNPs selected by the method of claim 38.

59. A collection of a plurality of different SNP profiles each paired with a specific nucleic acid source, wherein (i) said collection is recorded on a substrate and (ii) the SNP profile includes the genotypes of the SNPs selected according to the method of claim 38.

60. The collection of claim 59, wherein the substrate is a computer readable medium.

61. A set of SNP alleles wherein one or more of the SNP alleles are weighted by a numerical value that indicates a probability of selective pressure on the one or more SNP alleles, wherein said set is stored on a computer database.

62. The set of claim 61, wherein substantially all the SNPs in the set are assigned a numerical value.

63. The set of SNP alleles of claim 61, wherein the numerical value is determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, wherein the analyzing is performed on a genotype that is homozygous for the major or minor allele, and at least part of the determining is performed using a computer.

64. A subset of SNP alleles selected from a larger set of SNP alleles, wherein the subset of SNP alleles is selected from the larger set based on numerical values assigned to the SNP alleles in the larger set, wherein said numerical values are related to the degree of selective pressure on each of the SNP alleles, and wherein said assigning is performed at least in part using a computer.

65. The subset of SNP alleles of claim 64, wherein the numerical values are determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each SNP allele in the subset, wherein the analyzing is performed on a genotype that is homozygous for the major or minor allele, and at least part of the determining is performed using a computer.

66. The subset of SNP alleles of claim 65, wherein the subset comprises an allele having an ALnLH value that is greater than 2.6 standard deviations away from the Av-ALnLH value of the entire Perlegen SNP allele dataset or the Av-ALnLH of the entire HapMap SNP allele dataset.

67. The subset of claim 66, wherein the subset contains a SNP allele that is found within at least 100 kb of a gene selected from the group consisting of IL1RAPL2, FOLH1, KIAA1463, PRRG1, S66645/TYR, TTBK2, AK096379, BC034574, AUTS2, BX537851, COH1, COL4A6, CSEN, GALK2, GLRA2, HCN1, OPHN1, OR4A5, OR4C13, OR4X1, OR5AS1, OR8I2, OR8K1, PROM2, RAPSN, RTN1, SNTG1, PSMC3, SERPINC1, ADK, AF035029, AF035035, AP3M1, CSMD3, D63480, F8, FLJ42925, LRBA, SCAP1, SPI1, SUPT3H, NCOA6, AK091585, AK131264, AK131417, AK097440, AK131413, ECM2, POTE8, RNF18, STS, SLC39A13, ZBTB37, ZNF192, ZNF193, BRODL, CSE1L, FANCC, PARP4, PNUTL2, TP53INP2, AF318371, BC046415, CPEB3, DJ467N11.1, HGNT-IV-H, HKR1, KHDRBS2, MGC46496, NT5C2, RAD51C, ZCWCC2, ZNF2, ZNF322A, ZNF37A, ZNF514, ZNF569, ZNF570, C15orf16, C6.1A, MAST2, PPM1E, UBR1, AB037807/CCM1, FLJ14442, HDHD1A, EVIMP2L, LOC220594, MTMR4, TEX14, AF318346, AKAP9, CDC91L1, FUNDC2, MAGED1, SFI1, AB058732, AF116680, AK090675, AK125992, ASCC3, BC027488, C10orf68, CPNE8, FAM46D, FAM47A, FLJ32191, FLJ33979, FLJ46156, GBA, GPR23, HPSE2, KIAA0377, LOC492307, LOC51057, MAGEB6, MGC12197, MGC35232, MTCH2, OCIAD1, RABGAP1L, SH3BGRL, VCL.

68. The subset of claim 66, wherein the subset contains at least 10 SNP alleles found within the 100 kb of the gene.

69. The subset of claim 68, wherein the subset contains at least 500 SNP alleles found within the 100 kb of the gene.

70. The subset of claim 69, wherein the subset contains at least 1500 SNP alleles found within the 100 kb of the gene.

71. The subset of claim 66, wherein at least 10% of the alleles in the subset are found within 100 kb of the gene.

72. The subset of claim 71, wherein at least 50% of the alleles in the subset are found within 100 kb of the gene.

73. The subset of claim 72, wherein all of the alleles in the subset are found within 100 kb of the gene.

74. A method of determining a subset of SNPs with predictive value for a phenotype in a set of SNPs comprising

(i) determining on case and control samples the relative frequency of one or more alleles of one or more SNPs, wherein said SNPs are SNPs determined to have a major or minor allele that has a high probability of having been subjected to selective pressure;

(ii) for each SNP for which a frequency is determined in step (i), comparing the frequency of the occurrence of the SNP in the case population and in the control population; and

(iii) selecting for inclusion in the subset of SNPs those SNPs for which a major or minor allele frequency estimate in the case samples differs from the major or minor allele frequency in the control samples by at least 1.5 standard deviations; wherein said method is carried out at least in part using a computer.

75. The method of claim 74, wherein in step (i) the major or minor alleles are determined to have a high probability of having been subjected to selective pressure by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, wherein the analyzing is performed on a genotype that is selected for analysis on the basis of homozygosity for the major or minor allele.

76. A method of determining a predictive value for a phenotype for a SNP in a set of SNPs, wherein the SNPs in the set of SNPs each have been assigned a numerical value based on the probability of selective pressure on that SNP, comprising

(i) determining on case and control samples the relative frequency of the major and minor allele of the SNP and comparing the frequency of the occurrence of the major and minor allele of the SNP in the case population and in the control population; and

(ii) combining the results of step (i) with the numerical value for the SNP to assign to the SNP a predictive value for the phenotype; wherein said method is carried out at least in part using a computer.

77. The method of claim 76, wherein the probability of selective pressure has been determined by analyzing inferred frequency of recombination of a plurality of SNPs within a predetermined distance from the site of each of the SNPs, wherein the analyzing is performed on a genotype that is selected for analysis on the basis of homozygosity for the major or minor allele.

78. The method of claim 76 or 77, wherein the case samples are from individuals with one or more phenotypic characteristics of a pathological condition.

79. The method of claim 78, wherein the pathological condition is a neurodegenerative disease, a psychiatric disease, a metabolic disorder, a cardiovascular disease, an infectious disease, or a cancer.

80. The method of claim 79, wherein the pathological condition is a neurodegenerative disease.

81. The method of claim 80, wherein the neurodegenerative disease is Alzheimer's disease, Pick's disease, Lewy body dementia, or corticobasal degeneration.

82. The method of claim 81, wherein the neurodegenerative disorder is Alzheimer's disease.

83. A method of determining a diagnosis, prognosis, or status of treatment for an individual, comprising

(i) determining the identity of the alleles for a SNP from a sample obtained from the individual, wherein each allele of the SNP has been assigned a value based on a quantitative measure of the probability of selective pressure on the SNP, wherein said identity and/or said value are determined by a method that comprises the use of a computer; and

(ii) determining a diagnosis for the individual based on the identity of the allele or alleles for the SNP, wherein said method is carried out at least in part using a computer.

84. The method of claim 83, wherein the SNP is assigned a weighted value based on the quantitative measure of selective pressure on the SNP, and said diagnosis is based on an analysis of a combination of the weighted value and the identity of the allele or alleles for the SNP, wherein said method is carried out at least in part using a computer.

85. The method of claim 83 or 84, wherein the identities of the alleles for a plurality of SNPs are determined, wherein each allele of each of the SNPs has been assigned a value based on a quantitative measure of selective pressure on the SNP, and a diagnosis is determined based on the identity of the alleles for said plurality of SNPs.

86. A method of stratifying a population of individuals, wherein said stratification is based on likelihood of exhibiting a phenotype, and wherein the method comprises

(i) determining the identity of the alleles for a SNP for the individuals, wherein each allele of the SNP has been assigned a value based on a quantitative measure of selective pressure on the SNP; and

(ii) determining the position for the individual in the stratification based on the identity of the allele or alleles for the SNP, wherein said method is carried out at least in part using a computer.

87. The method of claim 86, wherein the identities of the alleles for a plurality of SNPs are determined for the individual, wherein each allele of each of the SNPs has been assigned a value based on a quantitative measure of selective pressure on the SNP, and the position for the individual in the stratification is determined based on the identity of the alleles for said plurality of SNPs.

88. The method of claim 86, wherein said phenotype is response to a treatment.

89. The method of claim 88, wherein said treatment comprises administration of a drug.

90. The method of claim 89, wherein the response comprises a therapeutic response to administration of the drug.

91. The method of claim 90, wherein the response comprises a side effect of the drug.

92. The method of claim 91, wherein the side effect is a negative side effect.

93. A business method comprising: a) collecting more than 10 case samples representing a clinical phenotypic state and more than 10 control samples representing patients without said clinical phenotypic state; b) detecting in each sample the presence or absence of one or more SNP alleles selected from the subset of SNP alleles of claim 64 or selected by the method of claim 9 or claim 38; c) identifying representative patterns of the occurrence of the selected SNP alleles that distinguish datasets from case samples and control samples; d) marketing diagnostic products that use said representative patterns to identify said phenotypic state or a predisposition to said phenotypic state with a disposable device; and e) selling said disposable device.

94. The method of claim 93, wherein said products are marketed in a clinical reference laboratory.

95. The method of claim 93, wherein the marketing step markets kits.

96. The method of claim 95, wherein said kits are FDA approved kits.

97. The method of claim 93, wherein said phenotypic state is a drug response phenotype and further comprising the step of collecting a royalty on said drug.

98. The method of claim 93, further comprising the step of collecting said samples in collaboration with a collaborator.

99. The method of claim 98, wherein said collaborator is an academic collaborator.

100. The method of claim 98, wherein said collaborator is a pharmaceutical company.

101. The method of claim 100, wherein said pharmaceutical company collects said samples in a clinical trial.

102. The method of claim 101, wherein said patterns are used to segregate a drug response phenotype.

103. The method of claim 102, further comprising the step of collecting royalties on said drug.

104. The method of claim 100, wherein the step of marketing diagnostic products is performed by the same company as the company performing the identifying step.

105. The method of claim 93, wherein at least 50 of the selected SNP alleles are detected.

106. The method of claim 105, wherein at least 100 of the selected SNP alleles are detected.

107. The method of claim 106, wherein at least 500 of the selected SNP alleles are detected.

108. The method of claim 104, wherein at least 1000 of the selected SNP alleles are detected.

109. The method of claim 108, wherein at least 2000 of the selected SNP alleles are detected.

110. The method of claim 109, wherein at least 10,000 of the selected SNP alleles are detected.

111. The method of claim 93, wherein said marketing step markets a nucleic acid array detection system used to identify said representative states in patient samples.

112. The method of claim 93, wherein more than 50 of said case samples and 50 of said control samples are used.

113. The method of claim 93, wherein more than 100 of said case samples and 100 of said control samples are used.

114. The method as recited in claim 93, wherein said diagnostic products use a nucleic acid array platform.

115. The method of claim 93, wherein said diagnostic products are marketed with a nucleic acid array.

116. The method of claim 93, wherein said diagnostic products are marketed by a diagnostic partner.

117. The method of claim 93, wherein said phenotype is a drug response phenotype.

118. The method of claim 93, wherein said phenotype is a drug resistance phenotype.

119. The method as recited in claim 1 wherein said phenotype is a disease stage phenotype.

120. The method of claim 93, wherein said phenotype is a disease state phenotype.

121. The method of claim 93, wherein said phenotype is a treatment selection phenotype.

122. The method of claim 93, wherein said phenotype is a disease diagnostic phenotype.

123. The method of claim 93, wherein said phenotype is a drug toxicity phenotype.

124. The method of claim 93, wherein said phenotype is an adverse drug response phenotype.

125. The method of claim 93, wherein revenue is derived from sales of nucleic acid arrays, informatics tools, patterns and/or computer programs for classifying samples and/or from services that provide diagnostic information and/or pattern discovery and/or validation.
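
By way of non-limiting illustration only, the case/control comparison recited in claim 74 (estimating allele frequencies in case and control samples and retaining SNPs whose frequencies differ by at least 1.5 standard deviations) might be sketched in Python as follows. The claim does not say which standard deviation is meant; the sketch assumes the unpooled standard error of the difference between two proportions, which is only one possible reading. The function names and the genotype encoding (two-letter strings such as "AG") are hypothetical, and the input is assumed to contain only SNPs already determined to have an allele with a high probability of having been subjected to selective pressure.

```python
import math
from typing import Dict, List, Sequence


def allele_frequency(genotypes: Sequence[str], allele: str) -> float:
    """Fraction of chromosomes carrying `allele` (two chromosomes per genotype)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    return sum(g.count(allele) for g in genotypes) / (2 * len(genotypes))


def frequency_difference_in_sd(case: Sequence[str], control: Sequence[str],
                               allele: str) -> float:
    """Case-minus-control allele frequency, expressed in standard deviations."""
    p_case = allele_frequency(case, allele)
    p_ctrl = allele_frequency(control, allele)
    # ASSUMPTION: "standard deviation" is read as the unpooled standard error
    # of the difference between two binomial proportions.
    se = math.sqrt(p_case * (1 - p_case) / (2 * len(case))
                   + p_ctrl * (1 - p_ctrl) / (2 * len(control)))
    diff = p_case - p_ctrl
    if se == 0.0:
        # Both frequencies are exactly 0 or 1: any difference is unbounded.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def select_snps(case_by_snp: Dict[str, Sequence[str]],
                control_by_snp: Dict[str, Sequence[str]],
                allele_by_snp: Dict[str, str],
                threshold: float = 1.5) -> List[str]:
    """Keep SNPs whose allele frequency differs by at least `threshold` SDs."""
    selected = []
    for snp, allele in allele_by_snp.items():
        z = frequency_difference_in_sd(case_by_snp[snp], control_by_snp[snp], allele)
        if abs(z) >= threshold:
            selected.append(snp)
    return selected
```

The value returned by `frequency_difference_in_sd` could also serve as one input to a weighted predictive value of the kind recited in claim 76, although how the two quantities are combined is not specified here.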