CA2731830A1 - Method of characterizing sequences from genetic material samples - Google Patents


Publication number
CA2731830A1
Authority
CA
Canada
Prior art keywords
genetic material
snp
sample
int
snps
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA2731830A
Other languages
French (fr)
Inventor
David Craig
Nils Homer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of California
Translational Genomics Research Institute TGen
Original Assignee
University of California
Translational Genomics Research Institute TGen
Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

Among other aspects, a method is disclosed that uses Single Nucleotide Polymorphism (SNP) genotyping microarrays to resolve whether genetic material (such as genomic DNA) derived from a particular individual is present in a genetic material mixture (such as a complex genomic DNA mixture). Furthermore, it is demonstrated that the identification of the presence of genetic material (such as genomic DNA) of specific individuals within a series of complex genomic mixtures is possible.

Description

TGEN.001 VPC PATENT
METHOD OF CHARACTERIZING SEQUENCES FROM GENETIC MATERIAL SAMPLES

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims priority to U.S. Provisional Application No. 61/082,912, filed July 23, 2008, which is hereby incorporated by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED R&D
[0002] The US government retains certain rights in this invention as provided by the terms of grant number 5U01HL086528 awarded by the National Institutes of Health.

COPYRIGHT NOTICE
[0003] A portion of the disclosure of this patent document contains material which is subject to (copyright or mask work) protection. The (copyright or mask work) owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all (copyright or mask work) rights whatsoever.

FIELD OF THE INVENTION

[0005] The present disclosure relates to systems and methods for using multiple single nucleotide polymorphisms (SNPs) for characterizing genetic material in a sample.

BACKGROUND OF THE INVENTION

[0006] Resolving whether an individual's genetic material is present within a complex mixture containing genetic material (such as DNA) from numerous individuals is of interest to multiple fields. For example, within forensics, determining whether a person contributed their genetic material to a mixture is typically a skilled process. In large part, forensically identifying whether a person is contributing less than 10% of the total genomic DNA to a mixture is not easily done, is difficult to automate, and is highly confounded with the inclusion of more individuals.
[0007] Numerous methods examining DNA mixtures currently exist, most of these addressing mixtures with smaller numbers of individuals within forensics studies (See Egeland, T., Dalen, I. & Mostad, P.F. Estimating the number of contributors to a DNA profile. Int J Legal Med 117, 271-275 (2003); Hu, Y.Q. & Fung, W.K. Interpreting DNA mixtures with the presence of relatives. Int J Legal Med 117, 39-45 (2003); and Balding, D.J. Likelihood-based inference for genetic correlation coefficients. Theor Popul Biol 63, 221-230 (2003)). Using short tandem repeats (STR) is a common method to generate DNA genotyping profiles and allows for identification of the various alleles and their relative quantity within the mixture (See Clayton, T.M., Whitaker, J.P., Sparkes, R. & Gill, P. Analysis and interpretation of mixed forensic stains using DNA STR profiling. Forensic Sci Int 91, 55-70 (1998); Cowell, R.G., Lauritzen, S.L. & Mortera, J. Identification and separation of DNA mixtures using peak area information. Forensic Sci Int 166, 28-34 (2007); Pearson, J.V. et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet 80, 126-139 (2007); and Bill, M. et al. PENDULUM--a guideline-based approach to the interpretation of STR mixtures. Forensic Sci Int 148, 181-189 (2005)). Frequently, STRs on the Y chromosome are useful when resolving the male components of the mixture (See Jobling, M.A. & Gill, P. Encoded evidence: DNA in forensic analysis. Nat Rev Genet 5, 739-751 (2004)). Nevertheless, these methods based on STRs expectedly suffer from limited power when using severely degraded DNA (See Jobling, M.A. & Gill, P. Encoded evidence: DNA in forensic analysis. Nat Rev Genet 5, 739-751 (2004); and Ladd, C., Lee, H.C., Yang, N. & Bieber, F.R. Interpretation of complex forensic DNA mixtures. Croat Med J 42, (2001)). Mitochondrial DNA (mtDNA) based on hypervariable region sequencing is useful when analyzing degraded DNA due to its high copy number and improved stability. Profiles derived from mtDNA can also be combined with STR analysis to achieve better identification (See Goodwin, W., Linacre, A. & Vanezis, P. The use of mitochondrial DNA and short tandem repeat typing in the identification of air crash victims. Electrophoresis 20, 1707-1711 (1999)). Nonetheless, mtDNA has weaknesses, including the uniparental mode of inheritance and lower discrimination power that can be moderately mediated by using the whole mitochondrial genome or known surrounding single nucleotide polymorphisms (SNPs) (See Coble, M.D. et al. Single nucleotide polymorphisms over the entire mtDNA genome that increase the power of forensic testing in Caucasians. Int J Legal Med 118, 137-146 (2004) and Parsons, T.J. & Coble, M.D. Increasing the forensic discrimination of mitochondrial DNA testing through analysis of the entire mitochondrial DNA genome. Croat Med J 42, 304-309 (2001)). Informative SNPs have been used to help resolve problems with using mtDNA (See Coble, M.D. et al. Single nucleotide polymorphisms over the entire mtDNA genome that increase the power of forensic testing in Caucasians. Int J Legal Med 118, 137-146 (2004); Just, R.S. et al. Toward increased utility of mtDNA in forensic identifications. Forensic Sci Int 146 Suppl, S147-149 (2004); and Vallone, P.M., Just, R.S., Coble, M.D., Butler, J.M. & Parsons, T.J. A multiplex allele-specific primer extension assay for forensically informative SNPs distributed throughout the mitochondrial genome. Int J Legal Med 118, 147-157 (2004)) but have not been used wholly or separately as the discriminatory factor, or on the same scale as provided herein.
[0008] Aspects and applications of the invention presented here are described below in the drawings and detailed description of the invention.

SUMMARY OF THE INVENTION

[0009] Some of the present embodiments provide a variety of methods (and apparatuses for implementing these methods) for determining if a subject's genetic material is present in a genetic material sample (a "test genetic material sample"). While there are a variety of techniques by which this can be achieved, in some embodiments, this is achieved by determining if there is a bias and/or direction of an allele occurrence and/or frequency within a collection of single nucleotide polymorphisms (SNPs) of the test genetic material sample relative to a reference and/or the subject's SNP signature or collection of SNP genotypes.
[0010] In some embodiments, a system for determining if a subject contributed genetic material to a sample is provided. The system can comprise an input module configured to allow the input of one or more of a sample SNP signature, a reference SNP signature, and a subject SNP signature; a module configured to determine a bias of an allele frequency within SNPs of the sample SNP signature relative to the reference SNP signature and the subject SNP signature; and a module configured to output the bias, wherein one or more of the modules is executed on a computing device.
[0011] In some embodiments, a method for determining if a person of interest contributed genetic material to a test genetic material sample is provided. The method can comprise determining a bias of an allele frequency within SNPs of the test genetic material sample relative to a reference and a subject's SNP signature.
[0012] In some embodiments, a method of characterizing a test genetic material sample to determine if a person of interest's ("POI's") genetic material is within the test genetic material sample is provided. The method can comprise providing a SNP analysis of the test genetic material sample; providing a SNP analysis of a reference genetic material sample; providing a SNP analysis of a POI's genetic material; in a first comparison, comparing the SNP analysis of the test genetic material sample to the SNP analysis of the POI's genetic material; in a second comparison, comparing the SNP analysis of the reference genetic material to the SNP analysis of the POI's genetic material; and comparing the first and second comparisons, thereby determining if the POI's genetic material is likely in the test genetic material sample.
[0013] In some embodiments, a method of characterizing a test genetic material sample is provided. The method can comprise providing a first allele frequency for a SNP for a person of interest (POI); providing a second allele frequency for the SNP from a reference population(s) of genetic material; providing a third allele frequency for the SNP for the test genetic material sample; repeating the above processes for at least 10 different SNPs; and analyzing the first, second, and third allele frequencies to characterize the test genetic material sample.
[0014] In some embodiments, a method for determining a likelihood that a subject contributed genetic material to a test genetic material sample is provided. The method can comprise providing a test genetic material sample; performing a single nucleotide polymorphism analysis on the test genetic material sample, whereby at least 50 different single nucleotide polymorphisms in said test genetic material sample are analyzed, thereby creating a sample SNP signature; and comparing the sample SNP signature to a subject's SNP signature to determine a likelihood that the subject contributed genetic material to the test genetic material sample.
[0015] Previously, within the field of forensics, as well as the field of human genetics, there was a base assumption that it was not possible to identify individuals using pooled data (e.g. allele frequency) from SNP data. Some of the embodiments provided herein provide methods of using hundreds or thousands of SNPs (optionally assayed on a high-density microarray) to resolve trace contributions of DNA (or other genetic material) to a complex mixture. In some embodiments, this can specifically exploit raw allele intensity measures in the analysis of DNA with mixed samples and a genotype calling algorithm to digitize the inherently analog information derived from an SNP assay (See, e.g., Kennedy, G.C. et al. Large-scale genotyping of complex DNA. Nat Biotechnol 21, 1233-1237 (2003)).
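By way of non-limiting illustration, the contrast between an analog allele intensity and a digitized genotype call can be sketched in Python. This is a minimal sketch under stated assumptions, not the genotype calling algorithm of any particular array platform: the ratio A / (A + B) as a frequency estimate, the function names, and the 0.25 and 0.75 thresholds are all hypothetical choices made for the example.

```python
def allele_frequency(a_intensity, b_intensity):
    """Estimate the A-allele frequency at one SNP from raw probe intensities.

    Assumption: the ratio A / (A + B) is used as the analog estimate.
    """
    total = a_intensity + b_intensity
    if total <= 0:
        raise ValueError("combined intensity must be positive")
    return a_intensity / total


def call_genotype(frequency, low=0.25, high=0.75):
    """Digitize an analog frequency into AA, AB, or BB.

    The thresholds are hypothetical and chosen only for illustration.
    """
    if frequency >= high:
        return "AA"
    if frequency <= low:
        return "BB"
    return "AB"


# A roughly heterozygous signal shifted slightly toward allele A,
# as a trace contributor to a mixture might cause.
mixture_frequency = allele_frequency(530.0, 470.0)
print(mixture_frequency, call_genotype(mixture_frequency))
```

In this sketch the small shift away from 0.5 survives in the analog frequency but disappears once the value is digitized to a genotype call, which is why a method working on allele frequencies can retain information about trace contributors.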
[0016] In some embodiments, the invention relates generally to single nucleotide polymorphism genotyping and more specifically to single nucleotide polymorphism genotyping of samples from multiple individuals and/or sources.
[0017] In some embodiments, the method comprises a sample SNP signature that is from a biopsy from a subject, wherein the biopsy from the subject is to be tested for the presence of a cancer. In some embodiments, the sample SNP signature is created from a female who wants to determine if she is pregnant. In some embodiments, the subject's SNP signature is a viral DNA signature.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0018] A more complete understanding of various embodiments of the present inventions can be derived by referring to the detailed description when considered in connection with the following illustrative figures. In the figures, like reference numbers refer to like elements or acts throughout the figures.
[0019] FIG. 1A. To give insight into the intuition behind some embodiments of the various methods, three different scenarios are presented per SNP for the possible allele frequency of the person of interest, corresponding to the genotypes AA, AB, and BB. The allele frequencies of the mixture (test genetic material sample), the person of interest (subject), and the reference population are denoted M_i, Y_i, and Pop_i, respectively. The distance measure is greater (and positive) when the Y_i of the person of interest is closer to the M_i of the mixture than to the Pop_i of the reference population. Similarly, the distance measure is smaller (and negative) when the Y_i of the person of interest is closer to the Pop_i of the reference population than to the M_i of the mixture. The test statistic is then the z-score using this distance measure.
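By way of non-limiting illustration, the distance measure and z-score described for FIG. 1A can be sketched in Python. The absolute-difference form of the distance and the null mean of zero are assumptions chosen to match the sign behavior described above (positive when the subject is closer to the mixture, negative when closer to the reference). The function names are hypothetical.

```python
import math
import statistics


def distance(y, m, pop):
    """Per-SNP distance: positive when y is closer to m than to pop.

    y   : allele frequency of the person of interest (0, 0.5, or 1)
    m   : allele frequency observed in the mixture
    pop : allele frequency in the reference population
    """
    return abs(y - pop) - abs(y - m)


def z_score(y_freqs, m_freqs, pop_freqs):
    """Z-score of the mean per-SNP distance, assuming a null mean of zero."""
    distances = [distance(y, m, pop)
                 for y, m, pop in zip(y_freqs, m_freqs, pop_freqs)]
    if len(distances) < 2:
        raise ValueError("at least two SNPs are needed to estimate variance")
    spread = statistics.stdev(distances)
    if spread == 0:
        raise ValueError("distances have zero variance")
    return statistics.fmean(distances) / (spread / math.sqrt(len(distances)))


# Toy data: the mixture leans toward the subject's genotype at three of four SNPs.
subject = [1.0, 0.5, 0.0, 1.0]
mixture = [0.62, 0.50, 0.38, 0.70]
reference = [0.60, 0.50, 0.40, 0.68]
print(z_score(subject, mixture, reference))
```

A large positive value suggests that the subject's genotypes pull the mixture away from the reference; a value near zero or below gives no such indication. Real use would involve many more SNPs than the four shown here.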
[0020] FIG. 1B is a flow chart depicting various possible processes involved in some embodiments described herein.
[0021] FIGS. 2A - 2C depict various simulation results: Using 1423 Wellcome Trust 58C individuals, log scaled p-values were given from simulations based off of three variables: the number of SNPs (s), the fraction of the individual in the mixture (f), and the probe variance (vp). The graphs plot the relationships between the three variables with a different variable fixed in each graph. The log scaled p-values are represented by the shading of each point in the graph, as well as the z-axis on the right graphs. These simulations indicate that one can resolve mixtures where a given individual is 0.1% of the mixture (f), probe variance is at most 0.01 (vp) and the number of SNPs probed is 50,000 (s).
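By way of non-limiting illustration, the interplay of the number of SNPs (s), the mixture fraction (f), and the probe variance (vp) can be explored with a toy simulation in Python. This sketch is not the simulation used to produce FIGS. 2A - 2C. It assumes reference allele frequencies drawn uniformly between 0.05 and 0.95, genotypes drawn as two independent alleles, probe variance modeled as Gaussian noise added to the mixture frequency, and an absolute-difference distance measure. The function name is hypothetical.

```python
import math
import random
import statistics


def simulate_z(s, f, vp, in_mixture, seed=0):
    """Return a z-score for one simulated person of interest.

    s          : number of SNPs
    f          : fraction of the mixture contributed by the person
    vp         : probe variance (assumed Gaussian noise on the mixture frequency)
    in_mixture : whether the person actually contributes to the mixture
    """
    if s < 2 or vp <= 0:
        raise ValueError("need s >= 2 and vp > 0")
    rng = random.Random(seed)
    distances = []
    for _ in range(s):
        pop = rng.uniform(0.05, 0.95)  # reference frequency (assumed range)
        # Genotype as two independent allele draws, giving 0, 0.5, or 1.
        y = ((rng.random() < pop) + (rng.random() < pop)) / 2
        expected_m = f * y + (1 - f) * pop if in_mixture else pop
        m = expected_m + rng.gauss(0.0, math.sqrt(vp))
        distances.append(abs(y - pop) - abs(y - m))
    standard_error = statistics.stdev(distances) / math.sqrt(s)
    return statistics.fmean(distances) / standard_error


z_present = simulate_z(s=5000, f=0.05, vp=0.0001, in_mixture=True, seed=1)
z_absent = simulate_z(s=5000, f=0.05, vp=0.0001, in_mixture=False, seed=1)
print(z_present, z_absent)
```

Under these assumptions the statistic for a contributing individual grows roughly with the square root of s and with f, and shrinks as vp increases, which is the qualitative relationship the figures plot.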
[0022] FIGS. 3A - 3D provide the results from a series of experiments. Experimental validation used a series of mixtures (see Table 1, A-F) assayed on the Affymetrix GeneChip 5.0, Illumina BeadArray 550 and the Illumina 450S Duo Human BeadChip. The x-axis shows each individual in the CEU HapMap population, the left y-axis shows the p-value (log scaled), and the right y-axis shows the value of the test statistic. With regard to mixtures A, B, E and F, those in the mixture are shaded light and identified, and those not in the mixture are shaded darker and identified. With regard to mixtures C and D, those individuals who are not in the mixtures are shaded darkly and identified, those individuals who are related to the 1% or 10% individuals in the mixtures are shaded lighter and identified as "1-10", those individuals who are related to the 90% or 99% individuals are shaded lighter still and identified as "90-99", and those people in the mixture are shaded lighter than those absent from the mixture and are identified. In all mixtures, the identification of the presence of a person's genomic DNA was possible. An arrow denotes identification of numerous data points (or a cluster of data points), while a line denotes identification of a specific data point. Unless otherwise specified, an unmarked data point is part of the closest denoted cluster.
[0023] Elements and acts in the figures are illustrated for simplicity and have not necessarily been rendered according to any particular sequence or embodiment.
DETAILED DESCRIPTION OF THE INVENTION

[0024] In the following description, and for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various aspects of the invention. It will be understood, however, by those skilled in the relevant arts, that the present embodiments can be practiced without these specific details. In other instances, known structures and devices are shown or discussed more generally in order to avoid obscuring the invention. In many cases, a description of the operation is sufficient to enable one to implement the various forms of the invention, particularly when the operation is to be implemented in software. It should be noted that there are many different and alternative configurations, devices and technologies to which the disclosed inventions may be applied. The full scope of the various embodiments and the inventions themselves are not limited to the examples that are described below. The present application is being filed along with a computer program listing appendix, which appears prior to the claims.
[0025] The present disclosure provides a variety of methods (and apparatuses for implementing these methods) for determining if a subject's genetic material is present in a genetic material sample (a "test genetic material sample"). While there are a variety of techniques by which this can be achieved, in some embodiments, this is achieved by determining if there is a bias and/or direction of an allele occurrence and/or frequency within SNPs of the test genetic material sample relative to a reference and/or the subject's SNP signature (e.g., SNP genotype). Among other aspects provided herein is a method describing the use of Single Nucleotide Polymorphism (SNP) genotyping microarrays to resolve whether genetic material (such as genomic DNA) derived from a particular individual is present in a genetic material mixture (such as a complex genomic DNA mixture). Furthermore, the results presented herein demonstrate that the identification of the presence of genetic material (such as genomic DNA) of specific individuals within a series of highly complex genomic mixtures, including mixtures where an individual contributes less than 0.1% of the total genetic material (such as genomic DNA), is possible. These findings shift the perceived utility of SNPs in the identification of individual trace contributors within a forensics mixture and demonstrate the viability of DNA sources previously considered sub-optimal due to sample contamination. These findings also indicate that composite statistics across cohorts, such as allele frequency or genotype counts, do not mask identity within genome-wide association studies.
[0026] While SNPs and high-density SNP genotyping arrays have been around for some time, they have predominantly been developed as tools geneticists use to identify common genetic variants that predispose an individual to disease. Some embodiments disclosed herein allow for the use of SNPs to identify the presence or absence of one or more individuals' genetic material in a sample.
[0027] In some embodiments, the SNP based analysis can be used for analyzing forensic mixtures. SNPs are traditionally analyzed by genotype (e.g. AA, AT, or TT) and, prior to the present disclosure, were thought to be non-ideal in resolving mixtures. It has been argued that their poor performance in the analysis of mixed DNA samples is one of the primary reasons SNP genotyping arrays have not become adopted by the forensics community (See Jobling, M.A. & Gill, P. Encoded evidence: DNA in forensic analysis. Nat Rev Genet 5, 739-751 (2004) and Kidd, K.K. et al. Developing a SNP panel for forensic identification of individuals. Forensic Sci Int 164, 20-32 (2006)). Other methods have employed match probability estimation after inferring genotypes using STRs where the probability of two unrelated individuals sharing a combination of markers is assessed (See Jobling, M.A. & Gill, P. Encoded evidence: DNA in forensic analysis. Nat Rev Genet 5, 739-751 (2004)). Exclusion probabilities give a calculation based on the probability of excluding a random individual (See Chakraborty, R., Meagher, T.R. & Smouse, P.E. Parentage analysis with genetic markers in natural populations. I. The expected proportion of offspring with unambiguous paternity. Genetics 118, 527-536 (1988)). Nevertheless, many of these methods rely on assuming the number of individuals in the mixture (See Egeland, T., Dalen, I. & Mostad, P.F. Estimating the number of contributors to a DNA profile. Int J Legal Med 117, (2003)) and have been applied only to STR markers. In some embodiments, one need not know or estimate the number of individuals that contributed to a mixture when using the methods disclosed herein.
[0028] Likelihood ratios are commonly used when testing which hypothesis is favored by the evidence or DNA samples (See Weir, B.S. et al. Interpreting DNA mixtures. J Forensic Sci 42, 213-222 (1997)). In some embodiments, one can compute the likelihood ratio of two hypotheses: the individual contributes to the mixture and the individual does not contribute to the mixture. In some embodiments, the proper prior odds ratio can then be given based on the current situation or context, and then would be combined with the likelihood ratio to give a posterior odds ratio. In some embodiments, one can then use SNP microarrays to determine allele frequencies or allele counts. This is especially advantageous since training datasets such as from the HapMap Project or 1000 Genomes project are readily available and could be used to calculate the probability of the observed mixture's allele frequency or individual of interest's genotype. In some embodiments, the Bayesian approach includes creation of explicit hypotheses, estimation of the total fraction of the individual of interest that contributes to the mixture, inclusion of multiple ancestral backgrounds across ancestrally informative SNPs, and inclusion of the possibility that related individuals are within the mixture.
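By way of non-limiting illustration, the combination of a prior odds ratio with a likelihood ratio can be sketched in Python using the odds form of Bayes' rule, in which the posterior odds equal the prior odds multiplied by the likelihood ratio. The function names and the numerical values are hypothetical and are not taken from the present disclosure.

```python
import math


def posterior_odds(prior_odds, likelihood_ratio):
    """Posterior odds that the individual contributes to the mixture.

    prior_odds       : odds of contribution before seeing the SNP data
    likelihood_ratio : P(data | contributes) / P(data | does not contribute)
    """
    if prior_odds < 0 or likelihood_ratio < 0:
        raise ValueError("odds and likelihood ratios must be non-negative")
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds in favor of a hypothesis into a probability."""
    return odds / (1.0 + odds)


# Hypothetical values: prior odds of 1 to 100 and a likelihood ratio of 1000.
odds = posterior_odds(prior_odds=0.01, likelihood_ratio=1000.0)
print(odds, odds_to_probability(odds))
assert math.isclose(odds, 10.0)
```

Keeping the prior odds as a separate input reflects the statement above that the prior depends on the situation or context, while the likelihood ratio is the part computed from the SNP data.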
[0029] The present disclosure presents a detailed description of some of the various embodiments noted above, as well as additional embodiments. The following section briefly outlines some of the various terms, and is followed by a more detailed description of some of the proof of principle and exemplary embodiments for some of the techniques. Following this section is a selection of various additional embodiments for the various components and/or parts of some of the embodiments, which is followed by a set of examples for some of the various embodiments.

DEFINITIONS
[0030] The section headings used herein are for organizational purposes only and are not to be construed as limiting the described subject matter in any way. All literature and similar materials cited in this application, including but not limited to, patents, patent applications, articles, books, treatises, and internet web pages are expressly incorporated by reference in their entirety for any purpose. When definitions of terms in incorporated references appear to differ from the definitions provided in the present teachings, the definition provided in the present teachings shall control. It will be appreciated that there is an implied "about" prior to the temperatures, concentrations, times, etc. discussed in the present teachings, such that slight and insubstantial deviations are within the scope of the present teachings herein. In this application, the use of the singular includes the plural unless specifically stated otherwise. Also, the use of "comprise", "comprises", "comprising", "contain", "contains", "containing", "include", "includes", and "including" are not intended to be limiting. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention. The term "and/or" denotes that the provided possibilities can be used together or be used in the alternative. Thus, the term "and/or" denotes that both options exist for that set of possibilities.
[0031] Unless otherwise defined, scientific and technical terms used in connection with the invention described herein shall have the meanings that are commonly understood by those of ordinary skill in the art. Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular. Generally, nomenclatures utilized in connection with, and techniques of, cell and tissue culture, molecular biology, and protein and oligo- or polynucleotide chemistry and hybridization described herein are those well known and commonly used in the art. Standard techniques are used, for example, for genetic material (nucleic acid) purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to manufacturer's specifications or as commonly accomplished in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000). The nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein, are those well known and commonly used in the art.
[0032] The inventors are fully aware that they can be their own lexicographers if desired. The inventors expressly elect, as their own lexicographers, to use only the plain and ordinary meaning of terms in the specification and claims unless they clearly state otherwise and then further, expressly set forth the "special" definition of that term and explain how it differs from the plain and ordinary meaning. Absent such clear statements of intent to apply a "special" definition, it is the inventors' intent and desire that the simple, plain and ordinary meaning to the terms be applied to the interpretation of the specification and claims.
[0033] As utilized in accordance with the embodiments provided herein, the following terms, unless otherwise indicated, shall be understood to have the following meanings:
[0034] The term "genetic material" refers to natural nucleic acids, artificial nucleic acids, non-natural nucleic acid, orthogonal nucleotides, analogs thereof, or combinations thereof. Genetic material can also include analogs of DNA or RNA having modifications to either the bases or the backbone. For example, genetic material, as used herein, includes the use of peptide nucleic acids (PNA). The term "genetic material" also includes chimeric molecules. The genetic material can include, consist, or consist essentially of a nucleic acid of one or more strands of single and/or double stranded material. Genetic material from a subject is generally (unless noted otherwise) numerous strands and numerous genes, and in some embodiments, can include the entire genome of the subject. In some embodiments, genetic material comprises, consists or consists essentially of nucleic acids.
[0035] In some embodiments, the genetic material is from a subject whose presence or absence in a test genetic material sample is to be determined. Exemplary genetic materials include DNA, RNA, mRNA, and miRNA. In some embodiments, the genetic material and/or the test genetic material sample comprises, consists, or consists essentially of DNA, RNA, mRNA, miRNA, and any combination thereof. In some embodiments, the genetic material is contained within the test genetic material sample. In other embodiments, the genetic material is not contained within the test genetic material sample. The genetic material can be one or more strands. In some embodiments, the target genetic material comprises a representative selection of nucleic acids. In some embodiments, the target genetic material comprises a genome wide selection of nucleic acids. Unless explicitly noted otherwise, the term "genetic material" can be singular and/or plural (that is, "genetic material" can, for example, denote genetic material from one or more sources).
[0036] As used herein, the terms "polynucleotide," "oligonucleotide," and "nucleic acid oligomers" are used interchangeably and mean single-stranded and double-stranded polymers of nucleic acids, including, but not limited to, 2'-deoxyribonucleotides (DNA) and ribonucleotides (RNA) linked by internucleotide phosphodiester bond linkages, e.g. 3'-5' and 2'-5', inverted linkages, e.g. 3'-3' and 5'-5', branched structures, or analog nucleic acids. Polynucleotides have associated counter ions, such as H+, NH4+, trialkylammonium, Mg2+, Na+ and the like. A polynucleotide can be composed entirely of deoxyribonucleotides, entirely of ribonucleotides, or chimeric mixtures thereof. Polynucleotides can be comprised of nucleobase and sugar analogs. Polynucleotides typically range in size from a few monomeric units, e.g. 5-40, when they are more commonly referred to in the art as oligonucleotides, to several thousands of monomeric nucleotide units. Unless denoted otherwise, whenever a polynucleotide sequence is represented, it will be understood that the nucleotides are in 5' to 3' order from left to right and that "A" denotes deoxyadenosine, "C" denotes deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine.
[0037] The term "reduce" denotes some decrease in amount. In some embodiments, an event is reduced by 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 96, 97, 98, 99, 99.9, 99.99, 99.999, percent or more, including any value above any of the preceding values, as well as any range defined between any two of the preceding values.
[0038] For the present application, the term "whole genome" means "genome wide" rather than requiring that the entire genome of any organism be present.
Genome wide indicates that there is a sufficient variety and selection of various nucleic acids throughout an organism's genome for the technique being performed. The genome wide selection can be random, throughout an organism's genome, or biased to specific areas.
In some embodiments, the genome wide selection is biased to those areas with the specific SNPs to be investigated. In some embodiments it is possible that less than one copy of an entire genome is used, such as in a degraded sample or a haploid sperm cell, as long as sufficient portions of genomic nucleic acid exist at enough SNPs to discriminate between a mixture and a person. This can be as few as 1,000 SNPs, noting that millions of SNPs are known within the human genome. For example, one can identify an individual using only SNPs on chromosome 1.
[0039] The term "test genetic material sample" denotes the sample whose composition is in question. Typically, one would like to know if a specific individual contributed to the genetic material in the test genetic material sample, and/or if other people or organisms contributed to the genetic material in the test genetic material sample. In some embodiments, the test genetic material sample is the sample that is to be or has been assayed for the presence or absence of various SNPs. In some embodiments, the target nucleic acid is contained within the test genetic material sample.
In some embodiments, the target nucleic acid is not within the test genetic material sample. The "sample SNP signature" is the SNP signature for the test genetic material sample.
[0040] The term "SNP signature" denotes one or more various SNPs and the genotype, alleles, and/or percentage thereof for a collection of SNPs to be assessed. A
"reference signature" denotes the alleles present for the SNPs in the reference (or a population thereof). A "test genetic material sample signature" denotes the alleles present for the SNPs in the test genetic material sample. A "subject's SNP signature,"
"Person of Interest's SNP Signature," or other similar term denotes the alleles present for the SNPs in the subject or Person of Interest. The term SNP signature does not require that the entire SNP signature be used (unless the term "entire" is explicitly used).
Thus, comparing, employing and/or using one SNP signature with or to another SNP
signature can be achieved merely by comparing a subset of the frequencies of the various alleles or by other approaches described herein. In addition, while a SNP signature can denote one or more various SNP alleles and their frequency(ies), it should be understood that a comparison of the SNP signatures encompasses any comparison of one or more SNPs from one source to one or more alleles from a second source, as such, "comparing" a first and a second SNP signature does not actually require comparing the frequency statistics for each SNP allele (unless explicitly stated), but can be achieved by comparing and/or analyzing any data or computation that relates to these frequencies. As such, the comparison can also be achieved by comparing values (including raw data) that are used to derive the noted frequencies. It can also be achieved by comparing values that are subsequently derived from the noted frequencies. One of skill in the art will appreciate how to maintain the appropriate relationships between the various SNP
signatures, based upon the present disclosure.
[0041] While the term "person of interest" is occasionally used herein, one of skill in the art will appreciate that the term is generally interchangeable with the term "subject". Thus, in regard to the present disclosure, a "person of interest"
is not limited to a human being and, unless specified, can be any subject, such as any subject that includes genetic material (human, mammal, bacterial, viral, etc.). The term "Person of Interest"
does denote that the subject is the one whose genetic material is being examined in the test genetic material sample. While this subject can typically be human, for example in many forensics tests, it is not limited to humans, unless explicitly noted.
[0042] The term "reference population" denotes a population of one or more reference subjects. The SNP signature of the reference subjects allows for a comparison between the SNP signature of the person of interest and the SNP signature of the test genetic material. A reference population or SNP signature of a reference population is not required for all embodiments disclosed herein. In some embodiments, the reference population and reference SNP signature will have a similar ancestral make-up as that of the sample SNP signature. The term "similar ancestral make-up" can be defined as a genetic distance between individuals or within a population using a set of SNPs or other genetic variants. Thus it is possible for some SNPs to be reserved for assessing ancestry and some SNPs to be reserved for assessing whether a POI is within a mixture. In some embodiments, the reference population should generally match the mixture at the SNPs being investigated.
[0043] A SNP is an inherited substitution of a nucleotide (for example from A to T, A to G, or G to C) found within more than two individuals. Generally, most SNPs have a frequency greater than 0.1%, though lower frequency genetic variants are also envisioned. The methods described herein are extendable to other types of genetic variants, including indels, copy number changes, and/or other structural variants.
GENERAL EMBODIMENTS

[0044] Establishment of test-statistic. There are multiple approaches to derive a test-statistic to evaluate a hypothesis that a subject's genetic material is within a mixture, and these are discussed further herein. In some of the examples below, a frequentist approach is used. In some of the examples below, a Bayesian approach is used. Either can be used depending on the objective of the assay. In some embodiments, other approaches are used without deviating from the present methods.
[0045] An overview of some embodiments of the approach is provided in FIG. 1A. In some embodiments, this method can be summarized as the cumulative sum of allele shifts over all available SNPs, where the shift's sign is defined by whether the individual of interest is closer to a reference sample or closer to the given mixture. One aspect of the invention encompasses genotyping a given SNP of a single person, which addresses the original design of SNP genotyping microarrays. In some embodiments, the method can be further adapted to mixtures and pooled data.
[0046] Genotyping microarray technology can assay millions of SNPs.
Genotypes are expected to result from an assay and the data is categorical in nature, e.g. AA, AB, BB, or NoCall, where A and B symbolically represent the two alleles of a biallelic SNP. However, as evident from copy number, calling algorithm, and pooling-based GWA studies (Pearson et al.; Am J Hum Genet. 2007 Jan;80(1):126-39. Epub 2006 Dec 6.), raw preprocessed data from SNP genotyping arrays is typically in the form of allele intensity measurements that are proportional to the quantity of the "A" and "B" alleles hybridized to a specific probe (also termed a feature) on a microarray.
Individual probe intensity measurements can be derived from the fluorescence measurement of a single bead (e.g. Illumina), micron-scale square on a flat surface (e.g. Affymetrix) or some combination thereof. On a genotyping array, multiple probes are present per SNP at either a fixed number of copies (Affymetrix) or a variable number of copies (Illumina). For example, recent generation Affymetrix arrays typically have 3 to 4 probes specific for the A allele and B allele respectively, whereas Illumina arrays have a random number of probes averaging approximately 18 probes per allele. With 500,000+ SNPs, there are millions of probes (or features) on a SNP genotyping array. While there are considerably different sample preparation chemistries prior to hybridization between SNP
genotyping platforms, any of these chemistries can be used, as they should not impact various embodiments disclosed herein.
[0047] SNP genotyping algorithms typically begin by transforming normalized data into a ratio or polar coordinates. For simplicity, one can utilize a ratio transformation $Y_j = A_j/(A_j + k_j B_j)$, where $A_j$ is the probe intensity of the A allele and $B_j$ is the probe intensity of the B allele in the jth SNP. Multiple papers have shown that the $Y$ transformation approximates allele frequency, where $k_j$ is the SNP-specific correction factor accounting for experimental bias and is easily calculated from individual genotyping data. Thus with this transformation, $Y_j$ is an estimate of allele frequency (termed $p_A$) of each SNP. Since most individuals contain two copies of autosomal SNPs, values of the A allele frequency ($p_A$) in a single individual may be 0%, 50%, or 100% for genotypes BB, AB, or AA, respectively. Equivalently, $Y_j$ will be approximately 0, 0.5, or 1, varying from these values due to measurement noise. By example and assuming $k_j = 1$, probe intensity measurements of $A_j = 450$ and $B_j = 550$ yield $Y_j = 0.45$ and this SNP would be called AB. In a sample from a single individual, one would thus expect to see a trimodal distribution for $Y$ across all SNPs since only AA, AB, or BB genotype calls are expected. However, in a mixture of multiple individuals, the assumptions of the genotype-calling algorithm are invalid, since only AA, AB, BB, or NoCall are given regardless of the number of pooled chromosomes.
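The ratio transformation and the worked example (A = 450, B = 550, k = 1) can be sketched in Python as follows. This is a minimal illustration, not the calling algorithm of any genotyping platform: the function names, the nearest-cluster rule, and the `tolerance` cutoff are assumptions introduced here. Because the ratio is the A-allele fraction, the sketch places AA near 1 and BB near 0.

```python
def allele_frequency_estimate(a_intensity, b_intensity, k=1.0):
    """Ratio transformation Y = A / (A + k * B) for a single SNP."""
    return a_intensity / (a_intensity + k * b_intensity)


def call_genotype(y, tolerance=0.2):
    """Assign the nearest of the three expected clusters.

    `tolerance` is a hypothetical cutoff (not from the text): values
    farther than this from every cluster centre are reported as NoCall.
    """
    clusters = {"BB": 0.0, "AB": 0.5, "AA": 1.0}
    name, centre = min(clusters.items(), key=lambda kv: abs(y - kv[1]))
    return name if abs(y - centre) <= tolerance else "NoCall"


y = allele_frequency_estimate(450, 550)
print(y, call_genotype(y))
```

In a pooled sample the same ratio can take any value between 0 and 1, which is why a categorical call discards the information that the mixture analysis relies on.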
[0048] However, one of skill in the art, given the present disclosure, will be able to extract information and meaning from the relative probe intensity data and so be able to use that data to, for example, identify if a subject contributed to the mixture. In some embodiments of the method, one compares allele frequency estimates from a mixture (termed $M$, where $M_j = A_j/(A_j + k_j B_j)$) to estimates of the mean allele frequencies of a reference population. As used herein, the allele frequency estimates of the mixture are also encompassed within the term sample SNP signature. In addition, as used herein, the mean allele frequency of the reference population is also encompassed within the term reference SNP signature.
[0049] The selection of the reference population, where required, is discussed in more detail below. In some embodiments, one assumes that the reference population has a similar ancestral make-up as that of the mixture. The terms similar population substructure, ethnicity, and ancestral components are used interchangeably, and similar ancestral components of an individual or mixture are defined as having similar allele frequencies across all (or substantially all) SNPs.
[0050] One can let $Y_{i,j}$ be the allele frequency estimate for the individual i and SNP j, where $Y_{i,j} \in \{0, 0.5, 1\}$, from a SNP genotyping array. The allele frequency estimate for the individual is also encompassed within the term subject SNP signature.
[0051] One then compares absolute values of two differences. The first difference $|Y_{i,j} - M_j|$ (which can also be characterized as the absolute value of the sample SNP signature subtracted from the subject SNP signature) measures how the allele frequency of the mixture $M_j$ at SNP j differs from the allele frequency of the individual $Y_{i,j}$ for SNP j (or, put another way, measures how the sample SNP signature differs from the subject SNP signature). The second difference $|Y_{i,j} - Pop_j|$ (which can also be characterized as the absolute value of the reference SNP signature subtracted from the subject SNP signature) measures how the reference population's allele frequency $Pop_j$ differs from the allele frequency of the individual $Y_{i,j}$ for each SNP j (or, put another way, measures how the reference SNP signature differs from the subject SNP signature). The values for $Pop_j$ can be determined from an array of equimolar pooled samples or from databases containing genotype data of various populations. Taking the difference between these two differences, one obtains the distance measure used for individual $Y_i$:
$D(Y_{i,j}) = |Y_{i,j} - Pop_j| - |Y_{i,j} - M_j|$ (Equation 1).
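As a minimal sketch of Equation 1, the per-SNP distance measure can be computed as below. The function name and the four illustrative allele frequencies are hypothetical and are not data from the disclosure.

```python
def distance_measure(y_individual, mixture, population):
    """Per-SNP distance of Equation 1: |Y - Pop| - |Y - M|.

    All three arguments are equal-length sequences of allele frequency
    estimates, one entry per SNP.
    """
    return [
        abs(y - pop) - abs(y - m)
        for y, m, pop in zip(y_individual, mixture, population)
    ]


# Hypothetical values for four SNPs (illustration only).
y_individual = [1.0, 0.5, 0.0, 1.0]
mixture = [0.60, 0.50, 0.35, 0.72]
population = [0.50, 0.50, 0.45, 0.70]

print(distance_measure(y_individual, mixture, population))
```

A positive entry means the individual is closer to the mixture than to the reference population at that SNP; a negative entry means the opposite.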

[0052] As shown in FIG. 1A, under the null hypothesis that the individual is not in the mixture, $D(Y_{i,j})$ approaches zero since the mixture and reference population are calculated to have similar allele frequencies due to having similar ancestral components.
Under the alternative hypothesis, $D(Y_{i,j}) > 0$ since one predicts that $M_j$ is shifted away from the reference population by $Y_i$'s contribution to the mixture. In the case of $D(Y_{i,j}) < 0$, $Y_i$ is more ancestrally similar to the reference population than to the mixture, and thus less likely to be in the mixture. Consistent with the explanation of FIG. 1A, $D(Y_{i,j})$ is positive when $Y_{i,j}$ is closer to $M_j$ and negative when $Y_{i,j}$ is closer to $Pop_j$.
By sampling numerous SNPs (e.g., 500K+ SNPs), one would generally expect $D(Y_{i,j})$ to follow a normal distribution due to the central limit theorem. In some embodiments, one can take a one-sample t-test for the subject, sampled across all (or at least one or more) SNPs, and thus obtain the test statistic:

$T(Y_i) = \dfrac{\mathrm{mean}(D(Y_{i,j})) - \mu_0}{\mathrm{sd}(D(Y_{i,j}))/\sqrt{s}}$ (Equation 2)

In Equation 2, assume $\mu_0$ is the mean of $D(Y_k)$ over individuals $Y_k$ not in the mixture, $\mathrm{sd}(D(Y_{i,j}))$ is the standard deviation of $D(Y_{i,j})$ for all SNPs j and individual $Y_i$, and $\sqrt{s}$ is the square root of the number of SNPs. In some embodiments, one can set $\mu_0$ at zero since a random individual $Y_k$ should be equally distant from the mixture and the mixture's reference population, and so $T(Y_i) = \mathrm{mean}(D(Y_{i,j})) / (\mathrm{sd}(D(Y_{i,j}))/\sqrt{s})$. Under the null hypothesis $T(Y_i)$ is zero and under the alternative hypothesis $T(Y_i) > 0$. In order to account for subtle differences in ancestry between the individual, mixture, and reference populations, one can normalize allele frequency estimates to a reference population. If so many SNPs are used that the test statistic no longer follows a traditional normal distribution because of correlations between markers (for example, when SNPs in linkage disequilibrium are used), one can also use individuals known not to be within the mixture to sample its distribution. In this case, additional methods can also be used to correct and learn the distribution of the test-statistic, such as from the HapMap, and appropriately estimate p-values.
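The one-sample t statistic of Equation 2 can be sketched as follows. The function name is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text only says "standard deviation".

```python
import math
import statistics


def t_statistic(d_values, mu0=0.0):
    """One-sample t statistic over per-SNP distances D.

    d_values: sequence of per-SNP distance measures for one individual.
    mu0: mean of D for individuals not in the mixture (zero by default).
    """
    s = len(d_values)
    mean_d = statistics.mean(d_values)
    sd_d = statistics.stdev(d_values)  # sample standard deviation (assumed)
    return (mean_d - mu0) / (sd_d / math.sqrt(s))


# Hypothetical per-SNP distances for illustration.
print(t_statistic([0.1, 0.0, 0.1, 0.02]))
```

With hundreds of thousands of SNPs the denominator shrinks with the square root of the SNP count, which is why even a small average shift toward the mixture can yield a large statistic.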
[0053] While the above discussion provides an analysis for how data can be compared and analyzed by a frequentist approach, one of skill in the art, given the present disclosure, will appreciate that other approaches are useful as well. For example, as discussed below, a Bayesian approach can be used in some embodiments.
[0054] As discussed above and shown below, high-throughput SNP
genotyping microarrays have the ability to accurately and robustly resolve whether an individual's trace contributions are in a complex genetic material mixture. The following section establishes a probabilistic model and uses Bayesian inference to accurately compare two models: the model where the individual is assumed to be in the mixture and the model where the individual is assumed not to be in the mixture. Using a training dataset, one is able to use the raw data for each probe on a microarray instead of using genotypes from a genotyping calling algorithm or other such data transformation.
Through a posterior odds ratio comparing the two models, one is able to assess the likelihood of the individual being in the mixture using observations on a genomic scale.
With the Bayesian method, one provides further options for using SNPs in identifying individual trace contributors within a test genetic material sample.
[0055] As noted above, one challenge in the field of forensics is to identify whether an individual is present in a highly complex mixture of genomic DNA. As noted herein, this same challenge is present in a variety of other techniques as well, and thus addressing this forensics issue has immediate applications in many other fields. Many methods currently exist that can examine mixtures with a small number of individuals and mixtures composed of thousands of individuals (see, e.g., T. Egeland, I. Dalen, and P.F. Mostad.
Estimating the number of contributors to a DNA profile. Int. J. Legal Med., 117:271-275, Oct 2003; Y.Q. Hu and W.K. Fung. Interpreting DNA mixtures with the presence of relatives. Int. J. Legal Med., 117:39-45, Feb 2003; and D.J. Balding.
Likelihood-based inference for genetic correlation coefficients. Theor Popul Biol, 63:221-230, May 2003).
These methods include using short tandem repeats (STRs) to generate DNA
profiles, including STRs on the Y chromosome specifically used to identify the male components of the mixture (see, e.g., T.M. Clayton, J.P. Whitaker, R. Sparkes, and P.
Gill. Analysis and interpretation of mixed forensic stains using DNA STR profiling. Forensic Sci. Int., 91:55-70, Jan 1998; R.G. Cowell, S.L. Lauritzen, and J. Mortera.
Identification and separation of DNA mixtures using peak area information. Forensic Sci. Int., 166:28-34, Feb 2007; M. Bill, P. Gill, J. Curran, T. Clayton, R. Pinchin, M. Healy, and J. Buckleton.
PENDULUM - a guideline-based approach to the interpretation of STR mixtures.
Forensic Sci. Int., 148:181-189, Mar 2005; and M.A. Jobling and P. Gill. Encoded evidence:
DNA in forensic analysis. Nat. Rev. Genet., 5:739-751, Oct 2004). Methods using mitochondrial DNA (mtDNA) are useful when analyzing severely degraded DNA and can be used jointly with STRs (Goodwin, A. Linacre, and P. Vanezis. The use of mitochondrial DNA
and short tandem repeat typing in the identification of air crash victims.
Electrophoresis, 20:1707-1711, Jun 1999). A number of methods have also investigated using a very small number of SNPs with mtDNA to mitigate specific problems with mtDNA (M.D.
Coble, R.S. Just, J.E. O'Callaghan, I.H. Letmanyi, C.T. Peterson, J.A. Irwin, and T.J. Parsons.
Single nucleotide polymorphisms over the entire mtDNA genome that increase the power of forensic testing in Caucasians. Int. J. Legal Med., 118:137-146, Jun 2004;
T.J. Parsons and M.D. Coble. Increasing the forensic discrimination of mitochondrial DNA
testing through analysis of the entire mitochondrial DNA genome. Croat. Med. J., 42:304-309, Jun 2001; R.S. Just, J.A. Irwin, J.E. O'Callaghan, J.L. Saunier, M.D. Coble, P.M. Vallone, J.M. Butler, S.M. Barritt, and T.J. Parsons. Toward increased utility of mtDNA
in forensic identifications. Forensic Sci. Int., 146 Suppl:S147-149, Dec 2004;
and P.M.
Vallone, R.S. Just, M.D. Coble, J.M. Butler, and T.J. Parsons. A multiplex allele specific primer extension assay for forensically informative SNPs distributed throughout the mitochondrial genome. Int. J. Legal Med., 118:147-157, Jun 2004) but have not investigated SNPs exclusively on the genomic scale as the determining factor for inclusion in a complex mixture. Recently, Homer et al. (Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays, the entirety of which is hereby incorporated by reference) and the present disclosure have demonstrated that high-throughput SNP genotyping microarrays have the ability to accurately and robustly resolve whether an individual's trace contributions are in a complex genomic DNA
mixture. This genomic approach does not target specific sequences, regions or a small number of polymorphisms, but instead can employ multiplex experiments performed on SNP microarrays to resolve whether an individual is present in a complex mixture. In some embodiments, this method also does not rely on knowing the number of individuals in the mixture. SNP microarrays have been widely used in Genome-wide Association studies, and when applied to forensics, SNP microarrays offer a level of multiplexing not previously found in other methods. Nevertheless, Homer et al. (and the results discussed above and in Example 1) provide a frequentist approach based on cumulative shifts of relative allele signals across all SNPs to provide a significance value for the null hypothesis, where the individual is assumed not to be in the mixture. In some embodiments, two microarrays can be run, one using DNA from the individual of interest and one using the pool of DNA from the mixture. This allows one to use a reference population for comparison, allowing one to accurately identify if an individual is present in the mixture. Additionally, this can be achieved even if a relative's DNA
was used as a proxy for the individual of interest. Although such an embodiment performs well for many complex mixtures, other approaches can be used and as such, a probabilistic model is presented in the following section.

Bayesian

[0056] The following section discloses a probabilistic model based on the total observations at the raw intensity level for SNP microarrays to accurately assess the likelihood that the individual of interest (e.g., subject) is or is not in the complex mixture (e.g., test genetic material sample). Additionally, a training dataset was used to estimate the probability distribution of the raw intensity level observations. Two models were compared, one where the individual of interest is assumed to be in the mixture, and another where the individual of interest is assumed not to be in the mixture, in the form of a posterior odds ratio. The likelihood of each of the two models was derived using Bayesian inference to accurately assess the probability of the observations.
With this embodiment, a more robust and accurate model of the observations was created, giving a better statistical measure of evidence. As the number of SNPs available on current microarray technologies continues to increase, so will the accuracy of various embodiments of the method to identify the contribution of an individual to a highly complex mixture.
Models

Two Competing Models

[0057] The modeling is performed to identify whether or not an individual is present within a given complex mixture. Therefore one can examine the odds ratio between two competing models, one where the individual is assumed to be in the mixture (denoted $\theta_A$) and one where the individual is assumed not to be in the mixture (denoted $\theta_0$). There are two distinct sets of observations, one from the individual of interest and one from the complex mixture. The observations for the individual of interest are denoted as $x$ and the observations for the complex mixture are denoted as $y$, for all $s$ SNPs. For SNP $i$ the observation $x_i$ for the individual of interest (e.g., subject) is a raw intensity value, and the observation $y_i$ for the complex mixture is similarly defined.
[0058] On a given microarray there are typically multiple probes per SNP as well as pairs of intensity values per probe. One can choose to treat each probe value (a pair of intensity values) separately or combine the probes into a single measure. For this analysis, the probe values can be combined by taking the mean probe value over all probes, and combining the pair of intensity values into a simple ratio of the two values. For example, if one had the intensity pair X and Y, one can use the ratio $X/(X+Y)$ or, for a more elegant ratio, an arctangent of the ratio of the two values. Combining the intensity values in this manner has been used in previous studies using complex mixtures of DNA, namely pooling-based Genome-wide Association studies (J.V. Pearson, M.J. Huentelman, R.F. Halperin, W.D. Tembe, S. Melquist, N. Homer, M. Brun, S. Szelinger, K.D. Coon, V.L. Zismann, J.A. Webster, T. Beach, S.B. Sando, J.O. Aasly, R. Heun, F. Jessen, H. Kolsch, M. Tsolaki, M. Daniilidou, E.M. Reiman, A. Papassotiropoulos, M.L. Hutton, D.A. Stephan, and D.W. Craig. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am. J. Hum. Genet., 80:126-139, Jan 2007) and this method was adopted.
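One possible reading of the probe-combination step is sketched below: the intensity pairs of all probes for a SNP are averaged, and the averaged pair is then reduced to a single ratio. The function name and input layout are hypothetical, and the specific ratio X / (X + Y) is an assumption chosen by analogy with the ratio transformation used for allele frequency estimates.

```python
from statistics import mean


def combine_probes(probe_pairs):
    """Collapse several (X, Y) probe intensity pairs for one SNP to one value.

    Step 1: take the mean of each channel over all probes.
    Step 2: reduce the mean pair to the ratio X / (X + Y) (assumed form).
    """
    mean_x = mean([x for x, _ in probe_pairs])
    mean_y = mean([y for _, y in probe_pairs])
    return mean_x / (mean_x + mean_y)


# Two hypothetical probes for the same SNP.
print(combine_probes([(400.0, 600.0), (500.0, 500.0)]))
```

Averaging before taking the ratio lets brighter probes carry more weight; taking the ratio per probe and then averaging would be an equally defensible variant.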
[0059] To compare the two models, the posterior odds ratio $\Pr(y \mid x, \theta_A) / \Pr(y \mid x, \theta_0)$ was examined. If the odds ratio is large, then this gives evidence that the individual of interest is in the mixture. If the odds ratio is small, then this gives evidence that the individual of interest is not in the mixture. In this manner one is able to resolve whether the individual is present within the complex mixture.

Likelihoods

[0060] Suppose one has $s$ SNPs; one denotes the observations as $y = (y_1, \ldots, y_s)$ and $x = (x_1, \ldots, x_s)$. Nevertheless, to formulate a likelihood correctly a number of hidden variables should be known. Let $\eta + 2$ be the number of chromosomes in the mixture. Since each individual in the mixture contributes two chromosomes, $\eta$ is a multiple of two. For each SNP $i$, suppose one has the two alleles A and B. One should then know the number of A alleles in the mixture, $\kappa_i$, and the number of A alleles in the person of interest, $\beta_i$. Since by definition $\eta$, $\kappa_i$, and $\beta_i$ are hidden, to compute the likelihood of either model one should sum over all possible values for these three hidden variables. For consistency, Greek letters were used for hidden variables and Roman letters for observed variables.

Training Dataset

[0061] Given the observed and hidden variables, more information is useful to accurately compute the likelihoods. Since one has raw intensity values instead of genotypes for both the mixture and the person of interest, one should know the conditional probability $\Pr(R_i = r_i \mid \Gamma_i = \gamma_i)$ for $\gamma_i \in \{0, 1, 2\}$. This is the conditional probability that for SNP $i$ the relative intensity value is $r_i$ given that the hidden unordered genotype is $\gamma_i$, where one denotes the unordered genotype A/A to be 0, A/B to be 1, and B/B to be 2. Again one does not know $\gamma_i$ for each SNP $i$ and each individual in the mixture or for the individual of interest, but in this case one can estimate the distribution of these probabilities by using a training dataset from the HapMap Project (The International HapMap Project. Nature, 426:789-796, Dec 2003). From the HapMap Project one is able to obtain for a given individual both the consensus genotype calls and raw intensity values for each SNP on the Affymetrix 5.0 platform. The HapMap project has this information for 270 individuals from four distinct populations. Additionally, the genotypes for each SNP were not only derived from the corresponding raw intensity values but also from other microarray platforms and replicate experiments, resulting in a consensus genotype call for each SNP. This gives one further assurance that the genotype call is correct.
[0062] Therefore for each SNP $i$ one can plot three distributions for $r_i$ given each of the possible unordered genotypes $\gamma_i$. To simplify, one assumes that the three distributions $\Pr(R_i = r_i \mid \Gamma_i = 0)$, $\Pr(R_i = r_i \mid \Gamma_i = 1)$, and $\Pr(R_i = r_i \mid \Gamma_i = 2)$ follow normal distributions $N(\mu_0, \sigma_0)$, $N(\mu_1, \sigma_1)$, and $N(\mu_2, \sigma_2)$ respectively. One can estimate $\mu_0, \mu_1, \mu_2, \sigma_0, \sigma_1, \sigma_2$ easily from the training data set and use these parameters in the calculation of the likelihoods.
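Estimating the three normal distributions for one SNP from training data can be sketched as follows. The function name, the input layout (pairs of relative intensity and consensus genotype code), and the toy numbers are hypothetical; real training data would come from a resource such as the HapMap Project.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_genotype_clusters(training):
    """Estimate (mean, standard deviation) of intensity per genotype code.

    training: iterable of (relative_intensity, genotype_code) pairs for a
    single SNP, with genotype_code in {0, 1, 2}. Each genotype needs at
    least two observations for the sample standard deviation.
    """
    by_genotype = defaultdict(list)
    for intensity, genotype in training:
        by_genotype[genotype].append(intensity)
    return {g: (mean(v), stdev(v)) for g, v in by_genotype.items()}


# Toy training set for one SNP (illustration only).
training = [(0.02, 0), (0.04, 0), (0.48, 1), (0.52, 1), (0.97, 2), (0.99, 2)]
for genotype, (mu, sigma) in sorted(fit_genotype_clusters(training).items()):
    print(genotype, round(mu, 3), round(sigma, 3))
```

Fitting per SNP rather than globally matters because probe behaviour varies between SNPs, which is the same reason a SNP-specific correction factor appears in the ratio transformation.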
[0063] Finally, this training data set gives, for each SNP $i$, the population allele frequency of A, denoted $p_i$. It is useful when selecting the training dataset population to consider the ancestry of the population since allele frequencies can vary across populations, and therefore introduce systematic biases in the model.
Nevertheless, if SNPs used in the likelihood calculations are chosen to be ancestrally unbiased and unlinked, one avoids an admixture problem and can treat each SNP independently.

Computing the likelihood of $\theta_0$

[0064] First, the model $\theta_0$, with the assumption that the person of interest is not in the mixture, is examined. Therefore the likelihood of $\theta_0$ is just $\Pr(y \mid x, \theta_0)$. Since one does not observe the number of chromosomes in the mixture $\eta$, one can sum over all possible values of $\eta$:

$$\Pr(y \mid x, \theta_0) = \sum_{\eta \ge 0} I_{\{\eta \% 2 = 0\}} \Pr(y \mid \eta, x, \theta_0) \Pr(\eta \mid x, \theta_0)$$

where $I_{\{\eta \% 2 = 0\}}$ is one if $\eta$ is a multiple of two, zero otherwise. One can assume an uninformative (uniform) prior for $\eta$ as well as setting a limit on the maximum value for $\eta$ given the specific scenario. Therefore one lets $\Pr(\eta \mid x, \theta_0)$ be uniform over all values of $\eta$.

[0065] Since each SNP was defined to be independent, one can simply examine each SNP $i$ independently and take the product over the probabilities for each SNP so that

$$\Pr(y \mid \eta, x, \theta_0) = \prod_{i=1}^{s} \Pr(y_i \mid \eta, x_i, \theta_0)$$

To calculate $\Pr(y_i \mid \eta, x_i, \theta_0)$ one should know the number of A alleles in the mixture, denoted $\kappa_i$. Since $\kappa_i$ is hidden, one can simply sum over all possible values of $\kappa_i$. In the $\theta_0$ model, the individual of interest is not in the mixture, so $\kappa_i$ can range from 0 to $\eta + 2$, giving

$$\Pr(y_i \mid \eta, x_i, \theta_0) = \sum_{\kappa_i=0}^{\eta+2} \Pr(y_i \mid \kappa_i, \eta, x_i, \theta_0) \Pr(\kappa_i \mid \eta, x_i, \theta_0)$$

One assumes that $\Pr(\kappa_i \mid \eta, x_i, \theta_0)$ follows a binomial distribution $B(\eta + 2, p_i)$ where $p_i$ is the allele frequency of allele A obtained from the training dataset. Therefore one has

$$\Pr(\kappa_i \mid \eta, x_i, \theta_0) = \binom{\eta+2}{\kappa_i} p_i^{\kappa_i} (1 - p_i)^{\eta + 2 - \kappa_i}$$

[0066] Additionally, one does not directly observe the number of A alleles for the individual of interest $\beta_i$, so one simply sums over all possible values of $\beta_i$, giving

$$\Pr(y_i \mid \kappa_i, \eta, x_i, \theta_0) = \sum_{\beta_i=0}^{2} \Pr(y_i \mid \beta_i, \eta, \kappa_i, \theta_0) \Pr(\beta_i \mid x_i, \eta, \kappa_i, \theta_0)$$

To calculate the final two probabilities $\Pr(y_i \mid \beta_i, \eta, \kappa_i, \theta_0)$ and $\Pr(\beta_i \mid x_i, \eta, \kappa_i, \theta_0)$ one uses the three probability distributions estimated from the training dataset: $\Pr(R_i = r_i \mid \Gamma_i = 0)$, $\Pr(R_i = r_i \mid \Gamma_i = 1)$, and $\Pr(R_i = r_i \mid \Gamma_i = 2)$. Since it was assumed that these three distributions were normally distributed, one has that

$$\Pr(y_i \mid \beta_i, \eta, \kappa_i, \theta_0) = \Pr(y_i \mid \eta, \kappa_i, \theta_0) = N(\mu_{\lambda_i}, \sigma_{\lambda_i})$$

Here one has that $\lambda_i = \kappa_i / (\eta + 2)$. To smoothly interpolate between the three different distributions, if $\lambda_i > 0.5$ then $\mu_{\lambda_i} = \mu_2 (2\lambda_i - 1) + \mu_1 (2 - 2\lambda_i)$, and if $\lambda_i < 0.5$ then $\mu_{\lambda_i} = \mu_1 (2\lambda_i) + \mu_0 (1 - 2\lambda_i)$. For the second probability one similarly has

$$\Pr(\beta_i \mid x_i, \eta, \kappa_i, \theta_0) = \Pr(\beta_i \mid x_i) = N(\mu_{\beta_i}, \sigma_{\beta_i})$$

Since $\beta_i$ is zero, one, or two, one knows which distribution to use because one can infer the unordered genotype from $\beta_i$. If $\beta_i = 0$ then $\mu_{\beta_i} = \mu_0$ and $\sigma_{\beta_i} = \sigma_0$, if $\beta_i = 1$ then $\mu_{\beta_i} = \mu_1$ and $\sigma_{\beta_i} = \sigma_1$, and if $\beta_i = 2$ then $\mu_{\beta_i} = \mu_2$ and $\sigma_{\beta_i} = \sigma_2$.
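A minimal per-SNP sketch of the likelihood under the model in which the individual is not in the mixture, for one fixed number of mixture chromosomes, is given below. All function and parameter names are hypothetical. Two points are assumptions rather than statements of the text: the standard deviation is interpolated with the same linear rule as the mean, and the cluster parameters are indexed by the number of A alleles (0, 1, 2), which is the convention the interpolation rule implies.

```python
import math


def normal_pdf(value, mu, sigma):
    """Density of a normal distribution N(mu, sigma) at `value`."""
    z = (value - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def interpolate(params, lam):
    """Linear interpolation between the cluster values at lam = 0, 0.5, 1."""
    p0, p1, p2 = params
    if lam > 0.5:
        return p2 * (2 * lam - 1) + p1 * (2 - 2 * lam)
    return p1 * (2 * lam) + p0 * (1 - 2 * lam)


def snp_likelihood_not_in_mixture(y_i, x_i, eta, p_i, mus, sigmas):
    """Likelihood of mixture value y_i for one SNP and one value of eta.

    eta + 2 is the number of chromosomes in the mixture; p_i is the
    population frequency of allele A; mus and sigmas hold the three
    cluster parameters indexed by the number of A alleles (assumed).
    """
    n = eta + 2
    # Term for the individual of interest: it does not depend on kappa.
    beta_term = sum(normal_pdf(x_i, mus[b], sigmas[b]) for b in range(3))
    total = 0.0
    for kappa in range(n + 1):
        lam = kappa / n
        prior = math.comb(n, kappa) * p_i ** kappa * (1 - p_i) ** (n - kappa)
        # Interpolating sigma like mu is an assumption of this sketch.
        like = normal_pdf(y_i, interpolate(mus, lam), interpolate(sigmas, lam))
        total += like * beta_term * prior
    return total


mus, sigmas = (0.0, 0.5, 1.0), (0.05, 0.05, 0.05)
print(snp_likelihood_not_in_mixture(0.5, 0.5, eta=8, p_i=0.5, mus=mus, sigmas=sigmas))
```

Because the individual's term factors out of the sum over the mixture allele count, the individual's genotype cannot shift the expected mixture value under this model, which is exactly what distinguishes it from the competing model.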

Computing the likelihood of OA

[00671 Next one examines the model OA with the assumption that the person of interest is in the mixture. Therefore the likelihood of OA is just Pr(y I z , OA ). Since one does not observe the number of chromosomes in the mixture r) one should sum over all possible values of r).

x, &4) - E I(11%2=0} '?'(y ( `f7.~, OA)P?'(fi? I ~ 6A) 71=0 where I{i%2=o} is one if q is a multiple of two, zero otherwise. Similar to the Oo model one can assume an uniformative (uniform) prior for r1 as well as setting a limit on the maximum value for rl given the specific scenario. Therefore one lets Pr(ri I
z, BA) be uniform over all values of il.
[00681 Since each SNP was defined to be independent one can simply examine each SNP i independently and take the product over the probabilities for each SNP so that Pr(j 17TH, BA) = Pr(?li I 77t xi OA) i=O
Under the 8A model one assumes that the individual of interest is in the mixture.
Therefore unlike the 8 0 model one has that the number of A alleles in the mixture is partly dependent on (3;. Therefore one first sums over all possible values for R;:

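$$\Pr(y_i \mid \eta, x_i, \theta_A) = \sum_{\beta_i} \Pr(y_i \mid \beta_i, \eta, \theta_A) \Pr(\beta_i \mid \eta, x_i, \theta_A)$$

One assumes that the individual of interest (e.g., subject) contributes two chromosomes to the mixture. Thus when one sums over all possible values of $\kappa_i$, one allows $\kappa_i$ to range from 0 to $\eta$, excluding the two chromosomes determined by $\beta_i$. Therefore one has that

$$\Pr(y_i \mid \beta_i, \eta, \theta_A) = \sum_{\kappa_i=0}^{\eta} \Pr(y_i \mid \kappa_i, \beta_i, \eta, \theta_A) \Pr(\kappa_i \mid \eta, \beta_i, \theta_A)$$

One assumes that $\Pr(\kappa_i \mid \eta, \beta_i, \theta_A)$ follows a binomial distribution $B(\eta, p_i)$ where $p_i$ is the allele frequency of allele A obtained from the training dataset. Therefore one has

$$\Pr(\kappa_i \mid \eta, \beta_i, \theta_A) = \binom{\eta}{\kappa_i} p_i^{\kappa_i} (1 - p_i)^{\eta - \kappa_i}.$$

Finally, similar to the $\theta_0$ model, one finds the probabilities $\Pr(y_i \mid \kappa_i, \beta_i, \eta, \theta_A)$ and $\Pr(\beta_i \mid \eta, x_i, \theta_A)$ by using the three probability distributions obtained from the training dataset: $\Pr(R_i = r_i \mid \Gamma_i = 0)$, $\Pr(R_i = r_i \mid \Gamma_i = 1)$, and $\Pr(R_i = r_i \mid \Gamma_i = 2)$. Therefore one has that

$$\Pr(y_i \mid \kappa_i, \beta_i, \eta, \theta_A) = \Pr(y_i \mid \lambda_i) = \mathcal{N}(\mu_{\lambda_i}, \sigma_{\lambda_i}).$$

Here one has that $\lambda_i = (\kappa_i + \beta_i) / (\eta + 2)$. This definition of $\lambda_i$ differs from the one under the $\theta_0$ model since one now has conditioned on the individual of interest contributing $\beta_i$ A alleles. Similar to $\theta_0$, one smoothly interpolates between the three different distributions: if $\lambda_i > 0.5$ then $\mu_{\lambda_i} = \mu_2(2\lambda_i - 1) + \mu_1(2 - 2\lambda_i)$, and if $\lambda_i < 0.5$ then $\mu_{\lambda_i} = \mu_1(2\lambda_i) + \mu_0(1 - 2\lambda_i)$.

[0069] For the second probability one similarly has

$$\Pr(\beta_i \mid \eta, x_i, \theta_A) = \Pr(\beta_i \mid x_i) = \mathcal{N}(\mu_{\beta_i}, \sigma_{\beta_i}).$$

Since $\beta_i$ is zero, one, or two, one knows which distribution to use because one can infer the unordered genotype from $\beta_i$. If $\beta_i = 0$ then $\mu_{\beta_i} = \mu_0$ and $\sigma_{\beta_i} = \sigma_0$; if $\beta_i = 1$ then $\mu_{\beta_i} = \mu_1$ and $\sigma_{\beta_i} = \sigma_1$; and if $\beta_i = 2$ then $\mu_{\beta_i} = \mu_2$ and $\sigma_{\beta_i} = \sigma_2$.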
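The per-SNP likelihood under the model in which the individual is present can be illustrated with a minimal Python sketch. The function names are hypothetical, the training-set parameters are passed in as plain tuples, the standard deviation is interpolated in the same way as the mean (an assumption, since only the interpolation of the mean is spelled out), and the second probability is evaluated as a normal density at the individual's observed value (also an assumption about how the notation is applied).

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interpolate(lam, v0, v1, v2):
    """Piecewise-linear interpolation between the three genotype parameters."""
    if lam >= 0.5:
        return v2 * (2.0 * lam - 1.0) + v1 * (2.0 - 2.0 * lam)
    return v1 * (2.0 * lam) + v0 * (1.0 - 2.0 * lam)

def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def snp_likelihood_present(y_i, x_i, eta, p_i, mus, sigmas):
    """Likelihood of the mixture observation y_i for one SNP, given eta other
    chromosomes and the individual's observation x_i (individual present)."""
    total = 0.0
    for beta in (0, 1, 2):  # A alleles contributed by the individual
        pr_beta = normal_pdf(x_i, mus[beta], sigmas[beta])
        inner = 0.0
        for kappa in range(eta + 1):  # A alleles among the other chromosomes
            lam = (kappa + beta) / (eta + 2)
            mu = interpolate(lam, *mus)
            sigma = interpolate(lam, *sigmas)  # assumption: same rule as the mean
            inner += normal_pdf(y_i, mu, sigma) * binomial_pmf(kappa, eta, p_i)
        total += inner * pr_beta
    return total
```

The two nested loops mirror the two sums in the derivation: the outer loop runs over the individual's genotype and the inner loop over the number of A alleles contributed by the rest of the mixture.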

COMPUTATIONAL COMPLEXITY

[0070] One first observes that computing the probability mass function of the binomial distribution is not a constant-time operation and depends on both $\eta$ and $\kappa_i$ in the specific application. Naively this is dominated by $\eta$ multiplications (of $p_i$ and $(1 - p_i)$ combined) and the binomial coefficient term $\binom{\eta+2}{\kappa_i}$, which in the worst case requires $O(\eta)$ operations. One must also compute the probability density function of the normal distribution. Let the time to compute this be $t$.
[0071] Let $M$ be the maximum value for $\eta$. It is then easy to see that the time to compute $\theta_0$ or $\theta_A$ is simply

$$\sum_{\eta=0}^{M} \sum_{i=0}^{S} \sum_{\kappa_i=0}^{\eta+2} O(t) = \sum_{\eta=0}^{M} O(\eta S t) = O(M^2 S t).$$

The space complexity for this algorithm is $O(1)$ since one can examine each SNP independently.

EXTENSIONS
[0072] One practical consideration for the above model is its implementation. When computing these probabilities, some of the probabilities calculated above may approach zero and therefore become $-\infty$ when calculated in log space. Care should therefore be taken to perform the computations in log space without introducing errors.
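One common way to carry out such sums in log space is the log-sum-exp technique, sketched below in Python. This is an illustrative sketch rather than the code of Appendix A; the function names are hypothetical.

```python
import math

def log_sum_exp(log_terms):
    """Return log(sum(exp(t))) for the given log-space terms without underflow."""
    finite = [t for t in log_terms if t != -math.inf]
    if not finite:
        # Every term is a zero probability, so the sum is zero as well.
        return -math.inf
    largest = max(finite)
    # Factoring out the largest term keeps every exponent at or below zero.
    return largest + math.log(sum(math.exp(t - largest) for t in finite))

def log_product(log_factors):
    """A product of probabilities (e.g., over independent SNPs) is a sum of logs."""
    return sum(log_factors)
```

For example, `log_sum_exp([-1000.0, -1000.0])` evaluates to approximately `-1000 + log(2)`, whereas exponentiating each term directly would underflow to zero.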
[0073] There are a number of extensions to this method that can improve the model. Firstly, one can make sure to select a set of SNPs that are independent, since one treats each SNP independently in the calculation. For example, on the Affymetrix 5.0 SNP microarray platform there are approximately 500,000 SNPs. After filtering to ensure that SNPs are not correlated, the resulting set of SNPs is approximately one-tenth the size of the original set. To be sure, one is throwing out a lot of redundant and useful information. An extension of the method is not to assume independence between SNPs and instead adjust for the correlation between SNPs, thus utilizing the full set of SNPs present on current microarray platforms.
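As an illustration of how a set of approximately independent SNPs might be selected, the following Python sketch greedily keeps a SNP only if its squared correlation with every SNP already kept is below a threshold. The function names, the use of squared Pearson correlation on 0/1/2 genotype codes, and the threshold of 0.2 are all assumptions made for this example and are not taken from the specification.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    return cov * cov / (var_a * var_b)

def prune_correlated_snps(genotypes, max_r2=0.2):
    """genotypes[j] holds the 0/1/2 genotype codes of SNP j across the
    training individuals. Returns the indices of the SNPs that are kept."""
    kept = []
    for idx, column in enumerate(genotypes):
        if all(r_squared(column, genotypes[k]) <= max_r2 for k in kept):
            kept.append(idx)
    return kept
```

This sketch compares every SNP against all kept SNPs; in practice comparisons are usually restricted to SNPs that are physically near one another on the chromosome, which reduces the cost substantially.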
[0074] One also implicitly assumes that the mixture and individual of interest have the same ancestral make-up as the training dataset. For example, if the individual of interest and mixture are ancestrally Native American, one may lose power if one uses a Caucasian or Asian training dataset. To correct for this problem, one can choose training datasets that reflect the ancestry of the mixture and individual of interest.
Additionally, one can also choose SNPs whose allele frequency does not vary across populations.

[0075] Since one assumes that the probability of $\kappa_i$ is binomially distributed, one implicitly assumes Hardy-Weinberg Equilibrium (HWE). This is not true for many SNPs, and one can take care when calculating the allele frequency $p_i$ from the training set.
One could instead test for HWE for each SNP by using a training dataset and exclude a certain percentage of SNPs from further analysis.
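A per-SNP test for HWE could, for example, be a one-degree-of-freedom chi-square goodness-of-fit test on the genotype counts in the training dataset, as in the following Python sketch. The function names are hypothetical, and the critical value of 3.841 (a 0.05 significance level) is an assumption; one could equally rank SNPs by the statistic and exclude a chosen percentage.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with the
    counts expected under Hardy-Weinberg Equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def passes_hwe(n_aa, n_ab, n_bb, critical_value=3.841):
    """True if the SNP shows no significant departure from HWE."""
    return hwe_chi_square(n_aa, n_ab, n_bb) <= critical_value
```

For instance, counts of 25, 50, and 25 match the HWE expectation exactly, whereas counts of 50, 0, and 50 (no heterozygotes) depart from it strongly and would be excluded.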
[0076] In the analysis for each SNP, multiple probes were combined and for each probe the relative intensity values were combined. To extend the method and to completely use the raw data values, one can treat the probes as multiple identically distributed observations for the given SNP, and treat each intensity value for the probe separately. Therefore when one computes $\Pr(R_i = r_i \mid \Gamma_i = \gamma_i)$ one would have six distributions instead of three, reflecting the fact that the intensity values for each allele were treated separately.
[0077] In the above section, a probabilistic model was established for identifying trace contributions of an individual within a complex DNA mixture. Previous methods relied on sequencing or probing small portions of DNA or mtDNA (T. Egeland, I. Dalen, and P.F. Mostad. Estimating the number of contributors to a DNA profile. Int. J. Legal Med., 117:271-275, Oct 2003; Y.Q. Hu and W.K. Fung. Interpreting DNA mixtures with the presence of relatives. Int. J. Legal Med., 117:39-45, Feb 2003; D.J. Balding. Likelihood-based inference for genetic correlation coefficients. Theor Popul Biol, 63:221-230, May 2003; T.M. Clayton, J.P. Whitaker, R. Sparkes, and P. Gill. Analysis and interpretation of mixed forensic stains using DNA STR profiling. Forensic Sci. Int., 91:55-70, Jan 1998; R.G. Cowell, S.L. Lauritzen, and J. Mortera. Identification and separation of DNA mixtures using peak area information. Forensic Sci. Int., 166:28-34, Feb 2007; M. Bill, P. Gill, J. Curran, T. Clayton, R. Pinchin, M. Healy, and J. Buckleton. PENDULUM - a guideline-based approach to the interpretation of STR mixtures. Forensic Sci. Int., 148:181-189, Mar 2005; M.A. Jobling and P. Gill. Encoded evidence: DNA in forensic analysis. Nat. Rev. Genet., 5:739-751, Oct 2004; W. Goodwin, A. Linacre, and P. Vanezis. The use of mitochondrial DNA and short tandem repeat typing in the identification of air crash victims. Electrophoresis, 20:1707-1711, Jun 1999; M.D. Coble, R.S. Just, J.E. O'Callaghan, I.H. Letmanyi, C.T. Peterson, J.A. Irwin, and T.J. Parsons. Single nucleotide polymorphisms over the entire mtDNA genome that increase the power of forensic testing in Caucasians. Int. J. Legal Med., 118:137-146, Jun 2004; T.J. Parsons and M.D. Coble. Increasing the forensic discrimination of mitochondrial DNA testing through analysis of the entire mitochondrial DNA genome. Croat. Med. J., 42:304-309, Jun 2001; R.S. Just, J.A. Irwin, J.E. O'Callaghan, J.L. Saunier, M.D. Coble, P.M. Vallone, J.M. Butler, S.M. Barritt, and T.J. Parsons. Toward increased utility of mtDNA in forensic identifications. Forensic Sci. Int., 146 Suppl:S147-149, Dec 2004; and P.M. Vallone, R.S. Just, M.D. Coble, J.M. Butler, and T.J. Parsons. A multiplex allele specific primer extension assay for forensically informative SNPs distributed throughout the mitochondrial genome. Int. J. Legal Med., 118:147-157, Jun 2004) and did not use the whole genome (or genome-wide analysis) to answer this question. With the increasing density and decreasing price of current SNP microarray technologies, it is feasible to probe over a million SNPs for under one thousand dollars, thus giving a genomic perspective on this problem.
[0078] The above analysis leverages the number of SNPs on the microarrays to accurately assess the probability that an individual of interest (e.g., subject) is present within a highly complex mixture. Since the number of SNPs on microarrays is now over one million, one is able to obtain a sufficient number of observations to determine inclusion when compared to previous methods. This embodiment of the method specifically computes the posterior odds ratio between two models. The first model assumes the individual of interest is not present in the mixture and the second model assumes the individual of interest is present in the mixture. One then derives a likelihood function for both models given the observations of the mixture and individual of interest.
A training dataset is used to provide for each SNP probability distributions for the observed probe intensity values given the unordered genotypes. While the above Bayesian approach demonstrates some embodiments for performing the comparison or methods described herein, these processes or steps are not required for all of the embodiments described herein. While the above description (and below demonstration of the above described process) establishes the proof of concept and functionality of various embodiments of the invention, one of skill in the art will appreciate that there are a wide variety of techniques or operations by which the general method can be performed and how it can be put to practical use. While only a summary of some of the possible embodiments, FIG. 1B depicts a more schematic representation of how the genetic material matching techniques described herein can be employed.
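The posterior odds ratio between the two models summarized above can be sketched in Python as follows, working in log space. The function names and the default prior of 0.5 for the individual being present are assumptions made for this example; the two log-likelihood arguments stand for the log of the likelihood under each model.

```python
import math

def log_posterior_odds(log_lik_present, log_lik_absent, prior_present=0.5):
    """Log posterior odds of 'individual present' versus 'individual absent'."""
    if not 0.0 < prior_present < 1.0:
        raise ValueError("prior_present must lie strictly between 0 and 1")
    log_prior_odds = math.log(prior_present) - math.log1p(-prior_present)
    return (log_lik_present - log_lik_absent) + log_prior_odds

def posterior_probability_present(log_odds):
    """Convert log posterior odds to a posterior probability without overflow."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Positive log odds favor the model in which the individual is present in the mixture, and negative log odds favor the model in which the individual is absent.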
[0079] As shown in FIG. 1B, in some embodiments, one can initially start some of the embodiments described herein by optionally obtaining a sample that can (but need not) include genetic material (e.g., a test genetic material sample) as shown in process 10. One can then, optionally, purify and/or amplify at least some of any genetic material within the sample as shown in process 20. One can then, optionally, prepare the sample to be run on a SNP array as shown in process 30. One can then, optionally, determine one or more SNPs in the sample to obtain a sample SNP signature as shown in process 40. One can then, optionally, obtain a SNP signature of a reference population as shown in process 50. This SNP signature can be, for example, created by a SNP
analysis of a reference population, or obtainable in data form. One can then, optionally, obtain a SNP signature of a subject, as shown in process 60. One can then determine if there is a direction or bias of an allele count and/or frequency within the sample relative to the reference and/or the subject's signature as shown in process 70. One can then, optionally, analyze the direction or bias to determine a likelihood that the subject's genetic material is in the sample as shown in process 80. One can, optionally, have any of the results from the above processes output to an end user or memory 90. In some embodiments, one can, optionally, output any correlation (or lack thereof) between the subject SNP
signature and the sample SNP signature and/or the reference SNP signature to an end user, display, memory, and/or computer readable storage. In some embodiments, this information is output or provided to the subject.
[0080] In some embodiments, any one or more of the processes in FIG. 1B
are performed by a module configured to perform the process, which, optionally, can be part of a system. Thus, in some embodiments, FIG. 1B also represents modules that are capable of performing the steps for optionally obtaining a sample that can (but need not) include genetic material (e.g., a test genetic material sample) as in 10; a module to optionally purify and/or amplify at least some of any genetic material within the sample as shown in 20; a module to optionally prepare the sample to be run on a SNP
array as shown in 30; a module to optionally determine one or more SNPs in the sample to obtain a sample SNP signature as shown in 40; a module to obtain a SNP signature of a reference population as shown in 50; a module to optionally obtain a SNP
signature of a subject, as shown in 60; a module to determine if there is a direction or bias of an allele count and/or frequency within the sample relative to the reference and/or the subject's signature as shown in 70; a module to optionally analyze the direction or bias to determine a likelihood that the subject's genetic material is in the sample as shown in 80;
and a module to optionally output any of the results from the above to an end user or memory, as in 90. It will be understood, however, that this illustration is merely exemplary and that such modules or components can be executed on a plurality of computing devices, on one or more virtual machines, as stand-alone components, or the like.
[0081] In some embodiments, one also has a module to output any correlation (or lack thereof) between the subject SNP signature and the sample SNP
signature and/or the reference SNP signature to an end user, display, memory, and/or computer readable storage. In some embodiments, this information is output or provided to the subject. In some embodiments, the system comprises an input module, to input one or more SNP
signatures; a processing module, to compare the two or more SNP signatures;
and an output module, to output the comparison. In some embodiments, any one or more of the above modules are executed on one or more computing devices. In addition, methods and functions described herein are not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state.
[0082] While a likelihood determination is one useful way of displaying any present correlation between the genetic material in the test genetic material sample and the subject's genetic material, any other way of displaying the correlation between the subject's genetic material and the test genetic material sample and/or the reference population's genetic material can also be used and output to an end user or memory.
[0083] Appendix A is the computer programming listing appendix referred to above, which is part of this specification. It provides some embodiments of code files usable for executing some embodiments of the processes and/or modules provided herein.
The first code in Appendix A and any other code in Appendix A are nonlimiting examples of the code that can be employed for some of the present embodiments.
The code used in connection with the present invention need not include any or all of the code listed in Appendix A at the end of the specification. Nevertheless, in some embodiments, the computer programming comprises, consists, or consists essentially of the code listed on the first 84 pages of Appendix A.
VARIATIONS ON EMBODIMENTS

[0084] In some embodiments, a method for determining likelihood that a subject contributed genetic material to a test genetic material sample is provided. In some embodiments, one tests whether a POI is in the mixture by assessing the probability that the allele frequency of the mixture is biased towards the POI, as compared to one or more reference populations.
[0085] Methods and functions described herein are not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state.
COMPLEX MIXTURES
[0086] In some embodiments, a complex genetic material mixture (or test genetic material sample) is one that includes genetic material (such as DNA) derived from more than one source. A complex mixture can also contain compounds, the presence of which causes experimental noise that could mask identification in some techniques, such as STR analysis.
[0087] In some embodiments, the invention involves a method of rapidly and sensitively determining whether a trace amount (<1%) of genomic DNA from an individual source is present within a complex DNA mixture.
[0088] In some embodiments, the test genetic material sample includes a compound that would prevent or complicate STR analysis. In some embodiments, the test genetic material sample includes a molecule that degrades nucleic acids. In some embodiments, the test genetic material sample includes proteins and/or enzymes. In some embodiments, the test genetic material sample includes mRNA, RNA, siRNA, and/or DNA.
[0089] In some embodiments, the mixture includes, or is suspected of including genetic material/nucleic acids from more than one human, for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 80, 100, 150, 200, 300, 500, 1000, 10,000 humans or more, including any amount defined between any two of the preceding values or any amount greater than any one of the preceding values.
[0090] In some embodiments, the subject's genetic material in the test genetic material sample is, or is suspected of being the source of less than 100% of the genetic material, for example, less than 100%, 99, 98, 95, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001 percent or less of the sample's genetic material is from the subject, including any amount defined between any two of the preceding values or any amount greater than any one of the preceding values.

SAMPLE PREPARATION

[0091] In some embodiments, while STR analysis might otherwise require additional manipulation of a target for analysis of the sample, a test genetic material sample need only be manipulated enough to allow for the application of the sample onto a SNP array. In some embodiments, one could expect that it would be acceptable to have SNP drop-out due to the large number of SNPs available for testing. That is, if only 10% of 500,000 SNPs are able to give reliable calls, the resulting 50,000 SNPs are more than sufficient to reliably evaluate a mixture. By comparison, if only 2 of 13 STRs are available there is generally little ability to resolve the mixture.
[0092] In some embodiments, a PCR reaction is performed on the genetic material (reference, subject, and/or test genetic material sample). In some embodiments, this can be a simple PCR reaction, although any method that amplifies the desired genetic material can be used. In some embodiments, primers for the amplification reaction are included in or as part of a kit for the present method. The primers can be selected so as to amplify desired sections of the genetic material to selectively amplify the SNPs to be examined. In some embodiments, the same primers can be used on one or more of the samples from the reference, subject, and test genetic material sample to increase the likelihood that the same SNPs are being reviewed.
[0093] In some embodiments, the use of one or more of the methods described herein allows one to reduce the manipulation of the sample (reference, subject, and/or test genetic material sample) prior to examining it to prepare a SNP signature. In some embodiments, impurities that would otherwise complicate a STR analysis are not removed for the SNP analysis.

SOURCES OF GENETIC MATERIAL

[0094] Sources can include human beings, pets, mammals, birds, reptiles, amphibians, other animals, various cell types, algae, slime mold, mollusks, plants, bacteria, viruses, and any other organism that contains genetic material, such as DNA, whether terrestrial or extraterrestrial.

PROBES
[0095] In some embodiments, the SNP probes are selected so as to reduce any undesirable cross-hybridization. In some embodiments, cross-hybridization is addressed by normalizing markers using a quantile normalization approach, and/or by direct measurement of an individual who is homozygous for a given allele. In some embodiments, the probes are random probes. In some embodiments, the probes are those that will hybridize to genetic material that is linked to or similar to standard STR forensics markers. In some embodiments, the probes allow for examination of genetic material that would be examined via restriction fragment length polymorphism, PCR analysis, STR analysis, mitochondrial DNA analysis and/or Y-chromosome analysis. In some embodiments, the probes probe genetic material related to, the same as, or linked to the 13 specific STR regions for CODIS. In some embodiments, the probes reveal information regarding one or more of the following STR loci: D3S1358, vWA, FGA, D8S1179, D21S11, D18S51, D5S818, D13S317, D7S820, CSF1PO, TPOX, TH01, and/or D16S539. In some embodiments, SNPs that are near the above and/or other known STRs are employed. In some embodiments, SNPs that track the above or other known STRs are employed.
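A quantile normalization of probe intensities across arrays could look like the following Python sketch, which forces every array to share the same empirical distribution of intensities. The function name and the list-of-lists input layout are assumptions for illustration, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """samples is a list of equal-length intensity lists, one per array.
    Returns new lists in which the value at each rank is replaced by the
    mean of the values holding that rank across all arrays."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples) for rank in range(n)
    ]
    result = []
    for s in samples:
        order = sorted(range(n), key=lambda j: s[j])  # probe indices by intensity
        normalized = [0.0] * n
        for rank, j in enumerate(order):
            normalized[j] = rank_means[rank]
        result.append(normalized)
    return result
```

After normalization each array contains the same set of values, assigned to probes according to each probe's rank within its own array.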
[0096] In some embodiments, the number and variance of the probes is selected based upon the results presented in Example 1, outlining probe variance, probe number, and the number of people in the mixture.

KITS
[0097] In some embodiments, the devices, parts, subparts, or methods described herein can be combined into a kit for practicing any of the disclosed techniques.
In some embodiments, any of the methods can be provided in written format (such as in a set of instructions), or on a computer readable media. In some embodiments, any of the steps or processes described herein that are capable of being executed by a machine can be provided on a computer readable media. In some embodiments, programming that obtains the various SNP signatures can be provided. In some embodiments, programming that compares the various SNP signatures can be provided (such as executing any of the equations provided herein). In some embodiments, programming that outputs a likelihood that a subject contributed to a test genetic material sample is provided. Any such programming can be on computer readable media and/or downloadable from an online source.
[0098] In some embodiments, the kit includes one or more primers for SNP
amplification. In some embodiments, the SNPs, and thus the primers, are specific for regions useful in forensics. In some embodiments, a large number of SNP
primers are used, for example, more than 100, such as 101, 200, 500, 1000, 2000, 5000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, or more SNPs, including any amount defined between any two of the preceding values and any range greater than any one of the preceding values.
[0099] In some embodiments, the kits include one or more reference SNP
signatures. Such SNP signatures can be stored on computer readable media or downloadable from a website. In some embodiments, the reference populations are identified by groups such that the appropriate reference population can be matched with the subject and/or test genetic material sample. In some embodiments, the kit includes one or more subject SNP signatures. Such SNP signatures can include, for example, the SNP signatures of a selection of convicted felons. In some embodiments, reference SNP
signatures can include general selections from the population. In some embodiments, reference SNP signatures are configured for cell selection, biopsies, or any of the other uses provided herein.
[0100] In some embodiments, the kit includes programming and/or software for executing any one or more of steps 10, 20, 30, 40, 50, 60, 70, 80, and/or 90 in FIG. 1B. In some embodiments, the programming and/or software is in a memory or on a computer readable memory. In some embodiments, the programming and/or software outputs the results of any of the processes in FIG. 1B. This can include outputting any correlation (or lack thereof) between the subject SNP signature and the sample SNP signature and/or the reference SNP signature to an end user, display, memory, and/or computer readable storage.
[0101] In some embodiments, the kit includes a SNP array and ingredients for running a SNP array. In some embodiments the kit includes tools for collecting a forensics sample. In some embodiments, the kits include PCR amplification ingredients.
In some embodiments, the kit includes phi-29 and/or a similar polymerase. In some embodiments, the kits do not include all or any STR analysis ingredients.
VARIOUS APPLICATIONS

[0102] In some embodiments, any of the methods described herein can be applied to determine if a subject's genetic material, such as DNA, matches, is consistent with, or is in a test genetic material sample. In some embodiments, one provides a likelihood that the subject's genetic material is within or the source of the genetic material in the test genetic material sample.
[0103] In some embodiments, any of the methods described herein can be applied to determine whether or not a subject is pregnant. In some embodiments, any of the methods described herein can be applied to determine if a male is the father of an unborn child. In some embodiments, the methods described herein can be applied to determine (including simply determining if the child's genetic material is consistent with) paternity or maternity of a child in comparison to one or more candidate parents. In some embodiments, any of the methods described herein can be applied to determine if there is an unknown person present in the test genetic material sample (in other words, if someone other than or in addition to the subject contributed to the test genetic material sample). In some embodiments, any of the methods described herein can be applied to determine if someone contributed to the test genetic material sample without having to assume or factor in the number of people that may have contributed to the test genetic material sample. In some embodiments, one performs the analysis of the test genetic material sample ignoring and/or without the knowledge and/or without estimating the number of individuals that contributed to a test genetic material sample. In some embodiments, any of the methods described herein can be applied to forensics.
In some embodiments, any of the methods described herein can be applied to determine a percentage or a likelihood that the subject contributed genetic material (or the subject's genetic material is a match) to the test genetic material sample. In some embodiments, any of the methods described herein can be applied to determine or characterize the nature of various cells in a population of cells. This can be useful for sorting or selecting some cells over other cells, or determining the purity of a sample that comprises cells. In some embodiments, any of the methods described herein can be applied on various cells or tissue from a subject. For example, in some embodiments, one can use the methods on a sample from a biopsy and determine if there are malignant vs. benign cells, and/or healthy cells vs. cancerous cells, and/or the type of cancer present in the cells. In embodiments involving numerous cells types, in some embodiments, all or part of the cells can be examined together, instead of having to separate out individual cells. In some embodiments, any of the methods described herein can be applied to determine whether a test genetic material is from a human (and/or which human) in comparison to other nonhuman organisms.
[0104] In some embodiments, the subject SNP signature includes genetic material from (or data representing) multiple individuals. In some embodiments, this can allow for the comparison or screening of multiple individuals against a test genetic material. Thus in some embodiments, the subject SNP is actually one or more subjects to allow for screening one or more subjects against the test genetic material sample.
[0105] In some embodiments, the invention involves a method of identifying trace amounts of an individual's DNA within highly complex mixtures in forensic applications. Such applications include, for example, a situation in which the presence of DNA from numerous other individuals hampers the ability to identify the presence of any single individual. In some embodiments, any of the methods provided herein can be used to analyze genetic material that is degraded or from the mitochondria. The large number of assayed SNPs can allow the partitioning of sets of SNPs for different analyses, such that a small subset of SNPs becomes reserved for detecting these and other artifacts. In some embodiments, the test genetic material sample includes, or is assumed or believed to include genetic material from at least 2 subjects, for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, 500, 1000, or more subjects, including any range defined between any two of the preceding values and any range above any one of the preceding values.
[0106] In some embodiments, one or more advantages of the invention include a focus on the ratio of intensity measures from common biallelic SNPs and more robust scaling in DNA quantity or quality at any given SNP. Additionally, in some embodiments, there is no need to assume a known number of individuals present in the mixture or have equal amounts of DNA from each individual present within the mixture.
Furthermore, in some embodiments, it is easy to discern whether the mixture is closer to a population or towards the individual by utilizing a cumulative distance measure.
Whereas few conclusions can be drawn by a SNP measurement that is slightly biased (less than 1%) towards an individual's genotype, considerable confidence can be gained by statistical analysis of the cumulative aggregate of all measurements across hundreds to millions of SNPs. In some embodiments 1,000-100,000 SNPs are used, including the range of 2,000 to 20,000, and 3,000 to 10,000 and approximately 5,000.
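One way such a cumulative distance measure might be computed is sketched below in Python. For each SNP it takes the difference between the subject's distance to the reference population and the subject's distance to the mixture, and then aggregates the per-SNP values into a one-sample t-type statistic. The specific per-SNP formula, the function name, and the use of allele frequencies in the range 0 to 1 as inputs are assumptions made for this illustration.

```python
import math

def cumulative_distance_statistic(subject, mixture, reference):
    """subject, mixture, and reference are equal-length lists of per-SNP
    allele frequencies. A positive result indicates that the subject lies
    closer to the mixture than to the reference population."""
    d = [abs(y - r) - abs(y - m) for y, m, r in zip(subject, mixture, reference)]
    s = len(d)
    if s < 2:
        raise ValueError("at least two SNPs are needed")
    mean = sum(d) / s
    variance = sum((v - mean) ** 2 for v in d) / (s - 1)
    if variance == 0.0:
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(variance / s)
```

Because each SNP contributes only a small shift, the statistic becomes informative only when aggregated over many SNPs, which is consistent with the cumulative approach described above.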
[0107] In some embodiments, using the genotypes of a given individual, it is possible to detect an individual's presence or absence in any study with available summary statistics.

SNP SIGNATURES

[0108] As noted above, there are a variety of SNP signatures that can be useful in some or all of the disclosed embodiments. In some embodiments, each SNP
signature comprises a collection of information about various SNPs (such as, for example, allele frequencies). In some embodiments, the SNP signature is a collection of SNP
information regarding the subject, reference population, or test genetic material sample.
In some embodiments, the information is expressed as a percentage. In some embodiments, the information is expressed in absolutes (e.g., presence or absence of a specific allele). In some embodiments, the SNP signature is expressed in terms of raw data that represents the alleles at the SNP. For example, in some embodiments, the SNP
signature can be a fluorescence readout from a SNP array, which indicates which SNPs are present.
[0109] As will be appreciated by one of skill in the art, the size of a SNP
signature (the number of SNPs that make it up) can vary based on how it is to be used. In some embodiments, where one is looking to see if an unknown person contributed to a test genetic material sample, relatively few SNPs are employed as any single unknown SNP present in the test genetic material sample can indicate the presence of an unknown person. In addition, in embodiments in which a lower number of people contributed (or may have contributed) to the genetic material in the test genetic material sample, fewer SNPs will be used than in situations in which a large number of people contributed to the TGMS (test genetic material sample).
[0110] In addition, the number of SNPs used in any one signature can also determine the degree of certainty that one has that the subject contributed to the TGMS.
Thus, in embodiments, where a high degree of certainty is not required, fewer SNPs can be used. In embodiments where a higher degree of certainty is desired, more SNPs can be employed in the SNP signatures.
[0111] In some embodiments, there are enough SNP probes so that the degree of certainty that the person contributed to the test genetic material sample is 1 in at least any of the following: 1000, 10,000, 100,000, 1,000,000, 10,000,000, 100,000,000, 1,000,000,000, 5,000,000,000, or more.
[0112] In addition, in embodiments where one is only looking for the contribution of an unknown individual in a TGMS, as little as a single SNP can be used (assuming, for example, that none of the knowns have that specific SNP).
[0113] Thus, in some embodiments, as little as 1 SNP can be used, although many more can also be used. In some embodiments all of the SNPs in a subject are used.
In some embodiments, all the SNPs across multiple subjects are used. In some embodiments, SNPs from various organisms or cells (such as various cancer cells) are used.
[0114] As will be appreciated by one of skill in the art, while the SNPs used in the various SNP signatures should overlap (that is the same SNPs should be in the sample SNP signature, the reference SNP signature and the subject's SNP signature), not all of the SNPs need to be present in all of the signatures. Thus, the number and identity of SNPs can be different across the different signatures. In some embodiments, the lowest number of SNPs is found in the subject's SNP signature.
[0115] In some embodiments, the SNP signature is at least one SNP. In some embodiments the SNP signature includes more than one SNP, for example 1, 5, 10, 15, 20, 100, 200, 300, 500, 1000, 2000, 3000, 5000, 9,000, 10,000, 15,000, 20,000, 30,000, 40,000, 50,000, 80,000, 90,000, 100,000 SNPs or more, including any amount defined between any two of the preceding values and any amount greater than any one of the preceding numbers.
[0116] A SNP signature can include one or more genotypes of one or more organisms (or cell types, etc.) across any number of individuals. As noted above, some SNP signatures include SNP information for 50,000 or more SNPs for tens, hundreds or more people. Other SNP signatures only include SNP information for a single person, across numerous SNPs, while yet other SNP signatures include SNP information for a single person and as little as a single SNP. Unless noted otherwise, any of the SNP signatures (sample SNP signature, reference SNP signature, subject's SNP signature) can vary in the manner noted above.
[0117] As noted above, the SNP signature does not have to be a compilation of mathematical values of the allele frequencies in all embodiments. For example, raw data showing intensity values for the various SNP probes (and thus representing what alleles are present) can be used. Similarly, the frequencies can be examined one at a time, and thus, a massive table of frequencies need not be compared to another massive table of frequencies. In some embodiments, the SNP signature merely represents or correlates to the allele information such that comparisons (mathematical, visual, or otherwise), can be consistently made between the subject and the sample and/or the reference population.
Of course, in embodiments that do not employ SNPs, the consistency of the SNP is not relevant, but the consistency of the other item being monitored will be.

ANALYTICAL METHODS AND HOW SNP SIGNATURES CAN BE
COMPARED

[0118] In some embodiments, the invention involves the use of any analytical methods that can be used to resolve complex mixtures. In some embodiments, the analytical method used can depend on the objective of the analysis. Non-limiting examples include an assumption that the SNPs on the array are independent from one another, and an assumption that multiple SNPs on the arrays are correlated and are not independent (especially in the case of increasing microarray density). Further examples include using population databases such as from the HapMap Project to select a subset of independent markers to be used in the analysis, the use of haplotype-based methods or Linkage Disequilibrium (LD) methods to combine information from correlated SNPs, the use of a Bayesian method to select the most informative SNPs derived from a training dataset, and the use of explicit redundancy in correlated markers.
[0119] In some embodiments, any method that allows for using numerous (e.g., thousands of) low-information content markers to make a cumulative decision about whether a person is, or is not, (or an unknown person is) in a mixture can be employed.
In some embodiments, one can use a likelihood approach, a Wilcoxon signed-rank test, a least-squares fit, a t-test, Pearson correlation, Spearman rank correlation and/or a test of proportions. In some embodiments, any method that allows for using hundreds to thousands of measurements of genetic variants can be employed for the methods described herein.
[0120] As will be appreciated by one of skill in the art, there are a variety of ways of comparing the SNP signatures. While SNP signatures are not required for all of the embodiments described herein, when they are used, they can be compared in a variety of ways. In some embodiments, any comparison, as long as it allows one to determine direction or bias of an allele count and/or frequency within the test genetic material sample relative to an allele count and/or frequency of the reference and an allele count and/or frequency in a subject, can be used. In some embodiments, any of the computational methods disclosed herein can be employed for this. In some embodiments, such as when the SNP signature is shown in terms of raw data or a data readout (such as a fluorescence readout on a SNP array), it can be possible to use the data regarding the SNPs itself in the comparisons. Thus, while allele frequencies expressed as percentages can be used in some embodiments, in some embodiments, the SNP data itself is used in the comparisons.
[0121] Some embodiments of the invention further encompass software that implements any of the methods and/or steps and/or processes described herein.
Pre-compiled UNIX binaries are available for a software implementation of some embodiments of the method and can be found in the attached Appendix A. In some embodiments, the software can run its analysis using raw data from either Affymetrix or Illumina or by using genotype calls. In some embodiments, the software is also able to normalize the test statistic using the reference population and/or adjust the mean test statistic using a specified individual. In some embodiments, the user can restrict the SNPs considered to a subset of the total available SNPs. For raw input data one can match the distribution of signal intensities for each raw data file to that of the mixture input file (see platform specific analysis). In some embodiments, multiple test statistics and distance calculations are implemented including the noted test statistic, Pearson correlation, Spearman rank correlation and/or Wilcoxon sign test. In some embodiments, the software is configured to determine direction or bias of an allele count and/or frequency within the test genetic material sample relative to an allele count and/or frequency of the reference and an allele count and/or frequency in a subject.

REFERENCE POPULATIONS AND REFERENCE SIGNATURES

[0122] Ancestry and Reference Populations. In some embodiments, one possible assumption of some of the embodiments described herein is that the reference population (and reference SNP signature) should either (a) be accurately matched in terms of ancestral composition to the mixture and person of interest or (b) be limited to analysis of SNPs with minimal (or known) bias towards ancestry. In some embodiments, it is useful to recognize that any single SNP will have a small effect on the overall test-statistic.
Moreover, it is realistic that ancestry of the reference population could be determined by analysis of a small subset of SNPs, followed by analysis of a person's contribution to the mixture with a separate set of SNPs (recognizing that nearly 500,000 SNPs are assayed).
[0123] In some embodiments, mismatching ancestry can be accounted for by normalizing the test-statistic using a second reference population matched to the individual of interest, obtaining the normalized test-statistic S(Y). If the reference population of the mixture is mismatched, the reference population of the individual of interest will nonetheless normalize the results. Unlike the reference population of the mixture, the individual of interest's reference population is matched to the individual of interest's ancestry or population substructure and thus serves as an anchor for the distribution of T(Y_i). Thus one can compute a p-value for observing the result Y_i or more extreme for individual Y_i, assuming the reference populations for both the mixture and individual of interest are inferred correctly. Additionally, in some embodiments, when matching a reference population to the individual of interest, one can choose the reference population test-statistic mean, mean(Tpop), as that of a close relative in order to normalize for interesting familial relationships or other considerations. One could also choose to estimate the subject's reference population test-statistic standard deviation sd(Tpop) from a heterogeneous population to give a conservative overestimate of the true standard deviation of the test statistic T(Y_i). In some embodiments, the reference population matched to the subject accounts for error in selecting the reference population of the mixture.
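By way of a non-limiting illustration, the normalization described above can be sketched as follows. The sketch assumes that the normalized test-statistic takes a z-score form, (T(Y_i) - mean(Tpop)) / sd(Tpop), and that the p-value is read from the upper tail of a standard normal distribution; both are assumptions made for illustration. The function names and the numerical values are hypothetical.

```python
import math
import statistics

def normalized_statistic(t_subject, t_reference):
    """Assumed z-score form: (T(Y_i) - mean(Tpop)) / sd(Tpop)."""
    mean_pop = statistics.mean(t_reference)
    sd_pop = statistics.stdev(t_reference)
    return (t_subject - mean_pop) / sd_pop

def upper_tail_p_value(s):
    """One-sided p-value for a value of s or more extreme under N(0, 1)."""
    return 0.5 * math.erfc(s / math.sqrt(2.0))

# Hypothetical test statistics for members of the subject-matched
# reference population (individuals assumed not to be in the mixture).
t_pop = [-1.2, 0.4, 0.1, -0.3, 0.9, -0.6, 0.2, 0.5]

s = normalized_statistic(6.0, t_pop)   # 6.0 is a hypothetical T(Y_i)
p = upper_tail_p_value(s)
```

Estimating sd(Tpop) from a more heterogeneous population, as noted above, would enlarge the denominator and therefore give a more conservative value.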
[0124] In some embodiments, the reference population is ascertained by using ancestral informative markers that are non-redundant with markers used for detecting if a person is in a mixture. In some embodiments, the reference population is ascertained by using multiple reference groups to ascertain a genetic distance. In some embodiments, the reference population is ascertained by adding individuals selected from a database of SNP calls for many individuals to effectively make a 'reference population' matched to ancestrally informative markers. In some embodiments, the reference population is obtained by collecting the SNPs of various suspects, which can optionally include the person of interest. In some embodiments, the reference population is obtained from an individual, such as a cancer patient or candidate that desires to see if she is pregnant. In some embodiments, the reference population is a family or part thereof. In some embodiments, the reference population has no bias. In some embodiments, the reference population has a minimal bias measured by a genetic distance, genomic control, and which can be obtained using a subset of the SNPs not utilized for resolving within the mixture and not in linkage disequilibrium with any SNPs used in the analysis.
In some embodiments, the reference population has a bias, but it is a known bias.
[0125] In some embodiments, the reference population is generally matched to the mixture at the SNPs being interrogated. In some embodiments, one can minimize variability by only utilizing SNPs with small differences (such as measured by low Fst) between cohorts. In some embodiments, one can also use a subset of several thousand SNPs to determine and match the approximate make up of a reference by essentially selecting individuals who have the shortest genetic distance to the mixture.
High-information content SNPs can be used because they will be sensitive to different ancestral populations. In some embodiments, these SNPs are independent of those SNPs used to identify a person, and thus could be restricted to one particular population.
In some embodiments, multiple references can be used and built into an overall likelihood statistic where a posterior probability is calculated.
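As a non-limiting sketch of selecting reference individuals having the shortest genetic distance to the mixture, the following assumes a Euclidean distance over allele frequencies at a small set of ancestry markers. The distance metric, the function names, the individual labels, and the values are all hypothetical choices made for illustration.

```python
import math

def genetic_distance(freqs_a, freqs_b):
    """Euclidean distance between two allele-frequency vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(freqs_a, freqs_b)))

def closest_individuals(mixture_freqs, candidates, k):
    """Return the names of the k candidates nearest to the mixture."""
    ranked = sorted(
        candidates,
        key=lambda name: genetic_distance(candidates[name], mixture_freqs),
    )
    return ranked[:k]

# Hypothetical allele frequencies at four ancestry markers.
mixture = [0.62, 0.31, 0.80, 0.12]
database = {
    "ind1": [0.5, 0.5, 1.0, 0.0],
    "ind2": [0.0, 1.0, 0.0, 1.0],
    "ind3": [1.0, 0.0, 1.0, 0.0],
    "ind4": [0.5, 0.0, 0.5, 0.5],
}

reference = closest_individuals(mixture, database, k=2)
```

In such a sketch the markers used for matching would be kept separate from the SNPs used to test whether a person is in the mixture.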
[0126] In some embodiments, a large number of SNPs can have a correlation between each other, forcing the distribution to deviate from a normal distribution. In some embodiments, one can sample the distribution by computationally adding individuals known not to be in the mixture to the dataset and determining where along the test-statistic they fall. In some embodiments, additional methods, such as using correction for these correlations, can also be used, such as linkage disequilibrium measurements as obtained through the HapMap project.
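Sampling the distribution with individuals known not to be in the mixture can be sketched as an empirical p-value. This is a minimal illustration only; the add-one correction used here is a common convention that is assumed rather than taken from the disclosure, and the function name and values are hypothetical.

```python
def empirical_p_value(observed, null_statistics):
    """Fraction of null statistics at least as large as the observed one.

    The +1 terms keep the estimate away from exactly zero (assumed convention).
    """
    exceed = sum(1 for t in null_statistics if t >= observed)
    return (exceed + 1) / (len(null_statistics) + 1)

# Hypothetical statistics for individuals known not to be in the mixture.
null_t = [0.1, -0.5, 1.2, 0.7]
p_emp = empirical_p_value(3.0, null_t)
```

The smallest p-value such sampling can report is limited by the number of individuals added, so very small significance levels require a correspondingly large set of non-contributors.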
[0127] In some embodiments, the reference population comprises genetic material from one or more organisms, viruses, cell types, etc. For example, in some embodiments, the reference population can include 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 50,000, 100,000, 500,000, 1,000,000, 5,000,000, 10,000,000, 100,000,000, 1,000,000,000, 5,000,000,000 or more different sources of genetic material.
[0128] In some embodiments, more than one reference and/or reference population and/or reference population signature can be employed by extending to a multiple dimensional test-statistic or distance measure.
COMPUTATIONAL ASPECTS

[0129] While the present disclosure outlines the various methods in terms of processes, one of skill in the art will appreciate that any and/or all of the process/steps disclosed herein can be performed on a device. In some embodiments, the device is a computer with relevant software to perform one or more of the processes outlined herein.
In some embodiments, the steps and processes disclosed herein can be implemented using combinations of one or more computing devices, such as webservers or peer-to-peer clients. For example, the steps or processes can be performed on a single computing device, or, alternatively, a single step or process, such as 70, or a combination of steps or processes, such as 10-90, 10-70, 20-70, 30-70, 40-70, 50-70, 60 & 70, 70 & 40, 70 & 60, and/or 70 & 90, can be implemented on a computing device in communication with other computing devices that perform other steps or combinations of steps.
[0130] The systems, methods, and techniques described here can be implemented in computer hardware, firmware, software, or in combinations of them. A
system embodying these techniques can include appropriate input and output components, a computer processor, and a computer program product tangibly embodied in a machine-readable storage component or medium for execution by a programmable processor. A process embodying these techniques can be performed by a programmable processor executing a program of instructions to perform desired functions by operating on input data and generating appropriate output. In some embodiments, the techniques can advantageously be implemented in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input component, and at least one output component.
Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Storage components suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory components, such as Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and flash memory components; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and Compact Disc Read-Only Memory (CD-ROM disks). Any of the foregoing can be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits).
[0131] In some embodiments, the entire process, from SNP analysis to final output of a likelihood that a subject's genetic material is in a test genetic material sample, is automated and/or computerized. In some embodiments, any of the results from steps 10-90 are output to an end user and/or a memory. In some embodiments, any 1, 2, 3, 4, 5, 6, 7, 8 or 9 processes outlined in FIG. 1B are performed and/or output via a computer. In some embodiments, a computer prepares one or more SNP signatures and a person can make the comparison between the SNP signatures. In some embodiments, a first computer can prepare one or more of the SNP signatures, a second computer can prepare a different SNP signature, and a third computer can compare the different SNP signatures. In some embodiments, the SNP signatures are standardized and contained in a memory system, CD, DVD, or other storage device. In some embodiments, such stored or standardized SNP signatures are for reference SNP signatures, subject SNP signatures, and/or sample SNP signatures. In some embodiments, the software and/or hardware is configured to detect various markers of various SNPs, develop the various SNP signatures (e.g., subject's SNP signature, test genetic material SNP signature and reference population SNP signature) and compare the SNP signatures.
[0132] In some embodiments, programming is provided that allows for the analysis of a SNP array. In some embodiments the analysis comprises data regarding fluorescence at various locations on the array or fluorescence generally. In some embodiments, the programming allows for the comparison of a first SNP array (such as a subject SNP signature array) with a) a second SNP array (such as a reference SNP signature array) and/or b) a third SNP array (such as a sample SNP signature array).
[0133] In some embodiments, one or more of the steps in FIG. 1B are performed by different users and/or devices. In some embodiments, the computer, device, memory, etc., comprises programming to allow for direction or bias of an allele count or frequency within a mixture relative to a reference and an individual of interest to be determined. In some embodiments, the computer, device, memory, etc., employs one or more of the formulas provided herein.
[0134] In some embodiments, the systems and methods described herein can advantageously be implemented using computer software, hardware, firmware, or any combination of software, hardware, and firmware. In one embodiment, the system is implemented as a number of software modules that comprise computer executable code for performing the functions described herein. In certain embodiments, the computer-executable code is executed on one or more general purpose computers. However, a skilled artisan will appreciate, in light of this disclosure, that any module that can be implemented using software to be executed on a general purpose computer can also be implemented using a different combination of hardware, software or firmware.
For example, such a module can be implemented completely in hardware using a combination of integrated circuits. Alternatively or additionally, such a module can be implemented completely or partially using specialized computers designed to perform the particular functions described herein rather than by general purpose computers.
[0135] Some embodiments of the invention are described with reference to methods, apparatus (systems) and computer program products that can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the acts specified herein to transform data from a first state to a second state.
[0136] These computer program instructions can be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the acts specified herein.
[0137] The computer program instructions can also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the acts specified herein.
[0138] In some embodiments, the invention further encompasses the use of a library of Y_i arithmetic means derived from AA, AB, and BB to map genotype calls to expected Y_i values for each SNP from individually genotyped samples.
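A library of this kind can be sketched, under stated assumptions, as a lookup keyed by SNP and genotype call whose entries are the arithmetic means of values measured on individually genotyped samples. The data layout, the function names, the SNP identifier "rs1", and the measured values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def build_library(observations):
    """Average measured allele-frequency values per (SNP, genotype call).

    observations: iterable of (snp_id, genotype, measured_value) tuples
    taken from individually genotyped samples.
    """
    grouped = defaultdict(list)
    for snp_id, genotype, value in observations:
        grouped[(snp_id, genotype)].append(value)
    return {key: mean(values) for key, values in grouped.items()}

def expected_value(library, snp_id, genotype):
    """Map a genotype call (AA, AB or BB) to its expected value at that SNP."""
    return library[(snp_id, genotype)]

# Hypothetical measurements for a single SNP.
obs = [
    ("rs1", "AA", 0.95), ("rs1", "AA", 0.97),
    ("rs1", "AB", 0.52), ("rs1", "AB", 0.48),
    ("rs1", "BB", 0.04), ("rs1", "BB", 0.06),
]
library = build_library(obs)
```

Using per-SNP means rather than the ideal values 1, 0.5 and 0 lets genotype calls be compared with raw mixture measurements that carry probe-specific bias.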
[0139] As noted herein high-density SNP genotyping data was used to resolve complex mixtures. In one embodiment, the method comprises the construction of a series of simulations to evaluate the theoretical limits of resolving an individual within a mixture using the described analytical framework and given characteristics of current generation SNP genotyping microarrays. In some embodiments, the method further comprises experimentally testing the feasibility of detecting if an individual is contributing trace amounts of DNA to highly complex mixtures. Within these simulations and experimental tests, particular focus was given (for some of the embodiments) on complex mixtures - those containing hundreds or thousands of individuals. Such approaches have utility in resolving a mixture of DNA from common surfaces where many individuals have left DNA.
[0140] As demonstrated through proof of principle experiments below, to resolve mixtures where the person of interest is less than 1% of the total mixture, conservatively 25,000 SNPs can be sufficient to achieve a p-value of less than 10^-6. If one were to use all the available SNPs, one can easily resolve mixtures where the person of interest is less than 0.1% of the total mixture to achieve a p-value of less than 10^-6.
[0141] In some embodiments, the invention involves a cumulative analysis of shifts in allele probe intensities in the direction of the individual's genotype. In some embodiments, the invention involves a method of measuring the difference between the distance of the individual from a reference population and the distance of an individual from the mixture. In some embodiments, one advantage the invention holds over other methods in the field is that the method does not require knowledge of the number of individuals in the mixture and is capable of discriminating an individual source from a mixture comprising over one thousand sources.
[0142] The above discussion and Example 1 provide an explanation of some of the embodiments with modifications in response to various factors including homogeneity of the mixture and accuracy of the reference populations.
[0143] The following examples are offered for illustrative purposes only, and are not intended to limit the scope of the present invention in any way.
Indeed, various modifications of the invention in addition to those shown and described herein will become apparent to those skilled in the art from the foregoing description and fall within the scope of the appended claims.

EXAMPLE I
[0144] Complex Mixture Constructions. A total of 8 complex mixtures were constructed (see Table 1). Concentrations of all DNA samples were checked in triplicate using the Quant-iT PicoGreen dsDNA Assay Kit by Invitrogen (Carlsbad, CA). An eight point standard curve was prepared using Human Genomic DNA from Roche Diagnostics (Cat#: 11691112001, Indianapolis, IN). The median concentrations were calculated for each individual DNA sample.

Table 1

Name | Description | Illumina 550K | Affymetrix 5.0 | Illumina 450S
Mixture A | Equimolar pool. Equimolar mixture of 41 CEU individuals (14 trios minus one individual) | Yes | No | Yes
Mixture B | Equimolar pool. Equimolar mixture of 47 CEU individuals (16 trios minus one individual) | Yes | No | Yes
Mixture C | 2-person mixture. 90% one CEU individual, 10% a second CEU individual | Yes | No | Yes
Mixture D | 2-person mixture. 99% one CEU individual, 1% a second CEU individual | Yes | No | Yes
Mixture E | Complex mixture. Mixture with 184 individuals at ~0.2% each, and 41 individuals from Mixture A at ~1% each | Yes | No | No
Mixture F | Complex mixture. Mixture with 184 individuals at ~0.2% each, and 47 individuals from Mixture B at ~1% each | Yes | No | Yes
Mixture G | Complex mixture. Mixture with 184 individuals at ~0.2% each, and 41 individuals from Mixture B at ~0.1% each | No | Yes | No
Mixture H | Complex mixture. Mixture with 184 individuals at ~0.5% each, and 47 individuals from Mixture B at ~0.1% each | No | Yes | No
[0145] Mixtures A1, A2, B1, and B2: Equimolar mixtures of HapMap individuals. As shown in Table 1, two main mixtures (mixtures A and B) were composed in duplicate, resulting in a total of 4 mixtures. Mixture A was composed of 41 HapMap CEU individuals (14 trios minus one individual) and mixture B was composed of 47 HapMap CEU individuals (16 trios minus one individual).
[0146] Mixture C1: 90% NA12752 and 10% NA07048. Two CEU males were combined in a single mixture so that one individual (NA12752) contributed 90% (675ng) of the DNA in the mixture, while the other individual (NA07048) contributed 10% (75ng) DNA into the mixture by concentration.
[0147] Mixture C2: 90% NA10839 and 10% NA07048. Two CEU individuals, a female and a male, were combined in a single mixture so that one individual (NA10839) contributed 90% (675ng) of the DNA in the mixture, while the other individual (NA07048) contributed 10% (75ng) DNA into the mixture by concentration.
[0148] Mixture D1: 99% NA12752 and 1% NA07048. Two CEU males were combined in a single mixture so that one individual (NA12752) contributed 99% (742.5ng) of the DNA in the mixture, while the other individual (NA07048) contributed 1% (7.5ng) DNA into the mixture by concentration.
[0149] Mixture D2: 99% NA10839 and 1% NA07048. Two CEU individuals, a female and a male, were combined in a single mixture so that one individual (NA10839) contributed 99% (742.5ng) of the DNA in the mixture, while the other individual (NA07048) contributed 1% (7.5ng) DNA into the mixture by concentration.
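The DNA masses quoted for the two-person mixtures are consistent with a total of 750ng per mixture, which is inferred here from 675ng + 75ng and 742.5ng + 7.5ng. A minimal arithmetic sketch follows; the function name is hypothetical.

```python
def component_masses(total_ng, major_fraction):
    """Split a total DNA mass between a major and a minor contributor."""
    major = total_ng * major_fraction
    minor = total_ng - major
    return major, minor

TOTAL_NG = 750.0  # inferred from the masses stated for Mixtures C and D

c_major, c_minor = component_masses(TOTAL_NG, 0.90)  # Mixtures C1 and C2
d_major, d_minor = component_masses(TOTAL_NG, 0.99)  # Mixtures D1 and D2
```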
[0150] Mixture E: 50% Mixture A1 and 50% Mixture of 184 equimolar Caucasians. Two mixtures were combined into a single mixture so that each of the original mixtures contributed the same amount of genomic DNA by volume into the final mixture. The CAU2 mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture A1 was constructed as above and contained 41 CEU individuals.
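The approximate per-individual contributions in a combined mixture such as Mixture E follow from dividing each pool's share by its number of individuals. The sketch below assumes that each pool is equimolar and that the two pools have equal DNA concentration, so that equal volumes give equal shares; the function name is hypothetical.

```python
def per_individual_fraction(pool_fraction, pool_size):
    """Fraction of the final mixture contributed by each member of a pool."""
    return pool_fraction / pool_size

# Mixture E: half from the 184-person pool, half from the 41-person Mixture A1.
cau_each = per_individual_fraction(0.5, 184)  # roughly 0.27% each
a1_each = per_individual_fraction(0.5, 41)    # roughly 1.2% each
```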
[0151] Mixture F: 50% Mixture B2 and 50% Mixture of 184 equimolar Caucasians. Two mixtures were combined into a single mixture so that each mixture contributed the same amount of genomic DNA by volume into the final mixture. The Caucasian mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture B2 was constructed as above.
[0152] Mixture G: 5% Mixture A2 and 95% Mixture of 184 equimolar Caucasians. Two mixtures were combined into a single mixture with Mixture A2 comprising 5% of the mixture and the CAU3 mixture comprising 95% of the mixture. The CAU3 mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture A2 was constructed as above.
[0153] Mixture H: 5% Mixture B1 and 95% Mixture of 184 equimolar Caucasians. Two mixtures were combined into a single mixture with Mixture B1 comprising 5% of the mixture and the CAU2 mixture comprising 95% of the mixture. The CAU2 mixture contained 184 Caucasian control individuals obtained from the Coriell Cell Repository. Mixture B1 was constructed as above.
[0154] Genotyping. Four cohorts were assayed on the Illumina (San Diego, CA) HumanHap550 Genotyping BeadChip v3, one cohort was assayed on the Illumina (San Diego) HumanHap450S Duo, and three cohorts were assayed on the Affymetrix (Emeryville, CA) Genome-Wide Human SNP 5.0 array, with each cohort being assayed on a single chip. Probe intensity values were extracted for analysis from the file folders generated by the BeadScan software for the Illumina platform, and from Affymetrix GTYPE 4.008 software for the Affymetrix data, as described in previous studies (See Pearson, J.V. et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet 80, 126-139 (2007)).
[0155] Platform specific analysis. With the Affymetrix platform, the genotypes were used for each individual, and results similar to those with the Illumina platform were found. Additionally, the raw CEL files from the HapMap dataset (See The International HapMap Project. Nature 426, 789-796 (2003)), found on the world wide web at HapMap.org, were used. To overcome the differences in the distribution of signal intensity between CEL files, the distribution of the signal intensities was matched to the distribution of the mixture's CEL file. This was achieved by ordering allele frequencies on a given chip (and allele frequencies in the mixture). The ith allele frequencies from the mixture of interest were substituted for the ith allele frequencies of the given chip.
Without this adjustment, there was difficulty resolving any individual in any mixture due to the fact that off-target cross-hybridization was not accounted for. In some embodiments, this type of adjustment is the preferred type of normalization method when raw data is available for the mixture, person of interest, and reference population.
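The rank-based substitution described above, in which the ith ordered value from the mixture replaces the ith ordered value on the given chip, can be sketched as follows. The sketch assumes that both arrays report values for the same number of SNPs; the function name and values are hypothetical.

```python
def match_distribution(chip_values, mixture_values):
    """Replace the i-th smallest chip value with the i-th smallest mixture value.

    The rank order of SNPs on the chip is preserved, while the overall
    distribution of values becomes identical to that of the mixture.
    """
    if len(chip_values) != len(mixture_values):
        raise ValueError("both arrays must cover the same number of SNPs")
    order = sorted(range(len(chip_values)), key=chip_values.__getitem__)
    target = sorted(mixture_values)
    matched = [0.0] * len(chip_values)
    for rank, index in enumerate(order):
        matched[index] = target[rank]
    return matched

# Hypothetical allele-frequency estimates at four SNPs.
chip = [0.10, 0.90, 0.55, 0.30]
mix = [0.45, 0.20, 0.70, 0.60]
adjusted = match_distribution(chip, mix)
```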
[0156] With the Illumina platform, the genotypes from the HapMap dataset (See The International HapMap Project. Nature 426, 789-796 (2003)) were used for both the person of interest and the reference populations instead of raw intensity values as had been done with the Affymetrix platform. For the mixture, the raw intensity values were used. This set of data mimics the case where raw data may not be available but genotype calls are available. Reduction in errors between different microarrays was achieved by normalizing each microarray by dividing by the mean channel intensity from each respective channel. This was performed on the raw data from the mixture. This platform specific adjustment may not be needed when the raw data of a person's genotype is present on the same platform. In the Illumina specific example, the calls from the HapMap were utilized without having platform specific genotype data.
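The per-channel normalization can be sketched as dividing every raw intensity by the mean intensity of its own channel. This is a minimal illustration assuming a two-channel array; the function name and the intensity values are hypothetical.

```python
from statistics import mean

def normalize_channels(channel_a, channel_b):
    """Divide each raw intensity by the mean intensity of its channel."""
    mean_a, mean_b = mean(channel_a), mean(channel_b)
    norm_a = [value / mean_a for value in channel_a]
    norm_b = [value / mean_b for value in channel_b]
    return norm_a, norm_b

# Hypothetical raw intensities for three SNPs on one microarray.
raw_a = [1200.0, 300.0, 900.0]
raw_b = [200.0, 1000.0, 600.0]
norm_a, norm_b = normalize_channels(raw_a, raw_b)
```

After this step each channel has a mean of one, so arrays scanned at different overall brightness become comparable.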
[0157] Simulation. Simulation was used to test the efficacy of using high-density SNP genotyping data in resolving mixtures. The relevant variables of the simulation are: the number of SNPs s, the fraction f of the total DNA mixture contributed by the person of interest Y_i, and the variance or noise inherent to assay probes vp. In the simulations, theoretical mixtures were composed by randomly sampling individuals from the 58C Wellcome Trust Case-Control Consortium (WTCCC) dataset (See Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-678 (2007)). After removing duplicates, relatives and other data anomalies, a total of 1423 individuals remained. The genotype calls for these individuals were provided by the WTCCC; the individuals were previously genotyped on the Affymetrix platform. Within each simulation, N individuals were randomly chosen to be equally represented in the mixture, and the mean allele frequency (Y_j) of the mixture was then computed for each SNP. SNPs j with an observed Y below 0.05 or above 0.95 in the reference population were removed due to their potential for having false positives and low inherent information content.
[0158] A microarray was simulated that would contain a mean of 16 probes for simplicity, approximating the mean number of probes found on the Illumina 550K, Illumina 450S Duo and Affymetrix 5.0 platforms (18.5, 14.5 and 4 respectively). For each SNP j, Gaussian noise based on the previously measured probe variance was added to the Y_j of each probe. When fixed, probe variance was set to 0.006 when simulating Affymetrix 5.0 arrays, and to 0.001 for both Illumina 550K and Illumina 450S Duo arrays. The allele frequency of the mixture was then calculated as the mean of these probe values. A mixture size of N is equivalent to saying that an individual's DNA represents f=1/N of the total DNA in the mixture. Equimolar mixtures ranging from 10 individuals to 1,000 individuals were tested. Using this design, each individual was tested for their presence where they contributed between 10% and 0.1% genomic DNA to the total mixture. To obtain significance levels (p-values) to test the null hypothesis, the normal distribution was sampled. There were not enough samples to test the tail of the distribution, and therefore the smallest p-values (e.g. below 10^-6) are not completely accurate. Nonetheless, p-values are expected to be sufficiently accurate to qualitatively assess the limits of the method.
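The two simulation steps described above (composing an equimolar mixture, then averaging noisy probe readings) can be sketched as follows. This is a minimal sketch under stated assumptions: genotypes are encoded as allele frequencies of 0.0, 0.5 or 1.0 per individual, noise is drawn independently per probe, and the function names are hypothetical.

```python
import random


def mixture_frequency(genotypes):
    """Mean allele frequency per SNP for equally represented individuals.

    genotypes: one list per individual, each holding one value per SNP
    (assumed encoding: 0.0, 0.5 or 1.0).
    """
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    return [
        sum(individual[j] for individual in genotypes) / n_individuals
        for j in range(n_snps)
    ]


def noisy_estimate(true_freq, probe_variance, n_probes=16, rng=random):
    """Average n_probes readings of one SNP, each with added Gaussian noise."""
    sd = probe_variance ** 0.5  # gauss() takes a standard deviation
    readings = [true_freq + rng.gauss(0.0, sd) for _ in range(n_probes)]
    return sum(readings) / n_probes
```

With a probe variance of 0.001 and 16 probes, the variance of the averaged estimate is 0.001/16, which illustrates why replicate probes limit the effect of probe noise.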
[0159] Joint adjustment of mixture fraction (f) and number of SNPs (s). The trade-off between the number of SNPs considered and the fraction of the DNA mixture belonging to the person of interest was tested. One expects greater ability to resolve individuals from a mixture when more SNPs are used in the calculation, though the absolute limits of detection are ultimately determined by the genetic variation of the population. A variance (vp) of 0.001 was assumed for the estimated allele frequency of each probe, which closely follows the observed variance (0.00158) of the Illumina 550K platform across multiple arrays in other genotyping studies. FIG. 2a shows 10,000 simulations ranging from s=10 to s=500,000 and f=0.1 to f=0.001, where the Z-axis is the p-value. With 10,000 to 25,000 SNPs it was possible to resolve mixtures where the person of interest was less than 1% of the total mixture at a p-value of less than 10^-6. The shading of the p-values for FIG. 2a is noted in the bar beneath the graph. Dark grey is present primarily on the lower and left-hand side, followed by a band of white (as one moves upward and to the right), followed by an area of grey.
[0160] Joint adjustment of probe variance (vp) and mixture fraction (f). In these simulations, it was assumed that there were 50,000 SNPs on each microarray (s=50,000). While conceivably a much greater number of SNPs could be used, the lower number of SNPs would be more realistic in a setting where preference has been given to SNPs whose allele frequencies minimally vary across different populations. FIG. 2b shows 10,000 simulations from vp=0.0001 to vp=0.01 and f=0.1 to f=0.001. It is clear that with a small amount of probe variance one is able to resolve an individual who comprises one-thousandth of a mixture. If the probe variance is below 0.001 one can easily resolve an individual whose DNA comprises 10% to 0.1% of the mixture. Even with increasing noise, one is still able to resolve mixtures where the person of interest contributes less than 2.5% with a p-value of less than 10^-6. One can also observe that the probe variance does not have a large impact on the p-value, and in this case the fraction of the mixture is the important factor when the number of SNPs is fixed. The shading of the p-values for FIG. 2b is noted in the bar beneath the graph. Dark grey is present primarily on the lower and right-hand side, followed by a band of white (as one moves left and upward across the graph), followed by an area of grey.
[0161] Joint adjustment of number of SNPs (s) and probe variance (vp). Finally the trade-off between the number of SNPs and the probe variance was examined. It was assumed that the person of interest contributes 1% to the mixture (f=0.01). FIG. 2c shows 10,000 simulations from s=10 to s=500,000 and vp=0.0001 to vp=0.01. The probe variance has little effect on the significance of the test. Consequently, it would be sufficient to use 50,000 SNPs, even with very high levels of noise, to resolve mixtures of sizes up to 100. Within simulations, the number of probes is fixed to be 16, and thus the noise does not affect the allele frequency estimate as it would with arrays using 4 probes. The shading of the p-values for FIG. 2c is noted in the bar beneath the graph. Dark grey is present primarily on the left-hand side, followed by a band of white (as one moves to the right), followed by an area of grey.
[0162] Equimolar mixtures versus two person mixtures. The same three simulation designs were performed using mixtures that included two individuals. Instead of N=1/f individuals contributing equally to the mixture, mixtures were created where individual one would make up (N-1)/N of the mixture and individual two would make up 1/N of the mixture. When the three simulations were performed an increase in significance (smaller p-values) was observed. This gives further utility to the method when there are a small number of total contributors with the person of interest making up a small fraction of the mixture.
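The two-person mixture composition described above can be sketched as a weighted average per SNP. This is an illustration only; the function name `two_person_mixture` and the list inputs are hypothetical.

```python
def two_person_mixture(individual_one, individual_two, n):
    """Per-SNP mixture frequency when individual one contributes (n-1)/n
    of the DNA and individual two contributes 1/n."""
    return [
        ((n - 1) * a + b) / n
        for a, b in zip(individual_one, individual_two)
    ]
```

For n=10, individual two contributes f=10% of the mixture, and for n=1,000 the contribution is f=0.1%.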
[0163] Conclusions from simulations. Herein it was demonstrated that 10,000 to 50,000 SNPs are sufficient to resolve mixtures where the genomic DNA of the person of interest composes 10% to 0.1% of the DNA within the total mixture. Perhaps counterintuitively, noise plays an important but secondary role, since microarray technologies such as the Illumina 550K and Illumina 450S Duo platforms have a sufficiently large number of replicate probes compared to population sampling variance. Another consideration is that the choice of SNPs was not made with any specific intent, and therefore one could reduce the number of SNPs significantly by choosing the most informative SNPs, for example a set of SNPs that do not vary across differing populations.

[0164] Experimental Validation. To examine empirically the efficacy of the above noted method, various known mixtures of DNA from HapMap individuals were formed and genotyped on three different platforms. Listed in Table 1 and detailed herein are the compositions of the different mixtures formed and the platforms they were assayed across. The use of mixtures of HapMap individuals has several advantages. First, one can be confident of the genotype calls because in most cases more than one platform has been used to identify the consensus genotype. Second, trios are available, which allow the evaluation of identifying an individual using a relative's genotype data. Third, by using mixtures of multiple HapMap individuals one can evaluate the ability to resolve each individual within the mixture. Therefore simple two-person mixtures were constructed as well as complex mixtures containing contributions from 40+ individuals. With each mixture, the HapMap CEU individuals not present in the mixture were used as the reference population of the mixture.
[0165] Resolving an individual within mixtures of 40+ individuals. FIG. 3 shows the test-statistic for each individual within each mixture. Both individuals in the mixture and not in the mixture were tested for presence within the mixture. On each graph, the left y-axis represents the -log p-value, the right y-axis represents the normalized test-statistic S(Yij), and the bottom axis represents each individual. Each experiment was performed more than once and thus there are multiples of 86 individuals indexed on the bottom axis. For mixtures A, B, E, F, G and H, those in the mixture are shaded lightly and identified and those not in the mixture are shaded darker and identified. All individuals in the mixtures composed of more than 40 individuals were identified with zero false positives.
[0166] Resolving members within 2 person mixtures (f=1% and f=10%). For mixtures C and D, those individuals who are not in the mixtures are shaded dark and identified, those individuals who are related to a person in the mixture are colored orange, and those people in the mixture are shaded lighter and identified. It was possible to correctly identify individuals within the mixture with zero false-positives except, as expected, for relatives of individuals in the mixture, which appear at a midpoint between those in and those not in the mixture.
[0167] Resolving an individual from a mixture using a relative's genotypes. It is interesting to observe that there were no false-positives in Mixtures A, B, E, F, G or H, but there were false-positives when considering Mixtures C and D. This is not unexpected, since the HapMap CEU population is composed of trios and one is in fact resolving that the mother or father of the individual (a son or daughter) is in the mixture; see the data points indicated as "1-10" and "90-99", which mark individuals observed as significant in FIGs. 3a and 3c. Thus, one can easily resolve an individual (son or daughter) even when using their mother's genotypes or father's genotypes.
[0168] Resolving an individual from a mixture with 50,000 SNPs. In FIG. 3a, one can observe that all the mixtures are able to be resolved with no false-negatives when one uses all 504,605 SNPs present on the Illumina 550K platform. The same analysis was performed considering 50,000 SNPs (see FIG. 3b), and the samples were found to have the same degree of separation. Thus, even if a small fraction of the intended genotypes are generated (such as in a degraded sample), identification of an individual in a complex mixture is possible.
[0169] Resolving an individual when contributing less than 1%. In FIG. 3d, mixtures G and H were considered where the fraction of DNA of each individual is between 0.15% and 0.25% of the total mixture. One can see that using all the SNPs available one was able to resolve all the mixtures with no false-negatives on the Illumina 450S Duo platform. One can therefore resolve an individual even when the fraction of their DNA in the mixture is less than 1%.

[0170] This example demonstrates a method to detect the presence of an individual's genetic material (nucleic acid) in a complex mixture of genetic material from multiple subjects.
[0171] First, a reference sample of genetic material is created to provide an estimate of the mean allele frequencies of SNPs in the population represented by the reference sample (to obtain a reference SNP signature). The reference sample can be constructed by obtaining samples of genetic material from a commercial provider, such as the Coriell Cell Repository (Coriell Institute for Medical Research, Camden, NJ). The reference sample is composed of genetic material from one hundred individuals of Caucasian descent. The genetic material for the reference sample is available from the Coriell Cell Repository, Catalog number HD1000AU.
[0172] Next, the specific SNPs to be included in the analysis are selected.
The allele frequencies of all selected SNPs in the reference sample are measured. Once measured, SNPs with a mean allele frequency less than 0.05 or greater than 0.95 are eliminated from consideration. All remaining SNPs are selected for use in the subsequent analysis, and the mean allele frequencies from those remaining SNPs are recorded.
Alternatively, the allele frequencies of the selected SNPs can be obtained from a database that has previously measured the allele frequencies of the selected SNPs in a comparable reference population.
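The SNP selection step can be sketched as a simple filter on the reference mean allele frequencies. This is a minimal sketch; the dictionary input keyed by SNP identifier and the identifiers themselves are hypothetical placeholders.

```python
def select_snps(reference_freqs, low=0.05, high=0.95):
    """Keep SNPs whose reference mean allele frequency is neither
    less than `low` nor greater than `high`."""
    return {
        snp: freq
        for snp, freq in reference_freqs.items()
        if low <= freq <= high
    }
```

A SNP whose frequency equals exactly 0.05 or 0.95 is retained here, since the text eliminates only values less than 0.05 or greater than 0.95.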
[0173] Next, a complex mixture that contains DNA from numerous sources is collected and the mean allele frequencies of the SNPs selected above are then determined for the complex mixture.
[0174] Next, a sufficient amount of DNA is taken from a person of interest (or subject). This DNA is analyzed to determine the allele frequencies of the selected SNPs in the DNA from the person of interest.
[0175] Finally, the data obtained from the SNPs of the person of interest is compared with the data obtained from the reference sample and the data from the mixture to determine the source of the unknown sample. This process is repeated for a sufficient number of the selected SNPs to obtain the degree of certainty desired for establishing the match of the person of interest's DNA to the DNA in the complex mixture. The results from each SNP are combined and the output indicates the likelihood that the genetic material in the complex mixture belongs to the individual of interest.

[0176] In this example, the methods in the current disclosure are used for a forensic application. First, a reference sample of genetic material is assembled to provide an estimate of the mean allele frequencies of the SNPs to be analyzed in a given human population. The reference sample is constructed by obtaining samples of human genetic material from a commercial provider such as the Coriell Cell Repository (Coriell Institute for Medical Research, Camden, NJ). Genetic material from various human populations is available from the Coriell Cell Repository, including panels of individuals of Caucasian, African American, Middle Eastern, Asian, and other ethnic descents. In this example, reference samples representing panels of 10 or more individuals of Caucasian, African American, Middle Eastern, and Asian descent are obtained from the Coriell Cell Repository and combined to form the reference sample. The reference sample is then tested to determine the mean allele frequencies of all available SNPs and create a reference SNP signature. Alternatively, the mean allele frequencies of the SNPs to be analyzed can be obtained from a commercial database (thereby obtaining the reference SNP signature). SNPs returning a frequency value below 0.05 or above 0.95 can optionally be eliminated from consideration.
[0177] Next, a subject SNP signature is created by obtaining genetic material from the individual who is suspected of contributing genetic material to a sample obtained at a crime scene. The allele frequencies of the selected SNPs are measured for a genetic material sample from the subject to obtain the subject SNP signature.
[0178] Next, the sample of genetic material from the crime scene (test genetic material sample) is analyzed. The test genetic material sample is analyzed and the mean allele frequencies of the selected SNPs are obtained and recorded, thereby providing the sample SNP signature.
[0179] Finally, each of the signatures is compared to determine whether the unknown sample taken from the crime scene belongs to the subject. The subject SNP
signature (e.g., the allele frequency of each SNP for the subject) is compared to the reference SNP signature (e.g., the mean allele frequency of the same SNP in the reference) and compared to the sample SNP signature (the mean allele frequency in the test genetic material sample).
[0180] The output can be expressed in terms of the likelihood that the subject contributed to the test genetic material sample.

[0181] In this example, the methods in the current disclosure are used to conduct a forensic analysis of a sample that has been degraded as a result of exposure to environmental or other factors.
[0182] A reference sample of genetic material is assembled to provide an estimate of the mean allele frequencies of the SNPs to be analyzed in a given human population, and thereby provide a reference SNP signature. Genetic material from various human populations is available from the Coriell Cell Repository, including panels of individuals of Caucasian, African American, Middle Eastern, Asian, and other ethnic descents. Genetic material samples representing panels of 10 or more individuals of Caucasian, African American, Middle Eastern, and Asian descent are obtained from the Coriell Cell Repository and combined to form the reference sample. The reference sample is then tested to determine the allele frequencies of all available SNPs, thereby creating a reference SNP signature. Optionally, SNPs returning a frequency value below 0.05 or above 0.95 are eliminated from consideration.
[0183] A subject's genetic material is then collected from one or more individuals that are suspected of contributing genetic material to a test genetic material sample. In this example, genetic material is collected from 10 different suspects who had access to the location of the test genetic material sample. The genetic material from all individuals is combined to form a mixture sample, and the allele frequencies of the selected SNPs are measured, thereby forming a subject SNP signature.
[0184] Next, the degraded sample of genetic material is analyzed. The allele frequencies of the selected SNPs are measured and recorded, creating a sample SNP
signature.
[0185] Finally, the signatures (or at least a part thereof) obtained from each sample are compared to determine whether the degraded sample belongs to one of the 10 individuals who contributed genetic material to the test genetic material sample. The allele frequency of at least some of the SNPs in the degraded sample is compared to the mean allele frequency of the same SNPs in both the reference sample and the mixture sample. This process is repeated as many times as necessary for the selected SNPs. One thereby obtains enough SNP comparisons to determine if one of the 10 subjects contributed to the genetic material in the test genetic material sample.

[0186] In this example, the methods of the current disclosure are used to determine whether a human female is pregnant.
[0187] First, a suitable sample (a sample that can contain genetic material from a fetus in the host) is taken from the female host for analysis. The genetic material in the sample is isolated and a sample SNP signature is prepared from the genetic material. A subject SNP signature is then prepared by using a sample from the female subject.
[0188] The sample SNP signature is compared to the subject SNP signature, and if the comparison reveals that another person's genetic material is present, such as through additional SNPs, one concludes that the host is pregnant.
[0189] In the alternative, a further reference SNP signature can be used from an appropriate reference population, and the comparison can be between a) the subject SNP signature and each of b) the reference SNP signature and the sample SNP
signature.

[0190] In this example, the methods of the current disclosure are used to determine the paternity of an unborn child.
[0191] First, a suitable sample is taken from a pregnant female for analysis.
The sample will include genetic material from the unborn child. The SNPs in the sample are determined and a sample SNP signature is obtained from the unborn child.
The sample can optionally include the mother's genetic material.
[0192] Next, a suitable sample is obtained from the potential father and a SNP
signature is prepared for the potential father.
[0193] The SNP signature of the potential father can be compared to the sample SNP signature, and when the sample SNP signature only includes genetic material from the child, the likelihood that the potential father is the father of the child can be determined.
[0194] In the alternative, a reference SNP signature can be prepared and the SNP signature of the potential father can be compared to each of the reference SNP
signature and the sample SNP signature to determine if the potential father contributed to DNA of the unborn child.
[0195] As will be appreciated by one of skill in the art, one is not looking for specific matches between the SNPs in the sample SNP signature and the SNP
signature of the potential father, but rather a degree of similarity that is consistent with paternity.

[0196] In this example, a method is used to determine whether unknown tissue remains are of bovine or human origin. First, a reference sample is created by obtaining a sample of bovine genetic material. The bovine genetic material can be obtained from a donor bovine animal, or can be obtained from a commercial provider, such as the Coriell Cell Repository. The sample of bovine genetic material is prepared and analyzed to determine the mean allele frequencies of 1,000 SNPs. Remaining SNPs are selected for analysis and their values are recorded.
[0197] Next, a sample of human genetic material is prepared. The human genetic material can be obtained from a human donor, or can be obtained from a commercial provider, such as the Coriell Cell Repository. The human genetic material is analyzed, using the methods in the current disclosure, to determine the mean allele frequencies of the selected SNPs. Once obtained, the values are recorded.
[0198] Next, a sample of genetic material is prepared from the unknown tissue remains. The unknown sample is analyzed and the mean allele frequencies of the selected SNPs are obtained and recorded.
[0199] Finally, the data obtained from each sample are compared to determine the source of the unknown sample. The mean allele frequency of each SNP in the unknown tissue remains sample is compared to the mean allele frequency of the same SNPs in each of the bovine sample and the human sample. If the SNP frequencies of the unknown sample are more similar to the bovine allele frequencies, it will indicate a lower chance that the sample is human and if the SNP frequencies of the unknown sample are more similar to the human allele frequencies, it will indicate a lower chance that the sample is bovine. The results from each SNP are combined and summed, and the output indicates whether the unknown tissue remains are of bovine or human origin.
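One way the per-SNP comparisons could be combined and summed is sketched below. This is an assumption made for illustration: the score adds, over all SNPs, the distance to the human frequency minus the distance to the bovine frequency, so a positive total means the unknown sample is closer to the bovine sample. The function name and the decision rule at zero are hypothetical.

```python
def closer_source(unknown, bovine, human):
    """Classify an unknown sample by summed per-SNP similarity.

    Each argument is a list of mean allele frequencies over the same SNPs.
    """
    score = sum(
        abs(u - h) - abs(u - b)  # > 0 when this SNP is closer to bovine
        for u, b, h in zip(unknown, bovine, human)
    )
    if score > 0:
        return "bovine"
    if score < 0:
        return "human"
    return "undetermined"
```

In practice a significance test on the per-SNP values, rather than the sign of the sum alone, would be needed to express the degree of certainty.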

[0200] Many cell lines are most successfully cultured by growing the cells of interest along with supporting cell types. Examples include culturing human embryonic stem cells on a layer of mouse embryonic feeder cells, or growing primary human hepatocytes in co-culture with rat microvascular endothelial cells. In some embodiments, the methods in the current disclosure provide a quick and accurate method for distinguishing between cells of interest and supporting cells.
[0201] In this example, an embryonic stem cell line is cultured in co-culture with several different mouse embryonic feeder cells for several passages.
After culturing the embryonic stem cells for several passages, the embryonic stem cells are isolated from the mouse embryonic feeder cells. The methods of the current disclosure are then used as described below.
[0202] First, a reference sample is created by combining genetic material from the several different feeder cell lines that are used to culture the embryonic stem cell line of interest. The mean allele frequencies of numerous available SNPs in the reference sample are measured and the values are recorded.
[0203] Next, a sample of genetic material is obtained from the cell line of interest. In this example, the cell line of interest is a human embryonic stem cell line that is available from the NIH. A sample of this cell line is obtained, and the allele frequencies of the selected SNPs are measured and recorded.
[0204] After being successfully cultured for one or more passages in a co-culture with the three different types of feeder cells, the embryonic stem cells of interest are isolated from the feeder cells. To confirm that the embryonic stem cells have been successfully isolated from the feeder cells, a sample of isolated embryonic stem cells is collected and the genetic material from the cells is prepared for analysis.
The mean allele frequencies of the selected SNPs in the sample are obtained and recorded.
[0205] Finally, the data obtained from the sample of isolated embryonic stem cells are compared to the data obtained from each of the embryonic stem cell sample and the feeder cell mixture sample. The allele frequency of each SNP in the isolated embryonic stem cell sample is compared to the mean allele frequency of the same SNP in each of the embryonic stem cell sample and feeder cell mixture sample. This process is repeated for all of the selected SNPs. The results from each SNP are combined and the output indicates whether the isolated embryonic stem cell sample is free of feeder cells.

[0206] When a biopsy is performed on a tumor, cells from the tumor are typically analyzed to determine whether the cells are malignant or benign. The methods in the current disclosure can be used to analyze cells from a tumor biopsy and determine whether those cells are malignant or benign.
[0207] First, a benign tumor sample is created by combining genetic material from several different known benign tumor cells and/or healthy cells. In this example, several different known forms of benign bone tumors are used to create the sample. The mean allele frequencies of all available SNPs in the benign tumor sample are measured and the values are recorded.
[0208] Next, a malignant tumor sample is created to represent the different types of malignant bone cancers. In this example, several different known forms of malignant bone tumors are used to create the sample. Genetic material from malignant tumors classified as multiple myeloma, osteosarcoma, Ewing's sarcoma, and chondrosarcoma are combined to create the malignant tumor sample. The mean allele frequencies of the selected SNPs in the malignant tumor sample are measured and the values are recorded.
[0209] Next, a tissue biopsy is obtained from an unknown bone tumor and cells are isolated from the biopsied tissue using methods that are well known in the art.
The genetic material from the cells is isolated and the mean allele frequencies of the selected SNPs are measured and recorded.
[0210] Finally, the data obtained from the tumor biopsy sample are compared to the data obtained from each of the benign tumor sample and the malignant tumor sample. The mean allele frequency of each SNP in the unknown tumor biopsy sample is compared to the mean allele frequency of the same SNP in each of the benign tumor sample and the malignant tumor sample. This process is repeated for a sufficient number of the selected SNPs. The results from each SNP are combined, and the output indicates whether the tumor is composed of benign or malignant cells.

[0211] This example demonstrates one method of comparing allele frequencies for a SNP. A first set of SNP data are identified as the reference population, and a second set of SNP data are identified as the mixture population. For each individual SNP, the allele frequency values of the data in the reference population are averaged to provide a mean allele frequency value for each SNP in the reference population (thereby providing a reference SNP signature). This process is repeated with the mixture population, providing a mean allele frequency value for each SNP
in the mixture population (thereby providing a sample SNP signature).
[0212] For any given subject's SNP, the value of the allele frequency at each subject's SNP is compared to the mean allele frequency value of the same SNP
in both the reference population and the sample SNPs from the mixture.
[0213] For the first SNP to be analyzed, the mean allele frequency of the SNP in the mixture is subtracted from the SNP allele frequency value of the subject, and the absolute value of this difference is stored as a first value. Next, the mean allele frequency of the SNP in the reference population is subtracted from the SNP allele frequency value of the subject, and the absolute value of this difference is stored as a second value. Finally, a value is obtained for the individual SNP by subtracting the first value from the second value.
[0214] A negative value (down to -0.5) denotes that the subject is likely to be in the reference population. A positive value (up to 0.5) denotes that the subject is likely to be in the mixture, and a value of 0 denotes that the subject is equally likely to be in the mixture and the reference population.
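The per-SNP calculation of paragraph [0213] and its interpretation in paragraph [0214] can be sketched as follows. The function name `snp_distance` is hypothetical, and the inputs are assumed to be allele frequencies between 0 and 1.

```python
def snp_distance(subject, reference_mean, mixture_mean):
    """Distance to the reference minus distance to the mixture for one SNP.

    Positive: the subject is closer to the mixture.
    Negative: the subject is closer to the reference population.
    Zero: the subject is equally close to both.
    """
    return abs(subject - reference_mean) - abs(subject - mixture_mean)
```

For example, a subject with allele frequency 1.0 at a SNP where the reference mean is 0.5 and the mixture mean is 1.0 yields 0.5, the positive value that points toward the mixture.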
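[0215] In some embodiments, the above process can be repeated across all SNPs to be included in the analysis, and the value D(Y_{i,j}) obtained for each SNP j is summed, where Y_{i,j} is the allele frequency of subject i at SNP j, Pop_j is the mean allele frequency of SNP j in the reference population, and M_j is the mean allele frequency of SNP j in the mixture:

$$D(Y_{i,j}) = |Y_{i,j} - \mathrm{Pop}_j| - |Y_{i,j} - M_j| \qquad \text{(Equation 1)}$$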

[0216] The summation result is used to determine whether the subject is a member of the mixture population, a member of the reference population, or neither.
Additionally, a one-sample t-test for individual i can be taken and used to obtain a test statistic as follows:

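$$T(Y_i) = \frac{\mathrm{mean}(D(Y_{i,j})) - \mu_0}{\mathrm{sd}(D(Y_{i,j})) / \sqrt{s}} \qquad \text{(Equation 2)}$$

Here the mean and standard deviation are taken over the s SNPs, and μ_0 is the mean of D(Y_{i,j}) under the null hypothesis. One can use multiple references, extending this formula to a multi-dimensional test statistic. This may be especially useful for a person of mixed ethnicity, though it is not necessary.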
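Equations 1 and 2 can be sketched together as below. This is a minimal sketch, assuming a null mean of zero by default and using the standard normal distribution function to approximate an upper-tail p-value, which is a substitution made here for illustration and not necessarily how p-values were obtained in the simulations. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def t_statistic(subject, reference, mixture, mu0=0.0):
    """One-sample t statistic over the per-SNP values of Equation 1.

    Each argument is a list of allele frequencies over the same s SNPs
    (s must be at least 2, with non-constant per-SNP values).
    """
    d = [
        abs(y - pop) - abs(y - m)  # Equation 1 for one SNP
        for y, pop, m in zip(subject, reference, mixture)
    ]
    s = len(d)
    return (mean(d) - mu0) / (stdev(d) / sqrt(s))  # Equation 2


def upper_tail_p(t):
    """Approximate one-sided p-value from the standard normal distribution."""
    return 1.0 - NormalDist().cdf(t)
```

A large positive statistic suggests the subject is in the mixture, and a large negative statistic suggests the subject resembles the reference population more closely than the mixture.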

[0217] Different populations will have different mean SNP allele frequencies based on the genetic heritage of the population. This example provides one method of constructing a reference population for use with the methods of the current disclosure.
Such a reference population can be used to manage the effect of ancestry on the allele frequencies observed across many samples.
[0218] First, the subject's population is identified. If the subject is of Caucasian ancestry, a reference sample is created based on a Caucasian population. The reference sample can typically include samples from ten or more individuals who are members of the target population. Ideally, the individuals represent typical members of the target population. In a target population of Caucasian ancestry, the samples used to create the reference sample can include both female and male Caucasian individuals.
[0219] Next, the reference population sample is constructed by obtaining representative samples of genetic material from members of the target population. The reference population sample can be constructed by obtaining samples of genetic material from individual donors. Ten Caucasian donors are chosen to create the reference population sample. Five of the donors are Caucasian females and five of the donors are Caucasian males.
[0220] Samples of genetic material are obtained from each reference donor.
The allele frequencies of each SNP are measured in each sample, and the results are recorded. The values obtained for each SNP are summed across all ten of the donor samples and the mean allele frequency value is determined. The mean allele frequency value of each SNP (e.g., a reference SNP signature) can then be used in subsequent analyses as the mean allele frequency value of the reference population.
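The construction of the reference SNP signature from the donor samples can be sketched as a per-SNP average. This is an illustration only, assuming each donor is represented as a list of allele frequencies over the same ordered set of SNPs; the function name is hypothetical.

```python
def reference_signature(donor_freqs):
    """Mean allele frequency per SNP across all donor samples.

    donor_freqs: one list per donor, each with one value per SNP.
    """
    n_donors = len(donor_freqs)
    # zip(*...) groups the values of all donors for each SNP.
    return [sum(per_snp) / n_donors for per_snp in zip(*donor_freqs)]
```

With ten donors, as in this example, each entry of the result is the sum of ten measured values divided by ten.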

[0221] During the investigation of a crime, it can be useful to establish that a particular individual or individuals did not contribute genetic material to a given forensic sample. Such a sample can come from a surface that many people touch, such as a door handle, toilet seat, or other common surface. In this example, the methods in the current disclosure are used to verify that genetic material from a given subject is not present in a forensic sample.
[0222] First, a sample of genetic material is obtained from a subject. The sample is analyzed and the allele frequencies of the SNPs in the sample are determined (providing a subject SNP signature).
[0223] Next, genetic material is isolated from the forensic sample. The sample is analyzed and the allele frequencies of the SNPs in the sample are determined (providing a sample SNP signature).
[0224] Once the allele frequencies of the SNPs have been obtained for both the subject and the forensic sample, one compares the two in order to see if there are any SNPs present in the subject SNP signature that are absent from the sample SNP signature. A significant number of absent SNPs will indicate that the subject did not contribute to the forensic sample.
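The comparison of paragraph [0224] can be sketched as a simple count. The sketch below is illustrative only and rests on stated assumptions: the function name CountAbsentSnps and the detection_threshold parameter are hypothetical, a SNP allele is treated as "present" in the subject when its frequency is above zero, and it is treated as "absent" from the forensic sample when its measured frequency falls below the chosen threshold. How many absent SNPs count as "significant" is not addressed here.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: counts SNPs at which the subject carries the tracked
// allele while the forensic sample shows no detectable signal for it.
// detection_threshold is an assumed noise floor for the sample measurement.
std::size_t CountAbsentSnps(const std::vector<double>& subject,
                            const std::vector<double>& sample,
                            double detection_threshold) {
  if (subject.size() != sample.size()) {
    throw std::invalid_argument("signatures must cover the same SNPs");
  }
  std::size_t absent = 0;
  for (std::size_t i = 0; i < subject.size(); ++i) {
    const bool subject_has_allele = subject[i] > 0.0;
    const bool sample_lacks_allele = sample[i] < detection_threshold;
    if (subject_has_allele && sample_lacks_allele) {
      ++absent;
    }
  }
  return absent;
}
```

A large count relative to the number of SNPs examined would support the conclusion that the subject did not contribute to the forensic sample.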
[0225] In the alternative, the comparison can also include a reference SNP signature, where the subject's genetic material is also represented in the reference SNP signature, and the comparison can be between a) the subject SNP signature and the reference SNP signature, and b) the subject SNP signature and the sample SNP signature, in order to demonstrate that the subject is more likely to have contributed to the reference population than to the forensic sample.
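One way to picture the two-way comparison of paragraph [0225] is as a difference of distances. The sketch below assumes, purely for illustration, an absolute-difference distance per SNP; the function name MeanDistanceDifference is hypothetical, and the distance measure actually used is the one set out in the detailed description. A negative result means the subject signature lies closer to the reference signature than to the forensic sample signature; a positive result means the opposite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: for each SNP j, compute
//   |subject_j - reference_j| - |subject_j - sample_j|
// and return the mean over all SNPs. The absolute-difference distance is an
// assumption made for this example.
double MeanDistanceDifference(const std::vector<double>& subject,
                              const std::vector<double>& reference,
                              const std::vector<double>& sample) {
  if (subject.empty() || subject.size() != reference.size() ||
      subject.size() != sample.size()) {
    throw std::invalid_argument("signatures must be non-empty and aligned");
  }
  double sum = 0.0;
  for (std::size_t j = 0; j < subject.size(); ++j) {
    sum += std::fabs(subject[j] - reference[j]) -
           std::fabs(subject[j] - sample[j]);
  }
  return sum / static_cast<double>(subject.size());
}
```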

[0226] A forensic sample can contain genetic material from one or more unknown individuals. This example demonstrates how the currently disclosed methods can be used to determine whether a complex sample contains genetic material from one or more unknown subjects.
[0227] Genetic material from a forensic sample is isolated and characterized to obtain a sample SNP signature.
[0228] Genetic material from a subject is isolated and characterized to obtain a subject SNP signature.
[0229] Genetic material from a reference sample is isolated and characterized to obtain a reference SNP signature. The subject will be a member of the reference population and thus represented in the reference SNP signature.
[0230] The three SNP signatures are compared and the results indicate that the subject is not likely to have contributed to the genetic material in the forensic sample or that, while the subject did contribute to the forensic sample, at least one other subject, with a SNP signature different from the subject's SNP signature, also contributed to the forensic sample.

[0231] This example demonstrates one method of determining if any one of a number of subjects contributed to a test genetic material sample.
[0232] Genetic material from a forensic sample is isolated and characterized to obtain a sample SNP signature.
[0233] Genetic material from 100 subjects is isolated and characterized to obtain a subject SNP signature. The subject SNP signature includes the mean frequencies of the various SNPs across the 100 subjects.
[0234] Genetic material from a reference population is isolated and characterized to obtain a reference SNP signature.
[0235] The three SNP signatures are compared, as described herein. The results demonstrate that at least one of the 100 subjects contributed to the test genetic material sample. In an alternative arrangement, additional individual comparisons can be made to determine which of the 100 subjects contributed to the test genetic material sample.

[0236] This Example outlines how one can analyze SNP signatures. One obtains a reference SNP signature, a subject SNP signature, and a sample SNP signature. Each of the signatures includes the intensity levels from a SNP microarray of, respectively, a reference sample, a subject sample, or a test genetic material sample. One then compares two models, one where the individual of interest is assumed to be in the mixture, and another where the individual of interest is assumed not to be in the mixture, in the form of a posterior odds ratio (as explained in the detailed description above). One derives the likelihood of each of the two models using Bayesian inference to accurately assess the probability of the observations (as described in the detailed description above). With this method, a more robust and accurate model of the observations is created, giving a better statistical measure of evidence.
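The final step of the comparison, combining the two model likelihoods into a posterior odds ratio, follows directly from Bayes' rule and can be sketched as below. The sketch assumes the log-likelihood of the observations under each model has already been computed by whatever model the detailed description specifies; the function names LogPosteriorOdds and PosteriorProbabilityIn and the prior_in parameter (the prior probability that the individual is in the mixture) are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Hypothetical sketch: log posterior odds of "individual is in the mixture"
// versus "individual is not in the mixture".
//   posterior odds = (likelihood ratio) * (prior odds)
// Working in log space avoids underflow when likelihoods are tiny products
// over many SNPs.
double LogPosteriorOdds(double log_lik_in, double log_lik_out,
                        double prior_in) {
  if (!(prior_in > 0.0 && prior_in < 1.0)) {
    throw std::invalid_argument("prior_in must lie strictly between 0 and 1");
  }
  const double log_prior_odds = std::log(prior_in / (1.0 - prior_in));
  return (log_lik_in - log_lik_out) + log_prior_odds;
}

// Converts log posterior odds to the posterior probability of the
// "in the mixture" model.
double PosteriorProbabilityIn(double log_posterior_odds) {
  return 1.0 / (1.0 + std::exp(-log_posterior_odds));
}
```

Positive log posterior odds favor the model in which the individual of interest contributed to the mixture; negative values favor the model in which the individual did not.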

INCORPORATION BY REFERENCE

[0237] All references cited herein, including patents, patent applications, papers, text books, and the like, and the references cited therein, to the extent that they are not already, are hereby incorporated by reference in their entirety. In the event that one or more of the incorporated literature and similar materials differs from or contradicts this application, including but not limited to defined terms, term usage, described techniques, or the like, this application controls. In addition, "Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays," PLoS Genetics, August 2008, Vol. 4, No. 8, p. 1-9, is hereby incorporated by reference in its entirety, including any discussion regarding the methods disclosed therein, various applications of those methods, various formulas regarding the methods, and how to define and derive the various components of those formulas.

EQUIVALENTS
[0238] The foregoing description and Examples detail certain specific embodiments of the invention and describe the best mode contemplated by the inventors. It will be appreciated, however, that no matter how detailed the foregoing may appear in text, the invention may be practiced in many ways and the invention should be construed in accordance with the appended claims and any equivalents thereof.
[0239] The use of the words "function," "means" or "step" in the Detailed Description or Description of the Drawings or claims is not intended to indicate a desire to invoke the special provisions of 35 U.S.C. § 112, ¶ 6, to define the invention. To the contrary, if the provisions of 35 U.S.C. § 112, ¶ 6 are sought to be invoked to define the inventions, the claims will specifically and expressly state the exact phrases "means for" or "step for," and will also recite the word "function" (i.e., will state "means for performing the function of [insert function]"), without also reciting in such phrases any structure, material or act in support of the function. Thus, even when the claims recite a "means for performing the function of . . . " or "step for performing the function of . . . ," if the claims also recite any structure, material or acts in support of that means or step, or that perform the recited function, then the provisions of 35 U.S.C. § 112, ¶ 6 are not invoked. Moreover, even if the provisions of 35 U.S.C. § 112, ¶ 6 are invoked to define the claimed inventions, it is intended that the inventions not be limited only to the specific structure, material or acts that are described in the preferred embodiments, but in addition, include any and all structures, materials or acts that perform the claimed function as described in alternative embodiments or forms of the invention, or that are well known present or later-developed, equivalent structures, material or acts for performing the claimed function.
APPENDIX A
// #
// File: ExtractGRIntensities.cpp Creator: Waibhav Tembe Created: 2006-09-13
// Copyright (c) 2006, Translational Genomics Research Institute
// $Id: ExtractGRIntensities.cpp,v 1.2 2007-09-15 01:14:46 nhomer Exp $
// #
#include "ExtractGRIntensities.h"

/* May want to change this */
#define FILE_NAME_LENGTH 512
/* The maximum number of beads for one snp on a given chip */
/* Selected arbitrarily based on current ILLMN chips */
#define MAX_NUM_BEADS 120
// Globals defined elsewhere.
extern int VERBOSE;

int ExtractGRIntensities(string IlluminaFiles, string ExperimentType, string ExpDir, bool SuppressOutputFlag, string QuerySnp, int MagicNumber, int VersionNumber, uint32_t Header, int ChipType, int Normalize, int FilterLimit, double FilterStdDev, int FilterPercent) {

uint32_t SNPCount=0;
string ErrorHeader="Aborting! Commandline input error:\n";
FILE *IlluminaFP;

int NoOfFiles = 0;
FILE *CurrentOutputFile=NULL;

char **InputFileNames;
char FileNameBuffer[FILE_NAME_LENGTH];
char OutputFileName[FILE_NAME_LENGTH];
int i;

int EntryCounter = 0;

int green_values[MAX_NUM_BEADS];
int red_values[MAX_NUM_BEADS];
int num_values = 0;

int cur_snp_code, prev_snp_code, cur_green_val, cur_red_val;
bool SingleSnpMode = false;
int QuerySnpCode = 0;
int num_zero = 0;
int num_beads_filtered=0;
int num_beads_filtered_stddev=0;
int num_beads_filtered_zeros=0;
int num_snps_filtered=0;

double GreenMean, GreenStd, RedMean, RedStd;
int GreenMin, RedMin;
int GreenFilterPercent, RedFilterPercent, CombinedFilterPercent;
int NumEntries;
SnpEntry **Entries;

/* Check if we are to query a single snp */
if(strcmp(QuerySnp.c_str(), "SNP") != 0) {
SingleSnpMode = true;
QuerySnpCode = (int) (strtod(QuerySnp.c_str(), NULL) );
}

/* Create Output File Name */
strcpy(OutputFileName, "");
strcat(OutputFileName, IlluminaFiles.c_str());
strcat(OutputFileName, ".db");

cout << "***********************************************"
<< endl;
cout << "Starting Extraction"
<< endl;
cout << "Outputting to " << OutputFileName << endl;

string FileNameList=ExpDir+ExperimentType+"Files.txt";
ofstream FileNameListWriter;
FileNameListWriter.open(FileNameList.c_str(), fstream::out | fstream::app);
FileNameListWriter<<OutputFileName<<endl;
FileNameListWriter.close();
/* CHANGE THIS */
string SnpNames = IlluminaFiles + ".db"+"SnpNames.txt";
ofstream SnpNamesFile;

/* Get the File Names for each strip */
if((IlluminaFP = fopen(IlluminaFiles.c_str(), "r"))==0) {
printf("IlluminaFiles %s could not be opened. Terminating!\n", IlluminaFiles.c_str());
exit(0);
}
/* Get the number of files (strips) */
if(fscanf(IlluminaFP, "%d", &NoOfFiles)!=1) {
printf("Error! The first entry in %s must be the number of files listed."
" Terminating!\n", IlluminaFiles.c_str());
exit(0);
}
assert(NoOfFiles>0);
/* Allocate Memory for File Names */
InputFileNames = (char **)malloc(sizeof(char*)*NoOfFiles);
for(i=0;i<NoOfFiles;i++) {
InputFileNames[i] = (char *)malloc(sizeof(char)*FILE_NAME_LENGTH);
}

/* Get the File Names */
for(i=0;i<NoOfFiles;i++) {
if(fscanf(IlluminaFP, "%s", FileNameBuffer)!=1) {
printf("Error! Could not read the %d files in %s."
" Terminating!\n", i+1, IlluminaFiles.c_str());
exit(0);
}
/* Copy File Name from Buffer */
strcpy(InputFileNames[i], FileNameBuffer);
}

/* Close IlluminaFiles */
fclose(IlluminaFP);
if(! SuppressOutputFlag) {
/* Open Output File */
if((CurrentOutputFile = fopen(OutputFileName, "wb"))==0) {
printf("OutputFile (\"%s\") could not be opened. Terminating!\n", OutputFileName);
exit(0);
}
else {
InitializeHeader(CurrentOutputFile, Header, SNPCount, ChipType, false, SingleSnpMode, Normalize);
// The following will store names of SNPs
SnpNamesFile.open(SnpNames.c_str(), ofstream::out);
}
}
/* First get the number of Entries */
NumEntries = GetNumEntries(InputFileNames, NoOfFiles);
/* Allocate memory */
Entries = (SnpEntry**)malloc(sizeof(SnpEntry*)*NumEntries);
for(i=0;i<NumEntries;i++) {
Entries[i] = (SnpEntry*)malloc(sizeof(SnpEntry));
}

/* Read in values to Entries */
GetEntries(Entries, NumEntries, InputFileNames, NoOfFiles);
/* Sort them by SNPCODE */
SortEntries(Entries, 0, NumEntries-1, SNPCODE);
/* rescale based on minimum negative values */
GreenMin=0;
RedMin=0;

/* Check if we are to filter the values */
GetMeanAndMin(Entries, NumEntries, &GreenMean, &RedMean, &GreenMin, &RedMin);
if(FilterLimit>0 || 1==Normalize || FilterPercent>0) {
GetStandardDeviation(Entries, NumEntries, &GreenStd, &RedStd, GreenMean, RedMean);
if(FilterLimit > 0 ) {
GetFilterPercent(Entries, NumEntries, FilterPercent, &GreenFilterPercent, &RedFilterPercent, &CombinedFilterPercent);
printf("FilterPercent Limits Green:%d\tRed:%d \tCombined:%d\n", GreenFilterPercent, RedFilterPercent, CombinedFilterPercent);
}
printf("Mean [Green, Red]:[%f\t%f]\n", GreenMean, RedMean);
printf("Standard Deviation: [Green, Red]:[%f\t%f]\n", GreenStd, RedStd);
/* Adjust values using min values */
GreenMean-=(GreenMin-1);
RedMean-=(RedMin-1);
GreenFilterPercent-=(GreenMin-1);
RedFilterPercent-=(RedMin-1);
CombinedFilterPercent-= (GreenMin-1);
CombinedFilterPercent-= (RedMin-1);
}
else {
GreenMean=0;
RedMean=0;
GreenStd = 0;
RedStd = 0;
}
prev_snp_code = -1;
num_values=0;
for(i=0;i<NumEntries;i++) {
cur_snp_code = (*(Entries[i])).SNPCODE;
cur_green_val = (*(Entries[i])).GREEN;
cur_red_val = (*(Entries[i])).RED;
cur_green_val-=(GreenMin-1); /* -1 makes sure no zero values */
cur_red_val-=(RedMin-1);
if(cur_snp_code < prev_snp_code) {

/* Should be in ascending order */
fprintf(stderr, "Error. The snp codes %d and %d are not in "
"increasing order. Terminating!\n", prev_snp_code, cur_snp_code);
fprintf(stderr, "Possible cause(s)\n"
"1. Files %s not sorted in ascending order of SNPcode, or\n"
"2. The order in which files were listed in the file %s"
" is incorrect\n", InputFileNames[i], IlluminaFiles.c_str());
exit(0);
}
else if(cur_snp_code > prev_snp_code) {
/* all Values for prev SNP read */
if(num_values > 0) {
/* Write values to the output file and snp name to SnpNamesFile */
int NumWritten = WriteValues(prev_snp_code, num_values, green_values, red_values, SingleSnpMode, QuerySnpCode, SuppressOutputFlag, CurrentOutputFile, &SnpNamesFile, Normalize, GreenMean, RedMean, GreenStd, RedStd);
if(NumWritten > 0) {
SNPCount++;
}
num_beads_filtered_zeros += num_values -NumWritten;
}
else if(prev_snp_code != -1) {
num_snps_filtered++; /* Snp discarded because num_values=0 */
if (VERBOSE >= 1) cout<<"Filtered SNP = "<<prev_snp_code<<endl;
cout<<"Line read was "<<cur_snp_code<<"\t"
<<cur_green_val<<"\t"
<<cur_red_val<<endl;
}
/* Reset the values as we start reading new snp */
num_values = 0;
prev_snp_code = cur_snp_code;
}

if(cur_green_val == 0 || cur_red_val == 0) {
num_zero++;
}

/* Check if we must filter the values. Check if either: */
/* 1 - the channels are below the filter limit or */
/* 2 - the channels are greater than FilterStdDev standard deviations away from the mean. */
if((FilterLimit != 0) && ((cur_green_val < FilterLimit ||
cur_red_val < FilterLimit))) {
num_beads_filtered++;
/* Skip the bead */
}
else if((FilterStdDev>0.0) && ((FilterStdDev <
fabs((((double)cur_green_val)-GreenMean)/GreenStd)) ||
(FilterStdDev<fabs((((double)cur_red_val)-RedMean)/RedStd)))) {
/* Skip the bead */
num_beads_filtered_stddev++;
}
else if((FilterPercent > 0) && (cur_green_val <= GreenFilterPercent ||
cur_red_val <= RedFilterPercent || (cur_green_val+cur_red_val) <=
CombinedFilterPercent)) {
num_beads_filtered++;
/* Skip the bead */
}
else {
/* update the new values */
green_values[num_values] = cur_green_val;
red_values[num_values] = cur_red_val;
num_values++;
EntryCounter++; /* Cumulative no. of values stored in .db files */
}
}

/* Do not forget the last SNP */
if(num_values > 0) {

/* Write values to the output file and snp name to SnpNamesFile */
int NumWritten = WriteValues(prev_snp_code, num_values, green_values, red_values, SingleSnpMode, QuerySnpCode, SuppressOutputFlag, CurrentOutputFile, &SnpNamesFile, Normalize, GreenMean, RedMean, GreenStd, RedStd);
if(NumWritten > 0) {
SNPCount++;
}
num_beads_filtered_zeros += num_values - NumWritten;
}
else if(-1!=prev_snp_code) {
num_snps_filtered++;
if(VERBOSE >= 1) cout << "Filtered SNP at the End of file= "
<< prev_snp_code << endl;
}
if (VERBOSE >= 1) cout << "Total number of entries stored in output file: "
<< EntryCounter << endl;

if(! SuppressOutputFlag) {
/* Write the results to the header in the output file */
WriteResultsToHeader( CurrentOutputFile, Header, SNPCount, (int)GreenMean, (int)RedMean);
/* Close the current output file */
fclose(CurrentOutputFile);
/* Close the snp names file */
SnpNamesFile.close();
}
if(num_zero>0) {
cout << "***** WARNING: There were " << num_zero << " intensity values that were zero."
<< endl;
}
if(FilterPercent > 0) {

cout << "Total no. beads filtered based on the Filter criteria of FilterLimit("
<< FilterLimit << ") and FilterPercent(Percent: "
<< FilterPercent << " G: " << GreenFilterPercent << " R: "
<< RedFilterPercent << " C: "
<< CombinedFilterPercent << ") is: "
<<num_beads_filtered << endl;
}
if(FilterStdDev > 0) {
cout << "Total no. beads filtered because the value was more than "
<< FilterStdDev << " standard deviations away: "
<< num_beads_filtered_stddev << endl;
}
cout << "Total no. beads filtered because both channels were zero: "
<< num_beads_filtered_zeros << endl;
cout << "Total no. snps filtered because there weren't enough beads: "
<< num_snps_filtered << endl;
cout << "Extraction Complete"
<< endl;
cout << "***********************************************"
<< endl;

/* Deallocate Memory for File names */
for(i=0;i<NoOfFiles;i++) {
free(InputFileNames[i]);
}
free(InputFileNames);
/* Deallocate Memory for Entries */
for(i=0;i<NumEntries;i++) {
free(Entries[i]);
}

free(Entries);

return 0;
}

int WriteValues(int prev_snp_code, int num_values, int *green_values, int *red_values, bool SingleSnpMode, int QuerySnpCode, bool SuppressOutputFlag, FILE *CurrentOutputFile, ofstream *SnpNamesFile, int Normalize, double GreenMean, double RedMean, double GreenStd, double RedStd) {

/*
NOTE: The snp code must be included since some files might be missing values for a given snp. Additionally the snp code is guaranteed to be no more than 10 digits long. Thus we can use a 32-bit integer.
*/

int j, k;
uint32_t temp_snp_code;
uint16_t *temp;
double temp_double_green, temp_double_red;
int FoundNumValues=0;

/*
Allocate Memory. We are to write the number of values and the values.
Thus, we need a total of 2*num_values + 1 indexes.
*/

temp = (uint16_t *)malloc(sizeof(uint16_t)*((2*num_values)+2));
/*
We have read in all values for the snp, write to the file and reset the values.
*/

if(num_values > 0 && prev_snp_code > 0 && (!SingleSnpMode || (SingleSnpMode &&
QuerySnpCode == prev_snp_code))) {

/* Write values to the output file and SnpNamesFile */
if(! SuppressOutputFlag && CurrentOutputFile) {

/* Only put in temp if they are nonzero and positive */
for(j=0, k=0;j<num_values && k<num_values;j++) {
/* Normalize if necessary */
/* 1 = divide by mean. Others not implemented yet */
if(1 == Normalize) {
temp_double_green = (green_values[j] / GreenMean);
temp_double_red = (red_values[j] / RedMean);

/*
Since the green and red values are stored as integers in the .db file we must also store them as integers.
So multiply by some factor. This opens up a big issue of overflow values!
*/

/* This code has been commented out because there were data points that were exceeding the uint16 max
* value.
if(temp_double_green>65 || temp_double_red>65) {
cout<<"Multiplier in normalization too big to fit in 16 bits!"
<<"Values are [red, green]= ["
<<GreenMean<<" , "<<RedMean<<" , "
<<green_values[j]<<" , "<<red_values[j]<<" , "
<<temp_double_green<<" , "<<temp_double_red<<" ]. Exiting!"
<<endl;
exit(1);
}
*/

green_values[j]=(int) (temp_double_green*100);
red_values[j]=(int) (temp_double_red*100);
if(green_values[j]>65535) {
green_values[j] = 65535;
}

if(red_values[j]>65535) {
red_values[j] = 65535;
}

/* This code has been commented out because there were data points that were exceeding the uint16 max
* value.
if(green_values[j]>65535 ||
red_values[j]>65535) {
cout<<"Multiplier in normalization too big to fit in 16 bits!"
<<"Values are [red, green]= ["
<<red_values[j]<<" , "<<green_values[j]<<" ]. Exiting!"
<<endl;
exit(1);
}
*/

}
if(green_values[j]<0 || red_values[j]<0) {
cout<<"Intensity values are negative. "
<<"Values are [red, green]= ["
<<red_values[j]<<" , "<<green_values[j]<<" ]. Exiting!"
<<endl;
exit(1);
}
/* Write Green Value. Index in temp (1, 3, 5, ...) */
temp[2*(k)+1] = (uint16_t)green_values[j];
/* Write Red Value. Index in temp (2, 4, 6, ...) */
temp[2*(k+1)] = (uint16_t)red_values[j];

/* Only keep the values if at least one is greater than zero */
if(temp[2*(k)+1] > 0 || temp[2*(k+1)] > 0) {
/* Keep these values, update the number found */
k++;
FoundNumValues++;
}
}
/* Update the number of values found */
num_values = FoundNumValues;
if(num_values>0) {

/* Write to output file. Typecasting is **EXTREMELY** important to save space. */

/* Write the snp code to the file */
temp_snp_code = (uint32_t)prev_snp_code;
temp_snp_code = htonl(temp_snp_code);
fwrite(&temp_snp_code, sizeof(uint32_t), 1, CurrentOutputFile);

/* Write the number of values to file */
assert(num_values != 0);
temp[0] = (uint16_t)(2*num_values);

/* Convert to network byte order if writing to an output file. */
for(j=0;j<(2*num_values) + 1;j++) {
temp[j] = htons(temp[j]);
}
/* Write values to file */
fwrite(temp, sizeof(uint16_t), (2*num_values)+1, CurrentOutputFile);

/* Write the snp code to the SnpNames file */
(*SnpNamesFile)<<prev_snp_code << "\t"
<< prev_snp_code << "\t"
<<num_values << endl;

}
}
/* Print the results */
if(VERBOSE > 1) {
cout << endl << prev_snp_code << "\tNoOfValues"
<< num_values << endl << "\tGreen\tRed\tGreen\tRed..."
<< endl;
for(j=0;j<num_values;j++) {
cout << "["<<green_values[j]
<< "\t"
<< red_values[j]
<< "]"
<< endl;
}
cout << endl;
}
}
/* Deallocate Memory. */
free(temp);

return num_values;
}

void GetMeanAndMin(SnpEntry **Entries, int NumEntries, double * GreenMean, double * RedMean, int * GreenMin, int * RedMin) {

int i;
int cur_green_val, cur_red_val;
double green_mean = 0;
double red_mean = 0;
int green_min = 0;
int red_min = 0;
for(i=0;i<NumEntries;i++) {

cur_green_val = (*(Entries[i])).GREEN;
cur_red_val = (*(Entries[i])).RED;
green_mean += cur_green_val;

red_mean += cur_red_val;
green_min=(green_min<cur_green_val)?green_min:cur_green_val;
red_min=(red_min<cur_red_val)?red_min:cur_red_val;
}

/* Return the means */
(*GreenMean) = (green_mean/((double)NumEntries));
(*RedMean) = (red_mean/((double)NumEntries));
/* Return the mins */
(*GreenMin) = green_min;
(*RedMin) = red_min;
}
void GetStandardDeviation(SnpEntry **Entries, int NumEntries, double *GreenStandardDeviation, double *RedStandardDeviation, double GreenMean, double RedMean) {
int i;
int cur_green_val, cur_red_val;
/* sum of squares of the channel */
double green_ss = 0;
double red_ss = 0;
for(i=0;i<NumEntries;i++) {

cur_green_val = (*(Entries[i])).GREEN;
cur_red_val = (*(Entries[i])).RED;
green_ss += (cur_green_val - GreenMean)*(cur_green_val -GreenMean);
red_ss += (cur_red_val - RedMean)*(cur_red_val - RedMean);
}
/* Calculate sample standard deviation */
(*GreenStandardDeviation) = sqrt( green_ss / (NumEntries - 1) );
(*RedStandardDeviation) = sqrt( red_ss / (NumEntries - 1) );
}
void GetFilterPercent(SnpEntry **Entries, int NumEntries, int FilterPercent, int *GreenFilterPercent, int *RedFilterPercent, int *CombinedFilterPercent) {
int index = (int)((double)NumEntries*FilterPercent)/100;

if(FilterPercent == 0) {
return;
}

fprintf(stderr, "Currently finding FilterPercent limits. This may take a while ...\n");
assert(FilterPercent>=0 && FilterPercent <= 100);
/* By Green */
SortEntries(Entries, 0, NumEntries-1, GREEN);
(*GreenFilterPercent) = (*(Entries[index])).GREEN;
/* By Red */
SortEntries(Entries, 0, NumEntries-1, RED);
(*RedFilterPercent) = (*(Entries[index])).RED;
/* By Combined */
SortEntries(Entries, 0, NumEntries-1, COMBINED);
(*CombinedFilterPercent) = (*(Entries[index])).GREEN +
(*(Entries[index])).RED;

/* Revert back to sort by SNPCODE */
SortEntries(Entries, 0, NumEntries-1, SNPCODE);
}

int GetNumEntries(char **InputFileNames, int NoOfFiles) {
FILE *CurrentInputFile;
int i;
int cur_snp_code, cur_green_val, cur_red_val;
int ctr = 0;

/* Used to validate format for the input file */
char format_Code[50], format_Grn[50], format_Red[50];
/* Go through each file and read in values. */
for(i=0;i<NoOfFiles;i++) {
/* Open Input File */
if((CurrentInputFile = fopen(InputFileNames[i], "r"))==0) {
printf("IlluminaFiles (\"%s\") could not be opened. Terminating!\n", InputFileNames[i]);

exit(0);
}
/* Check the Header of the Input File*/
if(fscanf(CurrentInputFile,"%s %s %s", format_Code, format_Grn, format_Red)!=3) {
printf("wrong input. Input was %s %s %s. Terminating!\n", format_Code, format_Grn, format_Red);
exit(0);
}
else if(!(strcmp(format_Code,"Code")==0) || !(strcmp(format_Grn, "Grn")==0) || !(strcmp(format_Red, "Red")==0)) {
printf("wrong input. Input was %s %s %s. Terminating!\n", format_Code, format_Grn, format_Red);
exit(0);
}

/* Read in the values from file */
while(fscanf(CurrentInputFile, "%d %d %d", &cur_snp_code, &cur_green_val, &cur_red_val)==3) {
ctr++;
}

/* Close Input File */
fclose(CurrentInputFile);
}

return ctr;
}

void GetEntries(SnpEntry **Entries, int NumEntries, char **InputFileNames, int NoOfFiles) {
FILE *CurrentInputFile;
int i;
int cur_snp_code, cur_green_val, cur_red_val;
int ctr = 0;

/* Used to validate format for the input file */
char format_Code[50], format_Grn[50], format_Red[50];

/* Go through each file and read in values. */

for(i=0;i<NoOfFiles;i++) {
/* Open Input File */
if((CurrentInputFile = fopen(InputFileNames[i], "r"))==0) {
printf("IlluminaFiles (\"%s\") could not be opened. Terminating!\n", InputFileNames[i]);
exit(0);
}
/* Check the Header of the Input File*/
if(fscanf(CurrentInputFile,"%s %s %s", format_Code, format_Grn, format_Red)!=3) {
printf("wrong input. Input was %s %s %s. Terminating!\n", format_Code, format_Grn, format_Red);
exit(0);
}
else if(!(strcmp(format_Code,"Code")==0) || !(strcmp(format_Grn, "Grn")==0) || !(strcmp(format_Red, "Red")==0)) {
printf("wrong input. Input was %s %s %s. Terminating!\n", format_Code, format_Grn, format_Red);
exit(0);
}

/* Read in the values from file */
while(fscanf(CurrentInputFile, "%d %d %d", &cur_snp_code, &cur_green_val, &cur_red_val)==3) {
assert(ctr<NumEntries);
(*(Entries[ctr])).SNPCODE = cur_snp_code;
(*(Entries[ctr])).GREEN = cur_green_val;
(*(Entries[ctr])).RED = cur_red_val;
ctr++;
}
/* Close Input File */
fclose(CurrentInputFile);
}
assert(ctr==NumEntries);

}

void SortEntries(SnpEntry **Entries, int low, int high, int Field) {
/* MergeSort! */
int mid = (low + high)/2;
int start_upper = mid + 1;
int end_upper = high;
int start_lower = low;
int end_lower = mid;
int ctr, i;
SnpEntry **temp_entries;
if(low >= high) {
return;
}

/* Partition the list into two lists and then sort them recursively */
SortEntries(Entries, low, mid, Field);
SortEntries(Entries, mid+1, high, Field);

temp_entries = (SnpEntry**)malloc(sizeof(SnpEntry*)*(high-low+1));
/* Merge the two lists */
ctr = 0;
while( (start_lower<=end_lower) && (start_upper<=end_upper) ) {
if( CompareSnpEntry(Entries[start_lower], Entries[start_upper], Field) <= 0) {
temp_entries[ctr] = Entries[start_lower];
start_lower++;
}
else {
temp_entries[ctr] = Entries[start_upper];
start_upper++;
}
ctr++;
}
if(start_lower<=end_lower) {
while(start_lower<=end_lower) {
temp_entries[ctr] = Entries[start_lower];
ctr++;
start_lower++;
}
}

else {
while(start_upper<=end_upper) {
temp_entries[ctr] = Entries[start_upper];
ctr++;
start_upper++;
}
}
for(i=low, ctr=0;i<=high;i++, ctr++) {
Entries[i] = temp_entries[ctr];
/* Check to see if we sorted properly - ascending */
if(ctr>0 && CompareSnpEntry(Entries[i-1], Entries[i], Field) > 0) {
fprintf(stderr, "Sorted improperly\n");
fprintf(stderr, "low:%d\thigh:%d\n", low, high);
fprintf(stderr, "i-1:%d\ti:%d\tField:%d\n", i-1, i, Field);
fprintf(stderr, "i-1:\t%d\t%d\t%d\n", (*(Entries[i-1])).SNPCODE, (*(Entries[i-1])).GREEN, (*(Entries[i-1])).RED);
fprintf(stderr, "i:\t%d\t%d\t%d\n", (*(Entries[i])).SNPCODE, (*(Entries[i])).GREEN, (*(Entries[i])).RED);
exit(1);
}
}

free(temp_entries);
}

int CompareSnpEntry(SnpEntry *E1, SnpEntry *E2, int Field) {
switch(Field) {
case SNPCODE:
if((*E1).SNPCODE > (*E2).SNPCODE) {
return 1;
}
else if((*E1).SNPCODE == (*E2).SNPCODE) {
return 0;
}
else {
return -1;
}
break;

case GREEN:
if((*E1).GREEN > (*E2).GREEN) {
return 1;
}
else if((*E1).GREEN == (*E2).GREEN) {
return 0;
}
else {
return -1;
}
break;
case RED:
if((*E1).RED > (*E2).RED) {
return 1;
}
else if((*E1).RED == (*E2).RED) {
return 0;
}
else {
return -1;
}
break;
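case COMBINED: {
/* Braces give the locals their own scope so the default label does not jump past their initialization */
int combined1 = (*E1).RED + (*E1).GREEN;
int combined2 = (*E2).RED + (*E2).GREEN;
if(combined1 > combined2) {
return 1;
}
else if(combined1 == combined2) {
return 0;
}
else {
return -1;
}
break;
}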
default:
fprintf(stderr, "Error in CompareSnpEntry in ExtractGRIntensities.cpp\n");
break;
}
return 0;
}

/*
A different name so that it does not clash with system supported error headers (if any!)
*/
#include <stdio.h>
#include <stdlib.h>
#include "DError.h"
enum {Exit, Warn, LastActionType};
enum {
Dummy,
OutofRange, /* e.g. command line args */
IllegalValue, /* e.g. negative probe intensity */
IllegalFileName,
/* KeepAdding */
LastErrorType
};
static char ErrorString[][20]=
{ "\0", "OutOfRange", "IllegalValue", "IllegalFileName"};
static char ActionType[][20]={"Fatal Error", "Warning"};
void PrintError(char* FunctionName, char *VariableName, char* Message, int type, int Action) {

fprintf(stderr, "\nIn function \"%s\": %s[%s]. ", FunctionName, ActionType[Action], ErrorString[type]);
/* Based on the type of error, variable may or may not have a "printable" string */
if(VariableName) fprintf(stderr, "VariableName: %s ", VariableName);
fprintf(stderr, "\n");

switch(Action) {
case Exit:
fprintf(stderr, " ***** Exiting due to errors *****\n");
exit(0);
break; /* Not necessary actually! */
case Warn:
return; break;
default:
fprintf(stderr, "Trouble!!!\n");

}
}

#include <iostream.h>
#include <stdlib.h>
enum {Little, Big};
int CheckEndian() {
int i=0x12345678;
if (*(char*)&i==0x12 ) {
//cout<<"Big endian"<<endl;
return Big;
}
else if (*(char*)&i==0x78 ) {
//cout<<"Little endian"<<endl;
return Little;
}
else {
cout<<"You invented a new architecture! Congratulations. Start a company."<<endl;
exit(0);
}
}

/* Required to call R code */
#define MATHLIB_STANDALONE 1
#include "Rmath.h"

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
#include <string.h>
#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
#ifdef HAVE_SYS_SOCKET_H
#include <sys/socket.h>
#endif
#ifdef HAVE_NETINET_H
#include <netinet/in.h>
#endif
#include "Analyze.h"
void Analyze( char *MixtureOfInterestFileName, char *PopRefListFileName, char *PeopleOfInterestFileName, int TestStatistic, int NormalizeTestStatistic, char *SnpNamesFileName, int NormalizeChips, int MeanNormalize, char *MeanPeopleListFileName, char *OutputFileName, int DistanceMeasure, int CorrelationXDistance, int CorrelationYDistance, int PrintSummary) {
int i;
int NumSnps=0;
SnpEntry * MixtureEntries; /* Stores the pool we wish to inspect */
char **PopRefFileNames;
int *PopRefChipTypes;
int NumPopRef=0;
double *PopRefTestStatistics;
char **PeopleFileNames;

int *PeopleChipTypes;
int NumPeople=0;
double *PeopleTestStatistics;
char **MeanPeopleFileNames;
int *MeanPeopleChipTypes;
int NumMeanPeople=0;
double *MeanPeopleTestStatistics;

double *PopulationMean; /* Mean Frequency of the Population */
int *SnpNames;

/* Get the SnpNames and the number of Snps */
GetSnpNames(&SnpNames, &NumSnps, SnpNamesFileName);
/* Get the pool of interest */
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Reading in from %s\n", MixtureOfInterestFileName);
GetMixtureEntries(&MixtureEntries, &SnpNames, NumSnps, MixtureOfInterestFileName);
fprintf(stderr, "%s", BREAK_LINE);
assert(NumSnps>1);
/* Get the population reference file names and chip types */
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Reading in from %s\n", PopRefListFileName);
GetFileNames(&NumPopRef, PopRefListFileName, &PopRefFileNames, &PopRefChipTypes);
fprintf(stderr, "%s", BREAK_LINE);
assert(NumPopRef>1);
if(MeanNormalize==1) {
/* Get the mean people file names and chip types */
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Reading in from %s\n", MeanPeopleListFileName);
GetFileNames(&NumMeanPeople, MeanPeopleListFileName, &MeanPeopleFileNames, &MeanPeopleChipTypes);
fprintf(stderr, "%s", BREAK_LINE);
assert(NumMeanPeople>0);
}

/* Get the people of interest file names and chip types */
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Reading in from %s\n", PeopleOfInterestFileName);

GetFileNames(&NumPeople, PeopleOfInterestFileName, &PeopleFileNames, &PeopleChipTypes);
fprintf(stderr, "%s", BREAK_LINE);
assert(NumPeople>0);
/* Precompute the mean allele frequency for the population */
PopulationMean = (double*)malloc(sizeof(double)*NumSnps);
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Getting Population Mean\n");
GetPopulationMean(&MixtureEntries, NumSnps, NumPopRef, PopRefFileNames, &PopRefChipTypes, &PopulationMean, &SnpNames, NormalizeChips);
fprintf(stderr, "%s", BREAK_LINE);

/* Compute Statistic Distribution and Reference Population */
/* First get all Test Statistics */
PopRefTestStatistics = (double*)malloc(sizeof(double)*NumPopRef);
PeopleTestStatistics = (double*)malloc(sizeof(double)*NumPeople);
if(MeanNormalize==1) {
MeanPeopleTestStatistics =
(double*)malloc(sizeof(double)*NumMeanPeople);
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Computing Mean People Test Statistics\n");
GetTestStatistics(&MixtureEntries, NumMeanPeople, MeanPeopleFileNames, &MeanPeopleChipTypes, NumSnps, &SnpNames, &PopulationMean, TestStatistic, &MeanPeopleTestStatistics, DistanceMeasure, CorrelationXDistance, CorrelationYDistance, 0, NormalizeChips);
fprintf(stderr, "%s", BREAK_LINE);
}
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Computing Reference Population Test Statistics \n");
GetTestStatistics(&MixtureEntries, NumPopRef, PopRefFileNames, &PopRefChipTypes, NumSnps, &SnpNames, &PopulationMean, TestStatistic, &PopRefTestStatistics, DistanceMeasure, CorrelationXDistance, CorrelationYDistance, 0, NormalizeChips);
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Computing People of Interest Test Statistics\n");
GetTestStatistics(&MixtureEntries, NumPeople, PeopleFileNames, &PeopleChipTypes, NumSnps, &SnpNames, &PopulationMean, TestStatistic, &PeopleTestStatistics, DistanceMeasure, CorrelationXDistance, CorrelationYDistance, 0, NormalizeChips);
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Computing Statistics and Outputting\n");
ComputeStatistics( &PopRefTestStatistics, NumPopRef, PopRefFileNames, &PeopleTestStatistics, NumPeople, PeopleFileNames, &MeanPeopleTestStatistics, NumMeanPeople, MeanPeopleFileNames, NormalizeTestStatistic, MeanNormalize, PrintSummary, OutputFileName);
fprintf(stderr, "Outputted to %s\n", OutputFileName);
fprintf(stderr, "%s", BREAK_LINE);

/* Deallocate Memory */
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Cleaning Up\n");
free(MixtureEntries);

if(MeanNormalize==1) {
for(i=0;i<NumMeanPeople;i++) {
free(MeanPeopleFileNames[i]);
}
free(MeanPeopleFileNames);
free(MeanPeopleChipTypes);
}

for(i=0;i<NumPeople;i++) {
free(PeopleFileNames[i]);
}
free(PeopleFileNames);
free(PeopleChipTypes);
for(i=0;i<NumPopRef;i++) {
free(PopRefFileNames[i]);
}
free(PopRefFileNames);
free(PopRefChipTypes);
free(PopulationMean);
free(SnpNames);

free(PopRefTestStatistics);
free(PeopleTestStatistics);
fprintf(stderr, "Terminating Successfully\n");
fprintf(stderr, "%s", BREAK_LINE);
}

void GetSnpNames(int **SnpNames, int *NumSnps, char *SnpNameFileName) {
int i;
FILE *Fp;
int curSnpId;

/* First get the number of Snps */
if(!(Fp=fopen(SnpNameFileName, "r"))) {
fprintf(stderr, "Error opening %s in GetSnpNames. Terminating!\n", SnpNameFileName);
exit(1);
}
(*NumSnps) = 0;
while(fscanf(Fp, "%d", &curSnpId) == 1) {
(*NumSnps)++;
}
fclose(Fp);
/* Allocate memory */
(*SnpNames) = (int*)malloc(sizeof(int)*(*NumSnps));
/* Read in Snp Names */
if(!(Fp=fopen(SnpNameFileName, "r"))) {
fprintf(stderr, "Error opening %s in GetSnpNames. Terminating!\n", SnpNameFileName);
exit(1);
}
i=0;
while(fscanf(Fp, "%d", &curSnpId) == 1) {
(*SnpNames)[i] = curSnpId;
i++;
}
assert(i==(*NumSnps));
fclose(Fp);
}
void GetMixtureEntries(SnpEntry **MixtureEntries, int **SnpNames, int NumSnps, char *MixtureOfInterestFileName) {
int t = ReadChipIntoEntries(MixtureEntries, SnpNames, NumSnps, MixtureOfInterestFileName);
assert(t > 0);
}

void GetFileNames(int *NumEntries, char *FileName, char ***FileNames, int **ChipTypes) {
int i;
FILE *Fp;
int curChipType;
char curFileName[MAX_FILENAME_LENGTH];
/* First get the number of files */
if(!(Fp=fopen(FileName, "r"))) {
fprintf(stderr, "Error opening %s in GetEntries. Terminating!\n", FileName);
exit(1);
}
(*NumEntries) = 0;
while(fscanf(Fp, "%d %s", &curChipType, curFileName) == 2) {
(*NumEntries)++;
}
assert((*NumEntries)>0);
fclose(Fp);

(*FileNames) = (char**)malloc(sizeof(char*)*(*NumEntries));
(*ChipTypes) = (int*)malloc(sizeof(int)*(*NumEntries));

/* Next read in the entries */
if(!(Fp=fopen(FileName, "r"))) {
fprintf(stderr, "Error opening %s in GetEntries. Terminating!\n", FileName);
exit(1);
}
i=0;
while(fscanf(Fp, "%d %s", &curChipType, curFileName) == 2) {
(*ChipTypes)[i] = curChipType;
(*FileNames)[i] = (char*)malloc(sizeof(char)*MAX_FILENAME_LENGTH);
strcpy((*FileNames)[i], curFileName);
/* Advance to the next slot so each entry is stored separately */
i++;
}

fclose(Fp);
}

int ReadChipIntoEntries(SnpEntry **Entries, int **SnpNames, int NumSnps, char *FileName) {

int i, j;
FILE *Fp;
struct TGenHeader header;
int tempNumSnps;
int tempSnpCode;
int tempNumProbePairs;
unsigned short int *tempProbePairs;
int curIndex;

int prevSnpCode=-1;
int ctr;

/* Allocate Memory */
(*Entries) = (SnpEntry*)malloc(sizeof(SnpEntry)*NumSnps);
/* Open for reading a binary file */
if(!(Fp=fopen(FileName, "rb"))) {
fprintf(stderr, "Error opening %s in GetEntries. Terminating!\n", FileName);
exit(1);
}
/* Read Header */
ReadHeader(Fp, &header);
tempNumSnps = header.SNPCount;
assert(header.ProcessMMFlag == 0);
assert(header.SingleSnpMode == 0);

assert(header.Normalize == 0);
/* Read in entries */
curIndex=0;
ctr=0;
for(i=0;i<tempNumSnps;i++) {
/* Read in Snp Code */
fread(&tempSnpCode, sizeof(int), 1, Fp);
tempSnpCode = (int)ntohl(tempSnpCode);
/* Read in the number of probes */
fread(&tempNumProbePairs, sizeof(unsigned short int), 1, Fp);
tempNumProbePairs = ntohs(tempNumProbePairs);
assert(tempNumProbePairs!=0);
if(tempNumProbePairs > 0) {
/* Read in the probe pairs */
tempProbePairs = (unsigned short int*)malloc(sizeof(unsigned short int)*tempNumProbePairs);
if(fread(tempProbePairs, sizeof(unsigned short int), tempNumProbePairs, Fp) != tempNumProbePairs) {
fprintf(stderr, "Error reading in the %dth snp from %s. Terminating!\n", i, FileName);
exit(1);

}

/* Check that snp codes are increasing */
if(prevSnpCode >= tempSnpCode) {
fprintf(stderr, "prevSnpCode:%d \ttempSnpCode:%d\n", prevSnpCode, tempSnpCode);
fprintf(stderr, "FileName:%s\n", FileName);
}
assert(prevSnpCode<tempSnpCode);
prevSnpCode = tempSnpCode;
/* Check that we are on the right entry (test the bound before indexing) */
while(curIndex<NumSnps && (*SnpNames)[curIndex] < tempSnpCode) {
curIndex++;
}
/* Only read in if we have the same snp code */
if(curIndex < NumSnps && tempSnpCode == (*SnpNames)[curIndex]) {
assert(curIndex<NumSnps);
assert((*SnpNames)[curIndex] == tempSnpCode);

assert(tempNumProbePairs%2==0);
double Mean =0.0;
for(j=0; j<tempNumProbePairs/2;j++) {
double green = ntohs(tempProbePairs[2*j]);
double red = ntohs(tempProbePairs[2*j + 1]);
Mean += red/(red+green);
}
Mean /= (tempNumProbePairs/2);
(*Entries)[curIndex].ProbeMean = Mean;
ctr++;
}
/* Deallocate temporary memory */
free(tempProbePairs);
}
}
if(ctr<=0) {
fprintf(stderr, "Error. Did not read any probe values from %s\n", FileName);
assert(ctr>0);
}

fclose(Fp);
return ctr;
}
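The per-SNP value stored in ProbeMean above is the average of red/(red+green) over the probe pairs. A minimal standalone sketch of that calculation follows. The function name probe_mean is hypothetical, the input is assumed to be already converted to host byte order, and the guard against a zero denominator is an added assumption that the listing itself does not make.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the listing): averages red/(red+green)
 * over interleaved (green, red) values in host byte order.
 * Assumption: pairs whose total is zero are skipped, and -1.0 is returned
 * when no pair is usable, following the listing's use of negative values
 * to mark a missing entry. */
double probe_mean(const uint16_t *values, size_t num_values)
{
    double sum = 0.0;
    size_t used = 0;

    assert(values != NULL || num_values == 0);
    for (size_t j = 0; j + 1 < num_values; j += 2) {
        double green = values[j];
        double red = values[j + 1];
        if (red + green > 0.0) {
            sum += red / (red + green);
            used++;
        }
    }
    return used > 0 ? sum / (double)used : -1.0;
}
```

For the pairs (green=1, red=1) and (green=1, red=3) the two ratios are 0.5 and 0.75, so the function returns their average.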

void ReadHeader(FILE* fp, struct TGenHeader *h) {
/* Return the header byte of gpb files */
unsigned int Header;
fread(&Header, sizeof(unsigned int), 1, fp);
Header=ntohl(Header);
/* printf("Header is %d.\n", Header); */
/* read magic number as first two bytes */
(*h).MagicNumber=(unsigned int)Header >> 16;
/* Read the SNPCount as the second byte */
fread(&((*h).SNPCount), sizeof(unsigned int), 1, fp);
(*h).SNPCount=ntohl((*h).SNPCount);
/* Read the ChipType as the third byte */
fread(&((*h).ChipType), sizeof(unsigned int), 1, fp);
(*h).ChipType=ntohl((*h).ChipType);
/* Read the ProcessMMFlag as the fourth byte */
fread(&((*h).ProcessMMFlag), sizeof(unsigned int), 1, fp);
(*h).ProcessMMFlag=ntohl((*h).ProcessMMFlag);
/* Read the SingleSnpMode as the fifth byte */
fread(&((*h).SingleSnpMode), sizeof(unsigned int), 1, fp);
(*h).SingleSnpMode=ntohl((*h).SingleSnpMode);
/* Read the Normalize as the sixth byte */
fread(&((*h).Normalize), sizeof(unsigned int), 1, fp);
(*h).Normalize=ntohl((*h).Normalize);
/* Read the Average PMA as the seventh byte */
fread(&((*h).AverageChannel1), sizeof(unsigned int), 1, fp);
(*h).AverageChannel1=ntohl((*h).AverageChannel1);
/* Read the Average PMB as the eighth byte */
fread(&((*h).AverageChannel2), sizeof(unsigned int), 1, fp);
(*h).AverageChannel2=ntohl((*h).AverageChannel2);
}
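ReadHeader relies on ntohl to convert each 32-bit header field from network (big-endian) order. The sketch below shows the same decoding done by hand on a byte buffer, which avoids any dependence on host endianness or on the width of unsigned int. Both function names are hypothetical and are not part of the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one big-endian (network order) 32-bit
 * word from four bytes, independent of host endianness. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: decode `count` consecutive header words
 * from `buf` into `out`. */
void decode_header_words(const unsigned char *buf, size_t count, uint32_t *out)
{
    assert(buf != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        out[i] = read_be32(buf + 4 * i);
    }
}
```

With this layout, the magic number taken in the listing as the upper 16 bits of the first word would be `out[0] >> 16`.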

int ReadGenotypesIntoEntries(SnpEntry **Entries, int **SnpNames, int NumSnps, char *FileName) {
int i;
FILE *Fp;
int tempSnpId;
char tempGenotype;
int curIndex;
int prevSnpId=-1;

/* Next allocate memory for the Genotypes */
(*Entries) = (SnpEntry*)malloc(sizeof(SnpEntry)*(NumSnps));
/* Next read in entries */
if(!(Fp=fopen(FileName, "r"))) {
fprintf(stderr, "Error opening %s in GetEntries. Terminating!\n", FileName);
exit(1);
}
i=0;
curIndex=0;
while(fscanf(Fp, "%d %c", &tempSnpId, &tempGenotype) == 2) {
assert(tempSnpId > prevSnpId);
while(curIndex < NumSnps && (*SnpNames)[curIndex] <
tempSnpId) {
curIndex++;
}
if(curIndex < NumSnps && tempSnpId == (*SnpNames) [curIndex]) {
if(tempGenotype == 'A') {
(*Entries)[curIndex].ProbeMean= 0.0;
}
else if(tempGenotype == 'H') {
(*Entries)[curIndex].ProbeMean = 0.5;
}
else if(tempGenotype == 'B') {
(*Entries)[curIndex].ProbeMean = 1.0;
}
else if(tempGenotype == 'N') {
(*Entries)[curIndex].ProbeMean = -1.0;
}
else {
fprintf(stderr, "Could not read the %dth genotype in %s. Terminating!\n", i, FileName);
exit(1);
}
}
i++;
prevSnpId = tempSnpId;
}
fclose(Fp);

return NumSnps;
}
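The genotype branch above maps each call to a B-allele frequency: 'A' to 0.0, 'H' to 0.5, 'B' to 1.0, and 'N' (no call) to the sentinel -1.0. A compact sketch of that mapping as a standalone function is shown below. The name genotype_to_frequency and the ok out-parameter are hypothetical; the listing terminates the program on an unknown code instead of reporting it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a genotype call to the allele-frequency
 * value used in the listing. *ok is set to 0 for an unrecognised
 * code and to 1 otherwise. */
double genotype_to_frequency(char genotype, int *ok)
{
    assert(ok != NULL);
    *ok = 1;
    switch (genotype) {
    case 'A': return 0.0;   /* homozygous A */
    case 'H': return 0.5;   /* heterozygous */
    case 'B': return 1.0;   /* homozygous B */
    case 'N': return -1.0;  /* no call: negative marks "missing" */
    default:
        *ok = 0;
        return -1.0;
    }
}
```

Downstream code in the listing treats any negative ProbeMean as missing, so a no-call and an unrecognised code can share the same return value as long as the caller checks ok.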
void GetTestStatistics(SnpEntry **MixtureEntries, int NumPeople, char **PeopleFileNames, int **PeopleChipTypes, int NumSnps, int **SnpNames, double **PopulationMean, int TestStatistic, double **TestStatistics, int DistanceMeasure, int CorrelationXDistance, int CorrelationYDistance, int Output, int NormalizeChips) {
int i;
SnpEntry *PeopleEntries;
char Rotate[]="II**/*/*++--**\\++";
int NumSnpsUsed=0;
int TotalNumSnpsUsed=0;
if(Output==1) {
fprintf(stderr, "Warning! The output flag is to be used for debugging only.\n");
}
fprintf(stderr, "Out of %d, currently on:\t %6d", NumPeople, 0);
for(i=0;i<NumPeople;i++) {
fprintf(stderr, "\b\b\b\b\b\b\b\b%c %6d", Rotate[(i+1)%16], i+1);
if((*PeopleChipTypes)[i] == 0) {
/* From a gpb file */
int curNumSnps = ReadChipIntoEntries(&PeopleEntries, SnpNames, NumSnps, PeopleFileNames[i]);
if(NormalizeChips==1) {
NormalizeChip(&PeopleEntries, NumSnps, MixtureEntries);
}
assert(curNumSnps > 0);
}
else if((*PeopleChipTypes)[i] == 1) {
/* From a genotype file */
int curNumSnps = ReadGenotypesIntoEntries(&PeopleEntries, SnpNames, NumSnps, PeopleFileNames[i]);
assert(curNumSnps > 0);
}
else {
fprintf(stderr, "Error. Incorrect chiptype specified for file %s. Terminating!\n", PeopleFileNames[i]);
exit(1);
}
(*TestStatistics)[i] = ComputeStatistic(MixtureEntries, NumSnps, &PeopleEntries, PopulationMean, TestStatistic, DistanceMeasure, CorrelationXDistance, CorrelationYDistance, Output, &NumSnpsUsed);
TotalNumSnpsUsed+=NumSnpsUsed;
free(PeopleEntries);
}
fprintf(stderr, "\n");
fprintf(stderr, "On average, the number of Snps used for computing this group's Test Statistic was %lf\n", (double)(1.0*TotalNumSnpsUsed)/NumPeople);
}

void GetPopulationMean(SnpEntry **MixtureEntries, int NumSnps, int NumPopRef, char **PopRefFileNames, int **PopRefChipTypes, double **PopulationMean, int **SnpNames, int NormalizeChips) {
int i, j;
SnpEntry *PopRefEntries;
int *ctr;
char Rotate[]="II**/*/*++--**\\++";

ctr = (int*)malloc(sizeof(int)*NumSnps);
/* Initialize PopulationMean */
for(i=0;i<NumSnps;i++) {
(*PopulationMean)[i] = 0.0;
ctr[i] = 0;
}
/* Now Get sum of all population */
fprintf(stderr, "Out of %d, currently on:\t %6d", NumPopRef, 0);
for(i=0;i<NumPopRef;i++) {
fprintf(stderr, "\b\b\b\b\b\b\b\b%c %6d", Rotate[(i+1)%16], i+1);
if((*PopRefChipTypes)[i] == 0) {
/* From a gpb file */
int curNumSnps = ReadChipIntoEntries(&PopRefEntries, SnpNames, NumSnps, PopRefFileNames[i]);
assert(curNumSnps > 0);
if(NormalizeChips==1) {
NormalizeChip(&PopRefEntries, NumSnps, MixtureEntries);
}
}
else if((*PopRefChipTypes)[i] == 1) {
/* From a genotype file */
int curNumSnps = ReadGenotypesIntoEntries(&PopRefEntries, SnpNames, NumSnps, PopRefFileNames[i]);
assert(curNumSnps > 0);
}
else {
fprintf(stderr, "Error. Incorrect chiptype specified for file %s. Terminating!\n", PopRefFileNames[i]);
exit(1);
}
/* Now add to Population Mean */
for(j=0;j<NumSnps;j++) {
if(PopRefEntries[j].ProbeMean>=0.0) {
(*PopulationMean)[j] += PopRefEntries[j].ProbeMean;

ctr[j]++;
}
}

free(PopRefEntries);
}
fprintf(stderr, "\n");
for(i=0;i<NumSnps;i++) {
(*PopulationMean)[i] /= (double)ctr[i];
}

free(ctr);
}

void NormalizeChip(SnpEntry **PeopleEntries, int NumSnps, SnpEntry **MixtureEntries) {
int i;
int poolCtr=0;
int peopleCtr=0;
int *PeopleEntriesOrder;
int *MixtureEntriesOrder;
SnpEntry *tempEntries;

/* Basically make sure that the distribution of the pool Entries is the same as the MixtureEntries */

/* First sort the Mixture Entries so that we get the correct distribution order */
MixtureEntriesOrder = (int*)malloc(sizeof(int)*NumSnps);
for(i=0;i<NumSnps;i++) {
MixtureEntriesOrder[i] = i;
}
SortSnpEntries(MixtureEntries, &MixtureEntriesOrder, 0, NumSnps-1);
/* sort the people entries to get the order */
PeopleEntriesOrder = (int*)malloc(sizeof(int)*NumSnps);
for(i=0;i<NumSnps;i++) {
PeopleEntriesOrder[i] = i;
}
SortSnpEntries(PeopleEntries, &PeopleEntriesOrder, 0, NumSnps-1);
/* Now match distributions */
/* Skip over -1's */
poolCtr=0;

while(poolCtr<NumSnps && (*MixtureEntries)[poolCtr].ProbeMean < 0) {
poolCtr++;
}
assert(poolCtr<NumSnps);
peopleCtr=0;
while(peopleCtr<NumSnps && (*PeopleEntries)[peopleCtr].ProbeMean < 0) {
peopleCtr++;
}
/* match distributions */
while(poolCtr<NumSnps && peopleCtr<NumSnps) {
assert((*MixtureEntries)[poolCtr].ProbeMean >= 0.0);
assert((*PeopleEntries)[peopleCtr].ProbeMean >= 0.0);
(*PeopleEntries)[peopleCtr].ProbeMean = (*MixtureEntries) [poolCtr].ProbeMean;
poolCtr++;
peopleCtr++;
}

/* Restore pool order */
tempEntries = (SnpEntry*)malloc(sizeof(SnpEntry)*NumSnps);
for(i=0;i<NumSnps;i++) {
assert(MixtureEntriesOrder[i] >= 0);
assert(MixtureEntriesOrder[i] < NumSnps);
tempEntries[i] = (*MixtureEntries)[MixtureEntriesOrder[i]];
}
for(i=0;i<NumSnps;i++) {
(*MixtureEntries)[i] = tempEntries[i];
}
/* restore people order */
for(i=0;i<NumSnps;i++) {
assert(PeopleEntriesOrder[i]>=0);
assert(PeopleEntriesOrder[i]<NumSnps);
tempEntries[i] = (*PeopleEntries)[PeopleEntriesOrder[i]];
}
for(i=0;i<NumSnps;i++) {
(*PeopleEntries)[i] = tempEntries[i];
}

/* Free Memory */
free(tempEntries);
free(MixtureEntriesOrder);
free(PeopleEntriesOrder);
}
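NormalizeChip forces the distribution of one chip's values onto that of the mixture by pairing values of equal rank, skipping negative (missing) entries. The following is a minimal sketch of that rank-matching idea on plain double arrays. It uses qsort on (value, index) pairs rather than the listing's merge sort, and it writes each replaced value straight back to the index recorded before sorting. The names Ranked, cmp_ranked and match_distribution are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* A value together with the position it had before sorting. */
typedef struct { double value; size_t index; } Ranked;

/* Order by value, then by original index so the result is deterministic. */
static int cmp_ranked(const void *a, const void *b)
{
    const Ranked *x = a;
    const Ranked *y = b;
    if (x->value < y->value) return -1;
    if (x->value > y->value) return 1;
    return (x->index > y->index) - (x->index < y->index);
}

/* Hypothetical helper: replace the k-th smallest non-negative value of
 * `sample` with the k-th smallest non-negative value of `reference`.
 * Negative values mark missing entries and are left untouched.
 * Returns the number of values replaced, or -1 on allocation failure. */
int match_distribution(double *sample, const double *reference, size_t n)
{
    if (n == 0) return 0;
    Ranked *s = malloc(n * sizeof *s);
    Ranked *r = malloc(n * sizeof *r);
    if (s == NULL || r == NULL) { free(s); free(r); return -1; }

    for (size_t i = 0; i < n; i++) {
        s[i].value = sample[i];    s[i].index = i;
        r[i].value = reference[i]; r[i].index = i;
    }
    qsort(s, n, sizeof *s, cmp_ranked);
    qsort(r, n, sizeof *r, cmp_ranked);

    size_t si = 0, ri = 0;
    while (si < n && s[si].value < 0.0) si++;   /* skip missing entries */
    while (ri < n && r[ri].value < 0.0) ri++;

    int replaced = 0;
    while (si < n && ri < n) {
        sample[s[si].index] = r[ri].value;      /* write back in place */
        si++; ri++; replaced++;
    }
    free(s);
    free(r);
    return replaced;
}
```

Because the reference array is only read and the sample is updated through the saved indices, no separate order-restoring pass is needed in this sketch.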
void SortSnpEntries(SnpEntry ** Entries, int ** EntriesOrder, int low, int high) {
/* MergeSort! */
int mid = (low + high)/2;
int start_upper = mid + 1;
int end_upper = high;
int start_lower = low;
int end_lower = mid;
int ctr, i;
SnpEntry * temp_entries;
int * temp_order;
if(low >= high) {
return;
}

/* Partition the list into two lists and then sort them recursively */
SortSnpEntries(Entries, EntriesOrder, low, mid);
SortSnpEntries(Entries, EntriesOrder, mid+1, high);

temp_entries = (SnpEntry *)malloc(sizeof(SnpEntry)*(high-low+1));
temp_order = (int*)malloc(sizeof(int)*(high-low+1));

/* Merge the two lists */
ctr = 0;
while( (start_lower<=end_lower) && (start_upper<=end_upper) ) {
if( (*Entries)[start_lower].ProbeMean <= (*Entries) [start_upper].ProbeMean ) {
temp_entries[ctr] = (*Entries)[start_lower];
temp_order[ctr] = (*EntriesOrder)[start_lower];
start_lower++;
}
else {
temp_entries[ctr] = (*Entries)[start_upper];
temp_order[ctr] = (*EntriesOrder)[start_upper];
start_upper++;
}
ctr++;
}
if(start_lower<=end_lower) {
while(start_lower<=end_lower) {
temp_entries[ctr] = (*Entries)[start_lower];
temp_order[ctr] = (*EntriesOrder)[start_lower];
ctr++;

start_lower++;
}
}
else {
while(start_upper<=end_upper) {
temp_entries[ctr] = (*Entries)[start_upper];
temp_order[ctr] = (*EntriesOrder)[start_upper];
ctr++;
start_upper++;
}
}
for(i=low, ctr=0;i<=high;i++, ctr++) {
(*Entries)[i] = temp_entries[ctr];
(*EntriesOrder)[i] = temp_order[ctr];
/* Check to see if we sorted properly */
assert(ctr<=0 || (*Entries)[i-1].ProbeMean<=(*Entries)[i].ProbeMean);
}
free(temp_entries);
free(temp_order);
}

#include "ExtractPMIntensities.h"
#include <algorithm>

// Globals defined elsewhere
extern unsigned int DICEPHIACODE;
extern int VERBOSE;

struct to_lower {
int operator() ( int ch ) {
return std::tolower ( ch );
}
};

int ExtractPMIntensities( string CelFileName, string CdfFileName, string CdfFileType, string ExperimentType, string ExpDir, bool SuppressOutputFlag, bool ProcessMMFlag, string QuerySnp, int MagicNumber, int VersionNumber, uint32_t Header, int ChipType, int Normalize, int FilterPercent) {
uint32_t SNPCount=0;
uint32_t prevSNPCounter=0;
uint32_t SNPCounter=0;
double PMAMean, PMAStd, PMBMean, PMBStd;
int PMAFilterPercent=0;
int PMBFilterPercent=0;
int CombinedFilterPercent=0;
int PMAMin, PMBMin;
int num_probes_filtered = 0;
int num_snps_filtered = 0;

string ErrorHeader="Aborting! Commandline input error:\n";
CCDFFileData CdfFile; // The CDF file object.
CCELFileData CelFile; // The CEL file object.
CCDFProbeSetInformation ProbeSet; // ProbeSet info object
bool SingleSnpMode = false;

cout << "***********************************************"
<< endl;
cout << "Starting Extraction"
<< endl;
CdfFile.SetFileName(CdfFileName.c_str());
if (CdfFile.Exists() == false) {
cout << "Aborting! Unable to find CDF file " << CdfFileName.c_str() << endl;
exit(1);
}
else {
if(CdfFile.Read() == false) {
cout << "Aborting! CDF file " << CdfFileName.c_str() << " found but unable to be read."
<< endl;
exit(1);
}
}
CelFile.SetFileName(CelFileName.c_str());
if (CelFile.Exists() == false) {
cout << "Aborting! Unable to find CEL file " << CelFileName.c_str() << endl;
exit(1);
}
else {
if(CelFile.Read() == false) {
cout << "Aborting! CEL file " << CelFileName.c_str() << " found but unable to read."
<< endl;
exit(1);
}
}
if ( VERBOSE >= 1) {
CCDFFileHeader &testheader = CdfFile.GetHeader();
cout<<" CDF file Rows: "<<testheader.GetRows()<<endl;
cout<<" CDF file Cols: "<<testheader.GetCols()<<endl;
cout<<" CDF file chip type: "<<CdfFile.GetChipType()<<endl;
cout<<" CEL file chip type: "<<CelFile.GetHeader().GetChipType()<<endl;
cout<<" CEL file Rows: "<<CelFile.GetHeader().GetRows()<<endl;
cout<<" CEL file Cols: "<<CelFile.GetHeader().GetCols()<<endl;
}
/*
std::string CdfChipType = CdfFile.GetChipType();
std::string CelChipType = CelFile.GetHeader().GetChipType();
std::transform(CdfChipType.begin(), CdfChipType.end(), CdfChipType.begin(), to_lower());
std::transform(CelChipType.begin(), CelChipType.end(), CelChipType.begin(), to_lower());
if(CdfChipType!=CelChipType) {
cout<<"******************************************************"<<endl;
cout<<"* Error: Cdf file and Cel file headers do not match! *"<<endl;
cout<<" Cdf file header: " << CdfChipType << endl;
cout<<" CEL file header: " << CelChipType << endl;
cout<<"* Please check the input Cdf and Cel files *"<<endl;
cout<<"* Program will terminate! *"<<endl;
cout<<"******************************************************"<<endl;
exit(1);
}
*/
PMAMin = 0;
PMBMin = 0;
int TotalNoOfProbes = 0;
if(1==Normalize || FilterPercent > 0) {
TotalNoOfProbes = GetMeanAndMin(CelFileName, CdfFileName, QuerySnp, &PMAMean, &PMBMean, &PMAMin, &PMBMin);
GetStandardDeviation(CelFileName, CdfFileName, QuerySnp, &PMAStd, &PMBStd, PMAMean, PMBMean);
GetFilterPercent(CelFileName, CdfFileName, QuerySnp, FilterPercent, &PMAFilterPercent, &PMBFilterPercent, &CombinedFilterPercent, TotalNoOfProbes);
cout << "Mean [PMA, PMB]:["
<< PMAMean << ", "
<< PMBMean << "]"
<< endl;
cout << "Standard Deviation: [PMA, PMB]:["
<< PMAStd << ", "
<< PMBStd << "]"
<< endl;
cout << "Min [PMA, PMB]:["
<< PMAMin << ", "
<< PMBMin << "]"
<< endl;
cout << "FilterPercent ("
<< FilterPercent << ") criteria were PMA ("
<< PMAFilterPercent << "), PMB ("
<< PMBFilterPercent << "), and Combined ("
<< CombinedFilterPercent << ")"
<< endl;

PMAFilterPercent -= (PMAMin-1);
PMBFilterPercent -= (PMBMin-1);
CombinedFilterPercent -= (PMAMin-1);
CombinedFilterPercent -= (PMBMin-1);
}
else {
PMAMean = 0;
PMBMean = 0;
}
/* rescale the means */
PMAMean -= (PMAMin-1);
PMBMean -= (PMBMin-1);

// If output files need to be produced, following are used
string OutputFileName=CelFileName+".db";
string SnpNames=OutputFileName+CdfFileType+"SnpNames.txt";
string EnzymeSnpNames=CdfFileType+"SnpNames.txt";
ofstream SnpNamesFile;
ofstream EnzymeSnpNamesFile;
FILE *CurrentOutputFile=NULL;

// List of files will be written here. This string goes in the "Experiment.txt" file supplied as input to the danalyze program
string FileNameList=ExpDir+ExperimentType+CdfFileType+"Files.txt";
ofstream FileNameListWriter;
FileNameListWriter.open(FileNameList.c_str(), fstream::out | fstream::app);
FileNameListWriter<<OutputFileName<<endl;
FileNameListWriter.close();
cout << "Outputting to "
<< OutputFileName << endl;
if(! SuppressOutputFlag) {
if(!(CurrentOutputFile=fopen(OutputFileName.c_str(), "wb"))) {
cout << "Aborting! Error opening file " << OutputFileName << " for writing."
<< endl;
exit(1);
}
else {
InitializeHeader(CurrentOutputFile, Header, SNPCount, ChipType, ProcessMMFlag, SingleSnpMode, Normalize);
// The following will store names of SNPs
SnpNamesFile.open(SnpNames.c_str(), ofstream::out);
EnzymeSnpNamesFile.open(EnzymeSnpNames.c_str(), ofstream::out);
}
}

CCDFFileHeader &header = CdfFile.GetHeader();
int nsets = header.GetNumProbeSets();
if (VERBOSE >= 1) cout << " Number of Probe Groups: " << nsets << endl;
int EntryCounter=0;
for (int Counter=0; Counter<nsets; Counter++) {
string CurrentUnitName = CdfFile.GetProbeSetName(Counter);
CdfFile.GetProbeSetInformation(Counter, ProbeSet);
// The following if condition makes sure that only probes containing the word SNP are retained. Controls are ignored
if(CurrentUnitName.find(QuerySnp, 0)!= string::npos &&
ProbeSet.GetProbeSetType() == GenotypingProbeSetType) {
//cout << "Processing SNP "<<CurrentUnitName<<endl;
// No. of NoOfPMProbes = NumOfCells/NoOfCellsPerList as there are NoOfCellsPerList cells per SNP (quartet)
int NoOfCellsPerList = 2*ProbeSet.GetNumCellsPerList();
int NoOfProbes=ProbeSet.GetNumCells()/NoOfCellsPerList;
//cout << "NoofQuartets ="<<NoOfPMProbes<<endl;
struct SNPData *inten = new SNPData[NoOfProbes];
int *KeepProbes = new int[NoOfProbes];

// Extract the data and print the results.
ExtractIntensities(ProbeSet, CelFile, inten);

/* First get the number of values to be written */
int NoOfProbesTrue=0;
int NoOfProbesFalse=0;
// Check to see if we are to process mismatch values and set the number of values accordingly.
if(ProcessMMFlag && NoOfCellsPerList == 2) {
cout << "Error. There are no mismatch intensities in this chip. Terminating!"
<< endl;
exit(1);
}
for (int j=0;j<NoOfProbes;j++) {
/* rescale values based on min */
inten[j].pmA -= (float)(PMAMin-1);
inten[j].pmB -= (float)(PMBMin-1);
if(FilterPercent > 0 && FilterPercent < 100 &&
(inten[j].pmA < PMAFilterPercent ||
inten[j].pmB < PMBFilterPercent ||
inten[j].pmA+inten[j].pmB < CombinedFilterPercent)) {
num_probes_filtered++;
KeepProbes[j] = 0;
NoOfProbesFalse++;
}

else {
KeepProbes[j] = 1;
NoOfProbesTrue++;
}

/* Normalize if necessary ONLY FOR PERFECT
MATCH values, others not implemented */
switch(Normalize) {
case 1:
inten[j].pmA *= 100;
inten[j].pmB *= 100;
inten[j].pmA /= PMAMean;
inten[j].pmB /= PMBMean;
break;
default:
break;
}
/* Check that we do not have too big of values */
if(inten[j].pmA > 65535) {
inten[j].pmA = 65535;
}
if(inten[j].pmB > 65535) {
inten[j].pmB = 65535;
}

/* Check that our values are greater than zero */
if(inten[j].pmA <= 0 || inten[j].pmB <=0) {
cout<<"Intensity values are negative. "
<<"Values are [pmA, pmB]= ["
<<inten[j].pmA << ", "
<<inten[j].pmB << "]. Exiting!"
<<endl;
exit(1);
}
if( (inten[j].pmA==0) || (inten[j].pmB==0)) {
cout << " ***** WARNING: Intensity values zero! for SNP "
<< CurrentUnitName << endl;
}

}
assert(NoOfProbes == NoOfProbesTrue +NoOfProbesFalse);

uint16_t NoOfValues;
if(ProcessMMFlag) {
NoOfValues=(uint16_t)(4*NoOfProbesTrue);
}
else {
NoOfValues=(uint16_t)(2*NoOfProbesTrue);
}

if(! SuppressOutputFlag && CurrentOutputFile) {
SnpNamesFile<<CurrentUnitName<<"\t"
<< (DICEPHIACODE*1000000+SNPCount) << "\t" << NoOfValues << endl;
EnzymeSnpNamesFile<<CurrentUnitName<<"\t"
<< DICEPHIACODE*1000000+SNPCount << "\t"
<< NoOfValues <<endl;
}
if(NoOfValues > 0) {
SNPCounter++;
int tempNoOfValues = NoOfValues;
int ctrNoOfValues = 0;
if(! SuppressOutputFlag &&
CurrentOutputFile) {
uint32_t tempSNPCounter =
DICEPHIACODE*1000000+SNPCount;
if((int)tempSNPCounter <=
(int)prevSNPCounter) {
fprintf(stderr, "%d\t%d\n", prevSNPCounter, tempSNPCounter);
}
assert((int)tempSNPCounter >
(int)prevSNPCounter);
prevSNPCounter = tempSNPCounter;
//fprintf(stdout, "%d", tempSNPCounter);

tempSNPCounter =
(uint32_t)htonl((uint32_t)tempSNPCounter);
fwrite(&tempSNPCounter, sizeof(uint32_t), 1, CurrentOutputFile);
//fprintf(stdout, "\t%d", NoOfValues);
NoOfValues=htons(NoOfValues);
fwrite(&NoOfValues, sizeof(uint16_t), 1, CurrentOutputFile);
}
EntryCounter+=2;
for (int j=0;j<NoOfProbes;j++) {

if(KeepProbes[j] == 1) {
if(ProcessMMFlag) {
// Write the perfect match and mismatch values to file. The
// type casting is **EXTREMELY** important to save space
uint16_t temp[4]={ (uint16_t)inten[j].pmA, (uint16_t)inten[j].pmB, (uint16_t)inten[j].mmA, (uint16_t)inten[j].mmB };

// Convert to network byte order if writing to an output file
if(! SuppressOutputFlag && CurrentOutputFile) {
temp[0]=htons(temp[0]);
temp[1]=htons(temp[1]);
temp[2]=htons(temp[2]);
temp[3]=htons(temp[3]);
fwrite(temp, sizeof(uint16_t), 4, CurrentOutputFile);
}

EntryCounter+=4;

ctrNoOfValues+=4;
}
else {
// Write the perfect match ONLY to file
uint16_t temp[2]={(uint16_t)inten[j].pmA, (uint16_t)inten[j].pmB};

if(! SuppressOutputFlag && CurrentOutputFile) {
// Convert to network byte order if writing to an output file
//double maf = (temp[0]+0.0)/(temp[0]+temp[1]+0.0);
//fprintf(stdout, "\t%f", maf);
temp[0]=htons(temp[0]);
temp[1]=htons(temp[1]);
fwrite(temp, sizeof(uint16_t), 2, CurrentOutputFile);
}
EntryCounter+=2;
ctrNoOfValues+=2;
}
}
}
//fprintf(stdout, "\n");
if(ctrNoOfValues != tempNoOfValues) {
fprintf(stderr, "ctrNoOfValues(%d) != tempNoOfValues(%d) on SNP # %d\n", ctrNoOfValues, tempNoOfValues, SNPCount);
exit(1);
}
}
else {
num_snps_filtered++;
}
SNPCount++;

delete []inten;
delete []KeepProbes;
}
else {
//cout << "Current ProbeSet = " << CdfFile.GetProbeSetName(Counter) << endl;
}
}

if (VERBOSE >= 1) cout << " Total number of entries stored in output file: "
<< EntryCounter << endl;

if(! SuppressOutputFlag) {

/* Write the results to the header in the output file */
WriteResultsToHeader( CurrentOutputFile, Header, SNPCount, (int)PMAMean, (int)PMBMean);

/* Close the current output file */
fclose(CurrentOutputFile);
/* Close the snp names file */
SnpNamesFile.close();
EnzymeSnpNamesFile.close();
}

cout << "Encountered " << SNPCounter << " Snps and Wrote " << SNPCount << " to file"
<< endl;
if(FilterPercent > 0) {
cout << "Out of " << TotalNoOfProbes << " no. probes filtered because of FilterPercent: "
<< num_probes_filtered << endl;
cout << "Total no. snps filtered because of FilterPercent: "
<< num_snps_filtered << endl;
}
cout << "Extraction Complete"
<< endl;
cout << "***********************************************"
<< endl;
return 0;
}

// Extracts the intensities for a single direction probes.
// Return True if successful.

bool ExtractIntensities(CCDFProbeSetInformation &probeSet, CCELFileData &celFile, struct SNPData *intensities) {

int NoOfGroups=probeSet.GetNumGroups();
// cout << "NoOfGroups = "<< NoOfGroups << endl;
if (NoOfGroups < 2) return false;

// Affy Docs state that the number of groups can be either 2 or 4. The chip design strategy for genotyping probe sets is to use a set of PM/MM probe pairs to interrogate the surrounding bases of the SNP for the forward and/or reverse target for both the A and B alleles.
//
// The CDF file defines a grouping of PM/MM probe pairs by direction and allele. Genotyping probe sets typically contain 2 or 4 of these groups.
//
// The 4 group probe set is defined by:
//
// Group 1 - probes interrogating the forward direction of the A allele target.
// Group 2 - probes interrogating the forward direction of the B allele target.
// Group 3 - probes interrogating the reverse direction of the A allele target.
// Group 4 - probes interrogating the reverse direction of the B allele target.
//
// The 2 group probe set is defined by:
//
// Group 1 - probes interrogating the X direction of the A allele target.
// Group 2 - probes interrogating the X direction of the B allele target.
// Where X is either the forward or reverse direction.
//
// Thus, based on the number of groups (2 or 4), only the forward, only the reverse, or both directions need to be searched.
// In the following, groupA[0]=first A group ; groupB[0]=first B group
// groupA[1]=second A group (defined if necessary) groupB[1]=second B group (defined if necessary)
// [0] will correspond to one direction and [1] to the other
//CCDFProbeGroupInformation groupA[NoOfGroups/2];
//CCDFProbeGroupInformation groupB[NoOfGroups/2];
//int nProbePairs[NoOfGroups/2];
CCDFProbeGroupInformation *groupA=new CCDFProbeGroupInformation[NoOfGroups/2];
CCDFProbeGroupInformation *groupB=new CCDFProbeGroupInformation[NoOfGroups/2];
int *nProbePairs=new int[NoOfGroups/2];
for (int i=0;i<NoOfGroups/2;i++) {
probeSet.GetGroupInformation(2*i, groupA[i]);
probeSet.GetGroupInformation(2*i+1, groupB[i]);
nProbePairs[i]=groupA[i].GetNumLists();
}

// Get each pair of probes for the A and B alleles and store the results in the intensities object.
// If the TBase values are the different from the A and B then this is the probe which interrogates the SNP location. The position values will be adjusted // based on this position.
int counter=0;

for(int j=0;j<NoOfGroups/2;j++) {

int NoOfProbesPerPair=groupA[j].GetNumCellsPerList();
if(NoOfProbesPerPair != groupB[j].GetNumCellsPerList()) {
cout << "Error. The number of Probes Per Pair do not match."
<< "\n" << "Group A: "
<< NoOfProbesPerPair << "\tGroup B: "
<< groupB[j].GetNumCellsPerList() << endl;
cout << "Terminating!"
<< endl;
exit(1);
}
CCDFProbeInformation *cellA = new CCDFProbeInformation[NoOfProbesPerPair];
CCDFProbeInformation *cellB = new CCDFProbeInformation[NoOfProbesPerPair];
float *intenA = new float[NoOfProbesPerPair];
float *intenB = new float[NoOfProbesPerPair];
for(int i=0,icel=0;
i<nProbePairs[j];
i++,icel+=NoOfProbesPerPair, counter++) {
// Get the intensities
for(int k=0;k<NoOfProbesPerPair;k++) {
groupA[j].GetCell(icel+k, cellA[k]);
intenA[k] = celFile.GetIntensity(cellA[k].GetX(), cellA[k].GetY() );
groupB[j].GetCell(icel+k, cellB[k]);
intenB[k] = celFile.GetIntensity(cellB[k].GetX(), cellB[k].GetY() );
}
// If there are no mismatch probes then GetNumCellsPerList() should return 1
// If there are mismatch probes then GetNumCellsPerList() should return 2
if(groupA[j].GetNumCellsPerList() == 1) {
intensities[counter].Direction=groupA[j].GetDirection();

if (IsPerfectMatch(cellA[0]) == true) {
intensities[counter].pmA =
intenA[0];
}
else {
cout << "Error. cellA is not a Perfect Match. Terminating!"
<< endl;
exit(1);
}
if (IsPerfectMatch(cellB[0]) == true) {
intensities[counter].pmB =
intenB[0];
}
else { cout << "Error. cellB is not a Perfect Match. Terminating!"
<< endl;
exit(1);
}
}
else {

intensities[counter].Direction=groupA[j].GetDirection();

if (IsPerfectMatch(cellA[0]) == true) {
intensities[counter].pmA =
intenA[0];
intensities[counter].mmA =
intenA[1];
}
else {
intensities[counter].mmA =
intenA[0];
intensities[counter].pmA =
intenA[1];
}
if (IsPerfectMatch(cellB[0]) == true) {
intensities[counter].pmB =
intenB[0];
intensities[counter].mmB =
intenB[1];
} else {
intensities[counter].mmB =

intenB[0];
intensities[counter].pmB =
intenB[1];
}
}
}
}

return true;
}

bool IsPerfectMatch(const CCDFProbeInformation &cell) {

// Determines if the probe is a PM probe or not
// Code based on Affy docs and hopefully Affy will not change their system!
char pbase = tolower(cell.GetPBase());
char tbase = tolower(cell.GetTBase());
return ( (pbase == 'a' && tbase == 't') || (pbase == 't' && tbase == 'a') ||
(pbase == 'g' && tbase == 'c') || (pbase == 'c' && tbase == 'g') );
}
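IsPerfectMatch treats a probe as a perfect match when its probe base and target base are Watson-Crick complements (a/t or g/c), ignoring case. A self-contained C sketch of the same test is given below, taking the two bases as plain characters in place of the Affymetrix cell object. The name is_complement_pair is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when the probe base and the target
 * base are complementary (a/t or g/c), case-insensitively. Any other
 * character, such as 'n', is never a match. */
bool is_complement_pair(char probe_base, char target_base)
{
    char p = (char)tolower((unsigned char)probe_base);
    char t = (char)tolower((unsigned char)target_base);

    switch (p) {
    case 'a': return t == 't';
    case 't': return t == 'a';
    case 'g': return t == 'c';
    case 'c': return t == 'g';
    default:  return false;
    }
}
```

A mismatch probe carries a probe base that is not the complement of the target base, so the function returns false for it; that is how the listing decides which of the two cells in a pair is PM and which is MM.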
int GetMeanAndMin(string CelFileName, string CdfFileName, string QuerySnp, double * PMAMean, double * PMBMean, int * PMAMin, int * PMBMin) {
uint32_t SNPCount=0;
uint32_t sumPMA=0;
uint32_t sumPMB=0;
int pmaMin = 0;
int pmbMin = 0;
int ctr=0;

int TotalNoOfProbes = 0;

string ErrorHeader="Aborting! Commandline input error:\n";

CCDFFileData CdfFile; // The CDF file object.
CCELFileData CelFile; // The CEL file object.
CCDFProbeSetInformation ProbeSet; // ProbeSet info object
CdfFile.SetFileName(CdfFileName.c_str());
if (CdfFile.Exists() == false) {
cout << "Aborting! Unable to find CDF file << CdfFileName.c_str() << endl;
exit(l);
}
else {
if(CdfFile.Read() == false) {
tout << "Aborting! CDF file << CdfFileName.c_str() << " found but unable to be read."
<< endl;
exit(1);
}
}
CelFile.SetFileName(CelFileName.c_strQ);
if (CelFile.Exists O == false) {
tout << "Aborting! Unable to find CEL file << CelFileName.c_strO
<< endl;
exit(l);
}
else {
if(CelFile.Read O == false) {
cout << "Aborting! CEL file << CelFileName.c_str() << " found but unable to read."
<< endl;
exit(l);
}
}
CCDFFileHeader &header = CdfFile.GetHeader O ;
int nsets = header.GetNumProbeSets();

int EntryCounter=0;
for (int Counter=0; Counter<nsets; Counter++) {
string CurrentUnitName = CdfFile.GetProbeSetName(Counter);
CdfFile.GetProbeSetInformation(Counter, ProbeSet);
// The following if condition makes sure that only probes containing
// the word SNP are retained. Controls are ignored
if(CurrentUnitName.find(QuerySnp, 0)!= string::npos &&
ProbeSet.GetProbeSetType() == GenotypingProbeSetType) {
//cout << "Processing SNP "<<CurrentUnitName<<endl;
int NoOfCellsPerList =
2*ProbeSet.GetNumCellsPerList();
int NoOfProbes=ProbeSet.GetNumCells()/
NoOfCellsPerList;
//cout << "NoofQuartets ="<<NoOfPMProbes<<endl;
struct SNPData *inten = new SNPData[NoOfProbes];
// Extract the data and print the results.
ExtractIntensities(ProbeSet, CelFile, inten);
uint16_t NoOfValues;
// Check to see if we are to process mismatch values and set
// the number of values accordingly.
NoOfValues=2*NoOfProbes;
NoOfValues=htons(NoOfValues);
EntryCounter++;
SNPCount++;
for (int j=0;j<NoOfProbes;j++) {
/* Check that our values are greater than zero */
if(inten[j].pmA <= 0 || inten[j].pmB <=0) {
cout<<"Intensity values are negative. "
<<"Values are [pmA, pmB]= ["
<<inten[j].pmA <<
inten[j].pmB << "]. Exiting!"
<<endl;
exit(1);
}
sumPMA += (uint32_t)inten[j].pmA;
sumPMB += (uint32_t)inten[j].pmB;
pmaMin = (pmaMin < (int)inten[j].pmA)?
pmaMin:((int)inten[j].pmA);
pmbMin = (pmbMin < (int)inten[j].pmB)?
pmbMin:((int)inten[j].pmB);
TotalNoOfProbes++;
ctr++;

}

delete []inten;
}
else {
//cout << "Current ProbeSet = " << CdfFile.GetProbeSetName(Counter) << endl;
}
}
(*PMAMean) = ( (double)sumPMA ) / ( (double)ctr );
(*PMBMean) = ( (double)sumPMB ) / ( (double)ctr );
(*PMAMin) = pmaMin;
(*PMBMin) = pmbMin;
return TotalNoOfProbes;
}
void GetStandardDeviation(string CelFileName, string CdfFileName, string QuerySnp, double * PMAStandardDeviation, double * PMBStandardDeviation, double PMAMean, double PMBMean) {
uint32_t SNPCount=0;
uint32_t ctr=0;
uint32_t curPMA=0;
uint32_t curPMB=0;
double PMASS = 0; // PMA squared sum
double PMBSS = 0; // PMB squared sum
string ErrorHeader="Aborting! Commandline input error:\n";
CCDFFileData CdfFile; // The CDF file object.
CCELFileData CelFile; // The CEL file object.
CCDFProbeSetInformation ProbeSet; // ProbeSet info object
CdfFile.SetFileName(CdfFileName.c_str());
if (CdfFile.Exists() == false) {
cout << "Aborting! Unable to find CDF file " << CdfFileName.c_str() << endl;
exit(1);
}
else {
if(CdfFile.Read() == false) {

cout << "Aborting! CDF file " << CdfFileName.c_str() << " found but unable to be read."
<< endl;
exit(1);
}
}
CelFile.SetFileName(CelFileName.c_str());
if (CelFile.Exists() == false) {
cout << "Aborting! Unable to find CEL file " << CelFileName.c_str() << endl;
exit(1);
}
else {
if(CelFile.Read() == false) {
cout << "Aborting! CEL file " << CelFileName.c_str() << " found but unable to read."
<< endl;
exit(1);
}
}
CCDFFileHeader &header = CdfFile.GetHeader();
int nsets = header.GetNumProbeSets();
int EntryCounter=0;
for (int Counter=0; Counter<nsets; Counter++) {
string CurrentUnitName = CdfFile.GetProbeSetName(Counter);
CdfFile.GetProbeSetInformation(Counter, ProbeSet);
// The following if condition makes sure that only probes containing
// the word SNP are retained. Controls are ignored
if(CurrentUnitName.find(QuerySnp, 0)!= string::npos) {
//cout << "Processing SNP "<<CurrentUnitName<<endl;
int NoOfCellsPerList =
2*ProbeSet.GetNumCellsPerList();
int NoOfProbes=ProbeSet.GetNumCells()/
NoOfCellsPerList;
//cout << "NoofQuartets ="<<NoOfPMProbes<<endl;
struct SNPData *inten = new SNPData[NoOfProbes];
// Extract the data and print the results.
ExtractIntensities(ProbeSet, CelFile, inten);

uint16_t NoOfValues;
NoOfValues=2*NoOfProbes;
NoOfValues=htons(NoOfValues);
EntryCounter++;
SNPCount++;

for (int j=0;j<NoOfProbes;j++) {
if( (inten[j].pmA==0) || (inten[j].pmB==0)) {
cout << " ***** WARNING: Intensity values zero! for SNP "
<< CurrentUnitName << endl;
}
curPMA = (uint32_t)inten[j].pmA;
curPMB = (uint32_t)inten[j].pmB;
PMASS += ( ( (double)curPMA ) - PMAMean ) * ( ( (double)curPMA ) - PMAMean );
PMBSS += ( ( (double)curPMB ) - PMBMean ) * ( ( (double)curPMB ) - PMBMean );
ctr++;
}

delete []inten;
}
else {
//cout << "Current ProbeSet = " << CdfFile.GetProbeSetName(Counter) << endl;
}
}
(*PMAStandardDeviation) = sqrt( PMASS / ( (double)(ctr-1) ) );
(*PMBStandardDeviation) = sqrt( PMBSS / ( (double)(ctr-1) ) );
}
void GetFilterPercent(string CelFileName, string CdfFileName, string QuerySnp, int FilterPercent, int * PMAFilterPercent, int * PMBFilterPercent, int * CombinedFilterPercent, int TotalNoOfProbes) {
uint32_t SNPCount=0;
int ctr=0;

int * PMAv;
int * PMBv;
int * Cv;

int NoOfProbes=0;
if(FilterPercent == 0) {
return;
}

fprintf(stderr, "Currently finding FilterPercent limits. This may take a while ...\n");
assert(FilterPercent>=0 && FilterPercent <= 100);
PMAv = (int*)malloc(sizeof(int)*TotalNoOfProbes);
PMBv = (int*)malloc(sizeof(int)*TotalNoOfProbes);
Cv = (int*)malloc(sizeof(int)*TotalNoOfProbes);
(*PMAFilterPercent) = 0;
(*PMBFilterPercent) = 0;
(*CombinedFilterPercent) = 0;

string ErrorHeader="Aborting! Commandline input error:\n";
CCDFFileData CdfFile; // The CDF file object.
CCELFileData CelFile; // The CEL file object.
CCDFProbeSetInformation ProbeSet; // ProbeSet info object
CdfFile.SetFileName(CdfFileName.c_str());
if (CdfFile.Exists() == false) {
cout << "Aborting! Unable to find CDF file " << CdfFileName.c_str() << endl;
exit(1);
}
else {
if(CdfFile.Read() == false) {
cout << "Aborting! CDF file " << CdfFileName.c_str() << " found but unable to be read."
<< endl;
exit(1);
}
}
CelFile.SetFileName(CelFileName.c_str());
if (CelFile.Exists() == false) {

cout << "Aborting! Unable to find CEL file " << CelFileName.c_str() << endl;
exit(1);
}
else {
if(CelFile.Read() == false) {
cout << "Aborting! CEL file " << CelFileName.c_str() << " found but unable to read."
<< endl;
exit(1);
}
}
CCDFFileHeader &header = CdfFile.GetHeader();
int nsets = header.GetNumProbeSets();

int EntryCounter=0;
int i=0;
for (int Counter=0; Counter<nsets; Counter++) {
string CurrentUnitName = CdfFile.GetProbeSetName(Counter);
CdfFile.GetProbeSetInformation(Counter, ProbeSet);
// The following if condition makes sure that only probes containing
// the word SNP are retained. Controls are ignored
if(CurrentUnitName.find(QuerySnp, 0)!= string::npos &&
ProbeSet.GetProbeSetType() == GenotypingProbeSetType) {
//cout << "Processing SNP "<<CurrentUnitName<<endl;
int NoOfCellsPerList =
2*ProbeSet.GetNumCellsPerList();
NoOfProbes=ProbeSet.GetNumCells()/NoOfCellsPerList;
//cout << "NoofQuartets ="<<NoOfPMProbes<<endl;
struct SNPData *inten = new SNPData[NoOfProbes];
// Extract the data and print the results.
ExtractIntensities(ProbeSet, CelFile, inten);
uint16_t NoOfValues;
// Check to see if we are to process mismatch values and set
// the number of values accordingly.
NoOfValues=2*NoOfProbes;
NoOfValues=htons(NoOfValues);

EntryCounter++;
SNPCount++;
for (int j=0;j<NoOfProbes;j++) {
/* Check that our values are greater than zero */
if(inten[j].pmA <= 0 || inten[j].pmB <=0) {
cout<<"Intensity values are negative. "
<<"Values are [pmA, pmB]= ["
<<inten[j].pmA <<
inten[j].pmB << "]. Exiting!"
<<endl;
exit(1);
}
PMAv[i] = (int)inten[j].pmA;
PMBv[i] = (int)inten[j].pmB;
Cv[i] = (int)(inten[j].pmA + inten[j].pmB);
i++;
}
delete []inten;
}
else {
//cout << "Current ProbeSet = " << CdfFile.GetProbeSetName(Counter) << endl;
}
}
assert(i == TotalNoOfProbes);

fprintf(stderr, "Out of %d, currently on:\t %10d", 3*TotalNoOfProbes, 0);
sort(PMAv, 0, TotalNoOfProbes-1, 0);
sort(PMBv, 0, TotalNoOfProbes-1, TotalNoOfProbes);
sort(Cv, 0, TotalNoOfProbes-1, 2*TotalNoOfProbes);
for(i=0;i<TotalNoOfProbes-1;i++) {
assert(PMAv[i] <= PMAv[i+1]);
assert(PMBv[i] <= PMBv[i+1]);
assert(Cv[i] <= Cv[i+1]);
}
fprintf(stderr, "\n");

ctr = (int)(TotalNoOfProbes/100.0)*FilterPercent;
assert(ctr>=0 && ctr<TotalNoOfProbes);
(*PMAFilterPercent) = PMAv[ctr];
(*PMBFilterPercent) = PMBv[ctr];

(*CombinedFilterPercent) = Cv[ctr];

free(PMAv);
free(PMBv);
free(Cv);
return;
}
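
As a rough illustration of the cutoff selection that GetFilterPercent performs (sort the intensities, then read off the value at a percentile index), here is a minimal C sketch. The names `compare_ints` and `percentile_cutoff` are hypothetical and not part of the listing, the standard `qsort` stands in for the listing's own `sort` routine, and the index is computed as `n * percent / 100`, which is an assumption that differs slightly from the truncation order used in the listing.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of ints. */
static int compare_ints(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper (not in the listing): sorts `values` in place and
   returns the intensity found at the `percent`-th percentile position.
   Probes with an intensity below the returned value would be filtered. */
int percentile_cutoff(int *values, int n, int percent) {
    assert(n > 0 && percent >= 0 && percent < 100);
    qsort(values, (size_t)n, sizeof(int), compare_ints);
    int index = (int)((long)n * percent / 100);
    return values[index];
}
```

Calling `percentile_cutoff` once each for the PMA, PMB and combined intensity arrays would yield the three limits that the listing stores through its output pointers.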

#define MATHLIB_STANDALONE 1
#include "Rmath.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <limits.h> /* For INT_MIN */
#include <math.h>
#include <assert.h>
#include "Statistic.h"
#include "Definitions.h"
double ComputeStatistic( SnpEntry **MixtureEntries, int NumSnps, SnpEntry **PeopleEntries, double **PopulationMean, int TestStatistic, int DistanceMeasure, int CorrelationXDistance, int CorrelationYDistance, int Output, int *NumSnpsUsed) {
double T=-1.0;

/* Get the Test Statistic specified by the user */
switch(TestStatistic) {
case 1:
assert(CorrelationXDistance == 0);
assert(CorrelationYDistance == 0);
T = GetMeanDifference(MixtureEntries, NumSnps, PeopleEntries, PopulationMean, DistanceMeasure, Output, NumSnpsUsed);
break;
case 2:
assert(DistanceMeasure == 0);
T = GetPearsonCorrelation(MixtureEntries, NumSnps, PeopleEntries, PopulationMean, CorrelationXDistance, CorrelationYDistance, Output, NumSnpsUsed);
break;
case 3:
assert(DistanceMeasure == 0);
T = GetSpearmanCorrelation(MixtureEntries, NumSnps, PeopleEntries, PopulationMean, CorrelationXDistance, CorrelationYDistance, Output, NumSnpsUsed);

break;
case 4:
assert(CorrelationXDistance == 0);
assert(CorrelationYDistance == 0);
T = -1.0*GetWilcoxonSignRankTest(MixtureEntries, NumSnps, PeopleEntries, PopulationMean, DistanceMeasure, Output, NumSnpsUsed);
break;
case 5:
T = GetLikelihoodRatio(MixtureEntries, NumSnps, PeopleEntries, PopulationMean, DistanceMeasure, Output, NumSnpsUsed);
break;
default:
fprintf(stderr, "Error. Incorrect Test Statistic %d. Terminating!\n", TestStatistic);
exit(1);
}
return T;
}

void ComputeStatistics( double **PopRefTestStatistics, int NumPopRef, char **PopRefFileNames, double **PeopleTestStatistics, int NumPeople, char **PeopleFileNames, double **MeanPeopleTestStatistics, int NumMeanPeople, char **MeanPeopleFileNames, int NormalizeTestStatistic, int MeanNormalize, int PrintSummary, char *OutputFileName) {
int i;
double DistMean=0.0;
double DistVar=0.0;
FILE *Fp;

/* First Get Mean to Normalize */
if(MeanNormalize==1) {
/* Use specified people */
DistMean = 0.0;
for(i=0; i<NumMeanPeople;i++) {

DistMean += (*MeanPeopleTestStatistics)[i];
}
DistMean /= NumMeanPeople;
}
else {
/* Use the reference population */
DistMean=0.0;
for(i=0;i<NumPopRef;i++) {
DistMean += (*PopRefTestStatistics)[i];
}
DistMean /= (double)NumPopRef;
}
/* Get the Variance to Normalize */
DistVar=0.0;
for(i=0;i<NumPopRef;i++) {
DistVar += ((*PopRefTestStatistics)[i] -DistMean)*((*PopRefTestStatistics)[i] - DistMean);
}
assert(NumPopRef > 1);
DistVar /= (double)(NumPopRef - 1.0); /* Sample variance */
/* Open output file */
if(!(Fp=fopen(OutputFileName, "w"))) {
fprintf(stderr, "Error. Could not open %s for writing. Terminating!\n", OutputFileName);
exit(1);
}
/* Compute Test Statistics for the Reference Population */
if(PrintSummary==1) {
fprintf(stderr, "Analyzing Reference Population:\n");
}
for(i=0;i<NumPopRef;i++) {
double T = (*PopRefTestStatistics)[i];
if(NormalizeTestStatistic==1) {
T = (T - DistMean)/sqrt(DistVar);
}
T = -1.0*T;
double pvalue = pnorm(T, 0.0, 1.0, 1, 0);
if(PrintSummary == 1) {
fprintf(stderr, "%s\t%.25lf\t%1.25lf\t%s\n", PopRefFileNames[i], T, pvalue, "ReferencePopulation");
}
fprintf(Fp, "%s\t%.25lf\t%1.25lf\t%s\n", PopRefFileNames[i], T, pvalue, "ReferencePopulation");

fflush(Fp);
}
if(PrintSummary==1) {
fprintf(stderr, "%s", BREAK_LINE);
}

/* Compute Test Statistics for the People of Interest */
if(PrintSummary==1) {
fprintf(stderr, "%s", BREAK_LINE);
fprintf(stderr, "Analyzing People of Interest:\n");
}
for(i=0;i<NumPeople;i++) {
double T = (*PeopleTestStatistics)[i];
if(1==NormalizeTestStatistic) {
T = (T - DistMean)/sqrt(DistVar);
}
T = -1.0*T;
double pvalue = pnorm(T, 0.0, 1.0, 1, 0);
if(PrintSummary == 1) {
fprintf(stderr, "%s\t%.25lf\t%1.25lf\t%s\n", PeopleFileNames[i], T, pvalue, "PeopleOfInterest");
}
fprintf(Fp, "%s\t%.25lf\t%1.25lf\t%s\n", PeopleFileNames[i], T, pvalue, "PeopleOfInterest");
fflush(Fp);
}
fclose(Fp);
}

double GetMeanDifference( SnpEntry **MixtureEntries, int NumSnps, SnpEntry **PeopleEntries, double **PopulationMean, int DistanceMeasure, int Output, int *NumSnpsUsed) {
int i;
int n=0;
double DiffSum = 0.0;
double DiffMean = 0.0;
double DiffVar = 0.0;
/* Get the Mean */
DiffMean=0;

n=0;
for(i=0;i<NumSnps;i++) {
if((*PeopleEntries)[i].ProbeMean >= 0.0 &&
(*MixtureEntries)[i].ProbeMean >= 0.0 &&
(*PopulationMean)[i] >= 0.0) {

double RAS = (*PeopleEntries)[i].ProbeMean;
double MixtureRAS = (*MixtureEntries)[i].ProbeMean;
if(Output==1) {
fprintf(stdout, "%lf\t%lf\t%lf\n", RAS, MixtureRAS, (*PopulationMean)[i]);
}
double Diff = GetDistance(RAS, MixtureRAS, (*PopulationMean)[i], DistanceMeasure);
DiffSum += Diff;
n++;
}
}
DiffMean = DiffSum/(double)n;
(*NumSnpsUsed)=n;
/* Get the variance */
DiffVar = 0.0;
n=0;
for(i=0;i<NumSnps;i++) {
if((*PeopleEntries)[i].ProbeMean >= 0.0 &&
(*MixtureEntries)[i].ProbeMean >= 0.0 &&
(*PopulationMean)[i] >= 0.0) {

double RAS = (*PeopleEntries)[i].ProbeMean;
double MixtureRAS = (*MixtureEntries)[i].ProbeMean;
double Diff = GetDistance(RAS, MixtureRAS, (*PopulationMean)[i], DistanceMeasure);
DiffVar += (Diff - DiffMean)*(Diff - DiffMean);
n++;
}
}
DiffVar /= (n-1.0);
assert(n==(*NumSnpsUsed));
double Numerator = DiffMean;

double Denominator = sqrt(DiffVar)/sqrt(n);
double A = Numerator/Denominator;
/*
return DiffMean;
*/
return A;
}
double GetPearsonCorrelation( SnpEntry **MixtureEntries, int NumSnps, SnpEntry **PeopleEntries, double **PopulationMean, int CorrelationXDistance, int CorrelationYDistance, int Output, int *NumSnpsUsed) {
int i;
double SumX = 0.0;
double SumY = 0.0;
double SumSqX = 0.0;
double SumSqY = 0.0;
double SumCoProduct = 0.0;
int n = 0;
for(i=0;i<NumSnps;i++) {
if((*PeopleEntries)[i].ProbeMean >= 0.0 &&
(*MixtureEntries)[i].ProbeMean >= 0.0 &&
(*PopulationMean)[i] >= 0.0) {

double RAS = (*PeopleEntries)[i].ProbeMean;
double MixtureRAS = (*MixtureEntries)[i].ProbeMean;
double X=0.0;
double Y=0.0;
X = GetCorrelationDistance(RAS, MixtureRAS, (*PopulationMean)[i], CorrelationXDistance);
Y = GetCorrelationDistance(RAS, MixtureRAS, (*PopulationMean)[i], CorrelationYDistance);
if(Output==1) {
fprintf(stdout, "%lf\t%lf\t%lf\t%lf\t%lf \n", RAS, MixtureRAS, (*PopulationMean)[i], X, Y);
}

SumX += X;
SumY += Y;
SumSqX += X*X;
SumSqY += Y*Y;
SumCoProduct += X*Y;
n++;
}
}
(*NumSnpsUsed)=n;
double cNumerator = (SumCoProduct) - SumX*SumY/n;
double cDenominator = sqrt(SumSqX - SumX*SumX/n)*sqrt(SumSqY - SumY*SumY/n);
double c = cNumerator/cDenominator;
return c;
}
double GetSpearmanCorrelation( SnpEntry **MixtureEntries, int NumSnps, SnpEntry **PeopleEntries, double **PopulationMean, int CorrelationXDistance, int CorrelationYDistance, int Output, int *NumSnpsUsed) {
int i;
int n=0;
double * MixtureRASs;
double * RASs;

MixtureRASs = (double*)malloc(sizeof(double)*NumSnps);
RASs = (double*)malloc(sizeof(double)*NumSnps);
for(i=0;i<NumSnps;i++) {
if((*PeopleEntries)[i].ProbeMean >= 0.0 &&
(*MixtureEntries)[i].ProbeMean >= 0.0 &&
(*PopulationMean)[i] >= 0.0) {

double RAS = (*PeopleEntries)[i].ProbeMean;

double MixtureRAS = (*MixtureEntries)[i].ProbeMean;

double X=0.0;
double Y=0.0;
X = GetCorrelationDistance(RAS, MixtureRAS, (*PopulationMean)[i], CorrelationXDistance);
Y = GetCorrelationDistance(RAS, MixtureRAS, (*PopulationMean)[i], CorrelationYDistance);
if(Output==1) {
fprintf(stdout, "%lf\t%lf\t%lf\t%lf\t%lf \n", RAS, MixtureRAS, (*PopulationMean)[i], X, Y);
}
MixtureRASs[n] = X;
RASs[n] = Y;
n++;
}
}
(*NumSnpsUsed)=n;
/* Get Spearman rank correlation */
/* Update RAS Ranks */
SortDoubles(&RASs, &MixtureRASs, 0, n-1);
for(i=0;i<n;i++) {
RASs[i] = i+1.0;
}
SortDoubles(&MixtureRASs, &RASs, 0, n-1);
for(i=0;i<n;i++) {
MixtureRASs[i] = i+1.0;
}

double Numerator=0.0;
for(i=0;i<n;i++) {
/*
if(Output==1) {
fprintf(stdout, "%lf\t%lf\n", MixtureRASs[i], RASs[i]);
}
*/
Numerator += (MixtureRASs[i] - RASs[i])*(MixtureRASs[i] -RASs[i]);
}
Numerator = 6.0*Numerator;

if(Output==1) {
fprintf(stdout, "Numerator:%lf\n", Numerator);
}
double Denominator=(double)n*((double)n*n - 1.0);
double c = 1.0 - Numerator/Denominator;
if(Output==1) {
fprintf(stdout, "c:%lf\n", c);
}

/* Free memory */
free(MixtureRASs);
free(RASs);

/* test statistic */
double ans = c/sqrt((1.0 - c*c)/(n-2.0));
fprintf(stdout, "c:%lf\tans:%lf\tn:%d\n", c, ans, n);
return ans;
}
double GetWilcoxonSignRankTest( SnpEntry **MixtureEntries, int NumSnps, SnpEntry **PeopleEntries, double **PopulationMean, int DistanceMeasure, int Output, int *NumSnpsUsed) {
int i;
int n = 0;

double *Rank; /* Rank */
double *Sign; /* Sign */
Rank = (double*)malloc(sizeof(double)*NumSnps);
Sign = (double*)malloc(sizeof(double)*NumSnps);
for(i=0;i<NumSnps;i++) {
if((*PeopleEntries)[i].ProbeMean >= 0.0 &&
(*MixtureEntries)[i].ProbeMean >= 0.0 &&

(*PopulationMean)[i] >= 0.0) {

double RAS = (*PeopleEntries)[i].ProbeMean;
double MixtureRAS = (*MixtureEntries)[i].ProbeMean;
double Diff=0.0;
Diff = GetDistance(RAS, MixtureRAS, (*PopulationMean)[i], DistanceMeasure);
if(Output==1) {
fprintf(stdout, "%lf\t%lf\t%lf\t%lf\n", RAS, MixtureRAS, (*PopulationMean)[i], Diff);
}
int j=n;
Rank[j] = Diff;
if(Rank[j] > 0) {
Sign[j] = 1.0;
}
else {
Sign[j] = 0.0;
}
n++;
}
}
(*NumSnpsUsed)=n;
/* Update to absolute values */
for(i=0;i<n;i++) {
Rank[i] = fabs(Rank[i]);
}
/* Sort the absolute values to get the rank */
SortDoubles(&Rank, &Sign, 0, n-1);
/* Update rank */
for(i=0;i<n;i++) {
Rank[i] = i+1.0;
}

/* Get the sum */
double sum = 0.0;
for(i=0;i<n;i++) {
sum += Rank[i]*Sign[i];
}

/* Free memory */
free(Rank);
free(Sign);
/* test statistic (z score)*/
double ans = (sum - n*(n-1.0)/4.0)/(n*(n+1)*(2.0*n + 1.0)/24.0);
return ans;
}
double GetDistance(double RAS, double MixtureRAS, double PopulationMean, int Method) {
double MixtureDiff = fabs(RAS - MixtureRAS);
double PopDiff = fabs(RAS - PopulationMean);
double Diff = 0.0;

switch(Method) {
case 1:
/* Method 1 */
Diff = PopDiff - MixtureDiff;
break;
case 2:
/* Method 2 */
Diff = PopDiff - MixtureDiff;
if(Diff > 0) {
Diff = 1.0;
}
else if(Diff<0) {
Diff = -1.0;
}
else {
Diff = 0.0;
}
break;
case 3:
/* Method 3 */
if( MixtureRAS < 0.5) {
Diff = (PopDiff - MixtureDiff)/MixtureRAS;
}
else {
Diff = (PopDiff - MixtureDiff)/(1.0 -MixtureRAS);
}

break;
case 4:
/* Method 4 */
Diff = (PopDiff - MixtureDiff)/(MixtureRAS*(1.0-MixtureRAS));
break;
case 5:
/* Method 5 */
Diff = (MixtureRAS - RAS);
break;
case 6:
/* Method 6 */
Diff = fabs(MixtureRAS - RAS);
break;
default:
fprintf(stderr, "Error wrong method in Get Mean Difference in Statistic.c\n");
exit(1);
break;
}
return Diff;
}
double GetCorrelationDistance(double RAS, double MixtureRAS, double PopulationMean, int CorrelationDistance) {
double Ans=0.0;
switch(CorrelationDistance) {
case 1:
Ans = MixtureRAS - PopulationMean;
break;
case 2:
Ans = MixtureRAS - PopulationMean;
break;
case 3:
Ans = (RAS - PopulationMean);
break;
case 4:
Ans = (RAS - PopulationMean);
break;
case 5:
Ans = (PopulationMean - MixtureRAS);
break;
case 6:
Ans = (PopulationMean - MixtureRAS);

break;
case 7:
Ans = RAS - MixtureRAS;
break;
case 8:
Ans = RAS - MixtureRAS;
break;
case 9:
Ans = RAS;
break;
case 10:
Ans = MixtureRAS;
break;
case 11:
Ans = PopulationMean;
break;
default:
fprintf(stderr, "Error. Method %d not implemented in GetCorrelationDistance. Terminating!\n", CorrelationDistance);
exit(1);
}
if(CorrelationDistance==2 ||
CorrelationDistance==4 ||
CorrelationDistance==6 ||
CorrelationDistance==8) {
if(Ans > 0) {
Ans = 1.0;
}
else if(Ans < 0) {
Ans = -1.0;
}
else {
Ans = 0.0;
}
}
return Ans;
}

void SortDoubles(double **A, double **B, int low, int high) {
/* MergeSort! */
int mid = (low + high)/2;
int start_upper = mid + 1;

int end_upper = high;
int start_lower = low;
int end_lower = mid;
int ctr, i;

double *tempA;
double *tempB;
if(low >= high) {
return;
}

/* Partition the list into two lists and then sort them recursively */
SortDoubles(A, B, low, mid);
SortDoubles(A, B, mid+1, high);

tempA = (double*)malloc(sizeof(double)*(high-low+1));
tempB = (double*)malloc(sizeof(double)*(high-low+1));
/* Merge the two lists */
ctr = 0;
while( (start_lower<=end_lower) && (start_upper<=end_upper) ) {
if ((*A)[start_lower] <= (*A)[start_upper]) {
tempA[ctr] =(*A)[start_lower];
tempB[ctr] =(*B)[start_lower];
start_lower++;
}
else {
tempA[ctr] = (*A)[start_upper];
tempB[ctr] = (*B)[start_upper];
start_upper++;
}
ctr++;
}
if(start_lower<=end_lower) {
while(start_lower<=end_lower) {
tempA[ctr] =(*A)[start_lower];
tempB[ctr] =(*B)[start_lower];
ctr++;
start_lower++;
}
}
else {
while(start_upper<=end_upper) {

tempA[ctr] = (*A)[start_upper];
tempB[ctr] = (*B)[start_upper];
ctr++;
start_upper++;
}
}
for(i=low, ctr=0;i<=high;i++, ctr++) {
(*A)[i] = tempA[ctr];
(*B)[i] = tempB[ctr];
/* Check to see if we sorted properly */
if(ctr>0 && (*A)[i-1] > (*A)[i]) {
fprintf(stderr, "Sorted improperly\n");
fprintf(stderr, "%lf\t%lf\n", (*A)[i-1], (*A)[i]);
exit(1);
}
}
free(tempA);
free(tempB);
}

double GetLikelihoodRatio( SnpEntry **MixtureEntries, int NumSnps, SnpEntry **PeopleEntries, double **PopulationMean, int DistanceMeasure, int Output, int *NumSnpsUsed) {
int i;
int n = 0;

/* use the log likelihood since it is numerically stable */
double log_likelihood_mixture=0; /* log-likelihood that we are in the mixture */
double log_likelihood_population=0; /* log-likelihood that we are in the population */

/* Get the Mean */
n=0;
for(i=0;i<NumSnps;i++) {
if((*PeopleEntries)[i].ProbeMean >= 0.0 &&
(*MixtureEntries)[i].ProbeMean >= 0.0 &&

(*PopulationMean)[i] >= 0.0) {

double RAS = (*PeopleEntries)[i].ProbeMean;
double MixtureRAS = (*MixtureEntries)[i].ProbeMean;
double PopMean = (*PopulationMean)[i];
double a = 0;
double b = 0;

/* check that we have genotypes */
assert( RAS > 0.99 || RAS < 0.01 || (RAS > 0.49 &&
RAS < 0.51));

/* TODO: must assure that we aren't taking the log of a really, really small number, otherwise we have infinities */
if(RAS > 0.99) { /* we have two copies */
/* Under HWE: p^2 */
a = log(MixtureRAS*MixtureRAS);
b =log(PopMean*PopMean);
}
else if(RAS < 0.01) { /* we have zero copies */
/* Under HWE: q^2 where q=1-p */
a=log((1.0-MixtureRAS)*(1.0-MixtureRAS));
b=log((1.0-PopMean)*(1.0-PopMean));
}
else if(RAS > 0.49 && RAS < 0.51) { /* we have one copy */
/* Under HWE: 2pq where q=1-p */
a=log(2.0*MixtureRAS*(1.0-MixtureRAS));
b=log(2.0*PopMean*(1.0-PopMean));
}
else {
fprintf(stderr, "Error: expecting genotypes in GetLikelihoodRation in Statistic.c. Terminating!\n");
exit(1);
}
if(1!=isinf(a) && 1!=isinf(b)) {
log_likelihood_mixture+=a;
log_likelihood_population+=b;
/* update the number of snps used */
n++;
}
}
}
(*NumSnpsUsed)=n;

/* compute posterior odds ratio */
double A=log_likelihood_mixture/log_likelihood_population;
return -2.0*A;
}

#include <stdio.h>
#include <stdlib.h>
#include "GetHostMachineEndianness.h"
enum {Little, Big};

int GetEndian() {
int i = 0x12345678;
if (*(char*)&i==0x12) {
/* printf("Big endian\n"); */
return Big;
}
else if (*(char*)&i==0x78) {
/* printf("Little endian\n"); */
return Little;
}
else {
printf("You invented a new architecture! Congratulations. Start a company!");
exit(0);
}
}
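
GetEndian inspects the first byte of a known integer to learn the host byte order. An alternative that avoids depending on host byte order at all is to assemble and split values with shifts, which always produces big-endian (network order) bytes. The C sketch below illustrates this for 16-bit values. The names `put_be16`, `get_be16` and `be16_self_check` are hypothetical and are not part of the listing, which uses the `htons`/`ntohs` family instead.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 16-bit value into two bytes in big-endian (network) order. */
void put_be16(uint8_t out[2], uint16_t value) {
    out[0] = (uint8_t)(value >> 8);    /* most significant byte first */
    out[1] = (uint8_t)(value & 0xFF);
}

/* Read a 16-bit value from two bytes stored in big-endian order. */
uint16_t get_be16(const uint8_t in[2]) {
    return (uint16_t)(((uint16_t)in[0] << 8) | in[1]);
}

/* Small self-check: round-trips one value through a byte buffer. */
void be16_self_check(void) {
    uint8_t buf[2];
    put_be16(buf, 0x1234);
    assert(buf[0] == 0x12 && buf[1] == 0x34);
    assert(get_be16(buf) == 0x1234);
}
```

Because the shifts operate on values rather than on memory layout, the same code yields identical bytes on little-endian and big-endian hosts.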

// Next stdio.h included because ofstream object did not take the binary write flag ios::binary on some gcc compilers.
// So trying to use standard C functions for binary file i/o
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

// The following file contains host to network and network to host
// byte order functions. The db files will ALWAYS be stored in network
// order (i.e., Big endian format). When danalyze program reads a db file,
// it __MUST__ convert each value to the native machine using ntohl or
// ntohs functions.
// This conversion implies that the db files are portable across
// Little Endian (x86) and Big Endian (PowerPC, SPARC) machines as long as
// danalyze uses appropriate ntohs family of functions
#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
#ifdef HAVE_SYS_SOCKET_H
#include <sys/socket.h>
#endif
#ifdef HAVE_NETINET_H
#include<netinet/in.h>
#endif
#include "main.h"
unsigned int GenerateCode(string);
// Globals.
// Each cpp file that needs to use any of these will need to locally define
// them using the extern keyword so the linker can resolve.
unsigned int DICEPHIACODE=0;
int VERBOSE = 0;
// Next flag used for Illumina only, to skip negative values.
// In the future, a command line flag may be used.
int NegativeValuesSkipFlag=1;
int main(int argc, char **argv) {
int c, OptErr=O;
char OptionString[]="c:i:l:f:q:s:e:d:C:D:L:S:hVFnv";
char Usage[]= " [options]\n\n"
"Common Options\n"
"-c ChipType 0 = Affymetrix, 1 = Illumina (required)\n"
"-s SampleType Must be one of \"MixtureOfInterest\", \"ReferencePopulation\", \"PeopleOfInterest\", \"MeanPeople\" (required)\n"
"-n Normalize Normalize by dividing by the mean channel intensity (only do this if you are not using quantile normalization in danalyze)\n"
"Affy specific options ---------------------------------------\n"
"-l CELFileName CEL file to be processed\n"
"-f CDFFileName CDF file to match CEL file\n"
"-e EnzymeName Type of Enzyme. Nsp, Sty, Xba, 5.0 etc.\n"
"-D FilterPercent Filter lowest x% of snps based on PMA, PMB and combined intensities\n"
// "-m Extract both match and mismatch values\n"
"Illumina specific options -------------------------------------\n"
"-i IlluminaFiles A file containing file names\n"
"-L FilterLimit Skip beads with intensity less than FilterLimit\n"
" (default all beads included)\n"
"-S FilterStdDev Skip beads with intensity > FilterDev standard\n"
" deviations from channel mean (default 0)\n"
"-C FilterPercent Filter lowest x% of beads based on individual and combined intensities\n"
" (default all beads included)\n"
"Output options\n"
"-d ExpDir Output directory. (default null). If specified\n"
" don\'t forget the last / E.g., -d DataFiles/\n"
"-F Do not write output files (optional)\n"
"-V Verbose mode (optional)\n"
"-v Print version number\n"
"-h Display this help screen\n";
// The QuerySNP option has been temporarily removed because it relies on
// the old verbose output. Will need to be refactored to independence.
// "-q QuerySnp ID of single SNP to extract (optional) \n"

int FilterPercent=0; // -C for Illumina and -D for Affymetrix
// Affy specific
string CelFileName=""; // -l
string CdfFileName=""; // -f
string EnzymeName=""; // -e
bool ProcessMMFlag=false; // -m option sets this to true
// Illumina specific
string IlluminaFiles=""; // -i
int FilterLimit=0; // -L
double FilterStdDev=0; // -S
string ExpDir=""; // -d
string ExperimentType=""; // -s
bool SuppressOutputFlag=false; // -F option sets this to true
int ChipType=-1; // -c option sets this to 0 or 1
int Normalize=0;

// The following variable QuerySnp is used in two ways:
// 1. To extract only those probesets that contain the word SNP
// (thus discarding the control probes)
// 2. To extract a specific SNP in which case the entire name needed
// e.g. QuerySnp = SNP_A-1780632 which can be obtained from the -s command line option.

string QuerySnp="SNP"; // -q
if(argc==1) {
cout << "\nUsage: " << argv[0] << Usage << endl;
exit(1);
}
string ErrorHeader="Unrecoverable Error - Aborting!\n";

// This all needs to be refactored. We can read in the values and set
// various flags in the while(){} loop BUT we cannot do any output until
// after all arguments have been read. This is so that:
// (a) the -q option can really mean quiet mode
// (b) we can output our diagnostics in a fixed order as opposed to the
// order arguments appear so we can have sensible indenting.
// The -h and -v options are exceptions and should output immediately.

while((!OptErr) && ((c = getopt (argc, argv, OptionString)) != -1)) {
switch (c) {
case 'c':
ChipType=(int)strtod(optarg, NULL); break;
case 'i':
IlluminaFiles=optarg; break;
case 'l':
CelFileName=optarg; break;
case 'f':
CdfFileName=optarg; break;
case 'q':
QuerySnp=optarg; break;
case 'e':
EnzymeName=optarg; break;
case 'n':
Normalize=1; break;
case 's':
if( strcmp(optarg, "MixtureOfInterest") &&
strcmp(optarg, "ReferencePopulation") &&
strcmp(optarg, "PeopleOfInterest") &&
strcmp(optarg, "MeanPeople")){
cout << ErrorHeader << "Sample Type (-s) must be " << "\"MixtureOfInterest\""
<< " or \"ReferencePopulation\""
<< ", or \"PeopleOfInterest\""
<< ", or \"MeanPeople\""
<< endl;

exit(1);
}
ExperimentType=optarg; break;
case 'V':
VERBOSE++;
break;
case 'F':
SuppressOutputFlag=true; break;
/*
case 'm':
ProcessMMFlag=true; break;
*/
case 'v':
cout << DEXTRACT_VERSION << "\n"
<< "Copyright 2006 Translational Genomics Research Institute."
<< endl; OptErr=1; break;
case 'L':
FilterLimit=(int)strtod(optarg, NULL);
break;
case 'S':
FilterStdDev=(int)strtod(optarg, NULL);
break;
case 'd':
ExpDir=optarg; break;
case 'h':
cout << "Usage: " << argv[0] << Usage <<
endl; OptErr=1; break;
case 'C':
FilterPercent=(int)strtod(optarg, NULL);
break;
case 'D':
FilterPercent=(int)strtod(optarg, NULL);
break;
default:
cout << "Usage: " << argv[0] << Usage <<
endl; OptErr=1;
}
}

if(OptErr) exit(1);

// Print parameters if VERBOSE is set
if ( VERBOSE >= 1 ) {
cout << "Parameters:\n";
cout << " -c ChipType " << ChipType << endl;
if ( strlen( EnzymeName.c_str() ) > 0 ) cout << " -e EnzymeType " << EnzymeName <<
endl;
if ( strlen( CelFileName.c_str() ) > 0 ) cout << " -l CELFileName " << CelFileName <<
endl;
if ( strlen( CdfFileName.c_str() ) > 0 ) cout << " -f CDFFileName " << CdfFileName <<
endl;
if ( Normalize != 0 ) cout << " -n Normalize " << Normalize << endl;
if ( strlen( ExperimentType.c_str() ) > 0 ) cout << " -s SampleType " << ExperimentType <<
endl;
if ( strlen( IlluminaFiles.c_str() ) > 0 ) cout << " -i IlluminaFiles " << IlluminaFiles <<
endl;
if ( strlen( ExpDir.c_str() ) > 0 ) cout << " -d ExpDir " << ExpDir <<
endl;
if ( FilterLimit != 0 ) cout << " -L FilterLimit " << FilterLimit <<
endl;
if ( FilterStdDev != 0 ) cout << " -S FilterStdDev " << FilterStdDev <<
endl;
if ( FilterPercent != 0) cout << " -C or -D FilterPercent " <<
FilterPercent << endl;
cout << " -V " << VERBOSE <<
endl;
// Print out any flags that were set
if ( SuppressOutputFlag || ProcessMMFlag ) {
cout << " Flags set: ";
if ( SuppressOutputFlag ) cout << "-F ";
if ( ProcessMMFlag ) cout << "-m ";
cout << endl;
}
}

// Some parameters are required regardless of chip type
if ( strlen(ExperimentType.c_str()) == 0 ) {
cout << ErrorHeader << "Option -s is required in all cases"
<< endl;

exit(1);
}
if( Normalize != 0 && Normalize != 1) {
cout << ErrorHeader << "Option -n with Normalize should be zero or one" << endl;
exit(1);
}
// Check Affymetrix parameters
if(ChipType == 0) {
if ( strlen(IlluminaFiles.c_str()) != 0 ) {
cout << ErrorHeader << "Option -i cannot be used with -c " <<
ChipType << endl;
exit(1);
}
if (FilterLimit!=0) {
cout << ErrorHeader << "Option -L cannot be used with -c " <<
ChipType << endl;
exit(1);
}
if (FilterStdDev!=0) {
cout << ErrorHeader << "Option -S cannot be used with -c " <<
ChipType << endl;
exit(1);
}
if(0 > FilterPercent || 100 < FilterPercent) {
cout << ErrorHeader << "Option -D must be between 0 and 100 -you supplied ["
<< FilterPercent << "]" << endl;
exit(1);
}

// If any of (-l -f -e) is missing then we have a problem
if (( strlen(CelFileName.c_str()) == 0 ) ||
( strlen(CdfFileName.c_str()) == 0 ) ||
( strlen(EnzymeName.c_str()) == 0 )) {
cout << ErrorHeader << "One or more of the required options (-l -f -e)\n"
<< "for Affymetrix chip data extractions is missing\n";
exit(1);
}

// If we passed all the tests then calculate the Code
DICEPHIACODE=GenerateCode(EnzymeName);
}

// Check Illumina parameters
else if (ChipType == 1) {
if ( strlen(CelFileName.c_str()) != 0 ) {
cout << ErrorHeader << "Option -l cannot be used with -c " <<
ChipType << endl;
exit(1);
}
if ( strlen(CdfFileName.c_str()) != 0 ) {
cout << ErrorHeader << "Option -d cannot be used with -c " <<
ChipType << endl;
exit(1);
}
if ( strlen(EnzymeName.c_str()) != 0 ) {
cout << ErrorHeader << "Option -t cannot be used with -c " <<
ChipType << endl;
exit(1);
}
if (ProcessMMFlag) {
cout << ErrorHeader << "Option -m cannot be used with -c " <<
ChipType << endl;
exit(1);
}
if (FilterLimit<0) {
cout << ErrorHeader << "Option -L must be a positive integer -you supplied ["
<< FilterLimit << "]" << endl;
exit(1);
}
if(FilterStdDev<0) {
cout << ErrorHeader << "Option -S must be a positive integer -you supplied ["
<< FilterStdDev << "]" << endl;
exit(1);
}
if(0 > FilterPercent || 100 < FilterPercent) {
cout << ErrorHeader << "Option -C must be between 0 and 100 -you supplied ["
<< FilterPercent << "]" << endl;
exit(1);
}

// If any of (-i) is missing then we have a problem
if (( strlen(IlluminaFiles.c_str()) == 0)) {
cout << ErrorHeader << "One or more of the required options (-i)\n"
<< "for Illumina chip data extractions is missing\n";
exit(1);
}

// If we passed all the tests then calculate the Code
DICEPHIACODE=GenerateCode("ILLUMINA");
}
else {
cout << ErrorHeader << "Must specify a ChipType 0 (Affymetrix) or 1 (Illumina) "
<< "using -c" << endl;
exit(1);
}

// Print out warnings for any parameters that were not processed
for (int index=optind;index<argc;index++) {
printf ("Ignored argument(s) (possibly) due to error(s): %s \n", argv[index]);
}

// If we got this far then we have passed all the tests for command-line parameters, so we are ready to start work!

if ( VERBOSE >= 1 ) {
cout << "Processing:\n";
cout << " Deciphia generated SerialNumber prefix: "
<< DICEPHIACODE << endl;
if (ProcessMMFlag) {
cout << " Note: Extracting mismatch values in addition to perfect "
<< "match values."
<< endl;
}
if (QuerySnp != "SNP") {
cout << " Note: Only processing SNP: " << QuerySnp << endl;
}
if (SuppressOutputFlag) {
cout << " Note: No output files written (-F flag specified)."
<< endl;
}
}

// Constructing the TGen header:
// Write MagicNumber in the first two bytes.
// Next 2 bytes set to zero for the time being. They will be populated
// later using version number etc.
int MagicNumber=MAGIC_ID;
int VersionNumber=0;
uint32_t Header=0;
Header=MagicNumber*256*256+VersionNumber*256;
Header=htonl(Header);
if ( VERBOSE >= 1 ) {
cout << " Magic Number: " << MagicNumber << endl;
cout << " Header byte: " << Header << endl;
cout << " Network order header byte: " << Header << endl;
}

if(ChipType == 0) {
/* Process Affymetrix Chips */
ExtractPMIntensities( CelFileName, CdfFileName, EnzymeName, ExperimentType, ExpDir, SuppressOutputFlag, ProcessMMFlag, QuerySnp, MagicNumber, VersionNumber, Header, ChipType, Normalize, FilterPercent);
}

else {
/* Process Illumina Chips */
ExtractGRIntensities( IlluminaFiles, ExperimentType, ExpDir, SuppressOutputFlag, QuerySnp, MagicNumber, VersionNumber, Header, ChipType, Normalize, FilterLimit, FilterStdDev, FilterPercent);
}
if (VERBOSE >= 1) cout << "Processing Completed." <<endl;
return 0;
}
unsigned int GenerateCode(string str) {
// A very simple way of creating a "code" based on the type of SNP.
// Will be used to create the SerialNumber for each SNP
// while writing Affy SNPs.
// For Illumina, all SNP names are supplied externally and this step is not needed.
int i;
unsigned int Code=0;
// cout<<"Received "<<str<<" with length "<<str.length()<<endl;
for(i=0;
i<(int)((str.length()<3)?str.length():3);
i++) {
Code+=128*(str[i]-' ');
}
cout<<"Code is "<<Code<<endl;
Code%=1000;
cout<<"Code is "<<Code<<endl;
return Code;
}
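
The code generation above can be illustrated with a small standalone C sketch. It is not the listing's function: the name generate_code is hypothetical, the diagnostic printing is omitted, and the input is assumed to contain only printable ASCII characters (so that each character minus the space character is non-negative).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C counterpart of GenerateCode: sum 128 * (c - ' ') over at
   most the first three characters, then reduce modulo 1000.
   Assumption: str holds printable ASCII only. */
unsigned int generate_code(const char *str)
{
    size_t n = strlen(str);
    unsigned int code = 0;
    size_t i;

    if (n > 3) n = 3;                 /* only the first three characters count */
    for (i = 0; i < n; i++) {
        code += 128u * (unsigned int)(str[i] - ' ');
    }
    return code % 1000u;              /* keep the prefix to three decimal digits */
}
```

For the string "ILLUMINA" the first three characters contribute 41, 44 and 44, so the sum is 128 * 129 = 16512 and the resulting code is 512. Because only three characters are used and the result is reduced modulo 1000, different names can map to the same code.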


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

/* Based on GNU/Linux and glibc or not, argp.h may or may not be available.
If it is not, fall back to getopt. Also see #ifdefs in the ParseInput.h file. */

#ifdef HAVE_ARGP_H
#include <argp.h>
#define OPTARG arg
#elif defined HAVE_UNISTD_H
#include <unistd.h>
#define OPTARG optarg
#else
#include "GetOpt.h"
#define OPTARG optarg
#endif
#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h> /* For u_int etc. */
#endif
#ifdef HAVE_SYS_TIME_H
#include <sys/time.h> /* For Mac OS X with resource.h */
#endif
#ifdef HAVE_RESOURCE_H
#include <sys/resource.h>
#endif
#include "ParseInput.h"
#include "Analyze.h"

const char *argp_program_version =
"danalyze version 0.1.1\n"
"Copyright 2007.";

const char *argp_program_bug_address =

"Nils Homer <nhomer@tgen.org>";

/*
OPTIONS. Field 1 in ARGP.
Order of fields: {NAME, KEY, ARG, FLAGS, DOC, OPTIONAL_GROUP_NAME}.
*/
enum {
DescInputFilesTitle, DescMixtureOfInterestFileName, DescReferencePopulationListFileName, DescPeopleOfInterestListFileName, DescSnpNamesFileName, DescMeanPeopleListFileName, DescAlgoTitle, DescTestStatistic, DescDistanceMeasure, DescCorrelationXDistance, DescCorrelationYDistance, DescNormalizeChips, DescNormalizeTestStatistic, DescOutputTitle, DescOutputFileName, DescPrintSummary, DescMiscTitle, DescDisplayDistanceMeasures, DescHelp };

/* The prototype for argp_option comes from argp.h. If argp.h is absent, then ParseInput.h declares it. */
static struct argp_option options[] =
{
{0, 0, 0, 0, "Input Files ----------------------------------------------------", 1},
{"MixtureOfInterestFileName", 'i', "MixtureOfInterestFileName", 0, "Specifies the Mixture of Interest (Currently only one file)", 1},
{"ReferencePopulationListFileName", 'r', "ReferencePopulationListFilename", 0, "Specifies the Reference Population List of Files", 1},
{"PeopleOfInterestListFileName", 'p', "PeopleOfInterestListFileName", 0, "Specifies the People of Interest List of Files", 1},
{"SnpNamesFileName", 'n', "SnpNamesFileName", 0, "Specifies the Snp Names File (required)", 1},
{"MeanPeopleListFileName", 'm', "MeanPeopleListFileName", 0, "Specifies the People to use as the mean when normalizing (List of Files)", 1},
{0, 0, 0, 0, "Algorithm Options: (Unless specified, default value = 1)", 2},
{"TestStatistic", 't', "TestStatistic", 0, "1: Mean Difference 2: Correlation (Pearson) 3: Correlation (Spearman) 4: Wilcoxon Sign Rank Test", 2},
{"DistanceMeasure", 'd', "DistanceMeasure", 0, "Options 1-6 (Used with Test Statistics 1, 4). Use -D to see all options", 2},
{"CorrelationXDistance", 'x', "CorrelationXDistance", 0, "Options 1-11 (Used with Test Statistics 2, 3). Use -D to see all options", 2},
{"CorrelationYDistance", 'y', "CorrelationYDistance", 0, "Options 1-11 (Used with Test Statistics 2, 3). Use -D to see all options", 2},
{"NormalizeChips", 'c', 0, OPTION_NO_USAGE, "Normalize all raw chip files but not genotyped data (default False)", 2},
{"NormalizeTestStatistic", 'z', 0, OPTION_NO_USAGE, "Normalize Test Statistic based on reference population distribution (default False)", 2},
{0, 0, 0, 0, "Output Options -------------------------------------------------", 3},
{"OutputFileName", 'o', "OutputFileName", 0, "Specifies the output file name", 3},
{"PrintSummary", 's', "PrintSummary", 0, "Print a summary of Results", 3},
{0, 0, 0, 0, "Miscellaneous Options ---------------------------------", 4},
{"DisplayDistanceMeasures", 'D', 0, OPTION_NO_USAGE, "Displays the different options for use with DistanceMeasure, CorrelationXDistance, and CorrelationYDistance", 4},
{"Help", 'h', "Help", OPTION_NO_USAGE, "Display usage summary", 4},
{0, 0, 0, 0, 0, 0}
};

/*
ARGS_DOC. Field 3 in ARGP.
A description of the non-option command-line arguments that we accept.
Not complete yet. So empty string.
*/
static char args_doc[] = "";

/*
DOC. Field 4 in ARGP. Program documentation.
*/
static char doc[] ="This program was created by Nils Homer and is not intended for distribution.";

#ifdef HAVE_ARGP_H
/*
The ARGP structure itself.
*/
static struct argp argp = {options, parse_opt, args_doc, doc};
#else
/* argp.h support not available! Fall back to getopt */
static char OptionString[]=
"d:i:m:n:o:p:r:s:t:x:y:chzD";
#endif
enum {ExecuteDisplayDistanceMeasures, ExecuteGetOptHelp, ExecuteProgram};

/* The main function. All command-line options are parsed using argp_parse or getopt, whichever is available. */
int main (int argc, char **argv) {
struct arguments arguments;
if(argc>1) {
/* Set argument defaults. (overridden if user specifies them) */
AssignDefaultValues(&arguments);
/* Parse command line args */
#ifdef HAVE_ARGP_H
if(argp_parse(&argp, argc, argv, 0, 0, &arguments)==0)
#else
if(getopt_parse(argc, argv, OptionString, &arguments)==0)
#endif
{
switch(arguments.ProgramMode) {
case ExecuteDisplayDistanceMeasures:
PrintDistanceMeasures(stderr);
break;
case ExecuteGetOptHelp:
PrintProgramParameters(stderr, &arguments);
break;
case ExecuteProgram:
if(ValidateInputs(&arguments)) {
fprintf(stderr, "**** Input arguments look good! *****\n");
fprintf(stderr, BREAK_LINE);
}
else {
fprintf(stderr, "PrintError validating command-line inputs. Terminating!\n");
exit(1);
}

PrintProgramParameters(stderr, &arguments);

/* Execute Program */
Analyze(arguments.MixtureOfInterestFileName, arguments.ReferencePopulationListFileName, arguments.PeopleOfInterestListFileName, arguments.TestStatistic, arguments.NormalizeTestStatistic, arguments.SnpNamesFileName, arguments.NormalizeChips, arguments.MeanNormalize, arguments.MeanPeopleListFileName, arguments.OutputFileName, arguments.DistanceMeasure, arguments.CorrelationXDistance, arguments.CorrelationYDistance, arguments.PrintSummary);
break;
default:
fprintf(stderr, "PrintError determining program mode. Terminating!\n");
exit(1);
}
}
else {
fprintf(stderr, "PrintError parsing command line arguments!\n");
exit(1);
}
}
else {
GetOptHelp();
#ifdef HAVE_ARGP_H
/* fprintf(stderr, "Type \"%s --help\" to see usage\n", argv[0]); */
#else
/* fprintf(stderr, "Type \"%s -h\" to see usage\n", argv[0]); */
#endif
}
return 0;
}

int ValidateInputs(struct arguments *CommandLineArg) {
char *FnName="ValidateInputs";

fprintf(stderr, BREAK_LINE);
fprintf(stderr, "Checking input parameters supplied by the user ...\n");

if((*CommandLineArg).MixtureOfInterestFileName!=0) {
fprintf(stderr, "Validating MixtureOfInterestFileName filename %s. \n", (*CommandLineArg).MixtureOfInterestFileName);
if(ValidateFileName((*CommandLineArg).MixtureOfInterestFileName)==0) PrintError(FnName, "MixtureOfInterestFileName", "Command line argument", 3, 0);
}
if((*CommandLineArg).ReferencePopulationListFileName!=0) {
fprintf(stderr, "Validating ReferencePopulationListFileName filename %s. \n", (*CommandLineArg).ReferencePopulationListFileName);
if(ValidateFileName((*CommandLineArg).ReferencePopulationListFileName)==0) PrintError(FnName, "ReferencePopulationListFileName", "Command line argument", 3, 0);
}

if((*CommandLineArg).PeopleOfInterestListFileName!=0) {
fprintf(stderr, "Validating PeopleOfInterestListFileName filename %s. \n", (*CommandLineArg).PeopleOfInterestListFileName);
if(ValidateFileName((*CommandLineArg).PeopleOfInterestListFileName)==0) PrintError(FnName, "PeopleOfInterestListFileName", "Command line argument", 3, 0);
}
if((*CommandLineArg).TestStatistic < MIN_TEST_STATISTIC || (*CommandLineArg).TestStatistic > MAX_TEST_STATISTIC) {
PrintError(FnName, "TestStatistic", "Command line argument", 2, 0);
}
if((*CommandLineArg).NormalizeTestStatistic < 0 ||
(*CommandLineArg).NormalizeTestStatistic > 1) {
PrintError(FnName, "NormalizeTestStatistic", "Command line argument", 2, 0);
}
if((*CommandLineArg).SnpNamesFileName!=0) {
fprintf(stderr, "Validating SnpNamesFileName path %s. \n", (*CommandLineArg).SnpNamesFileName);
if(ValidateFileName((*CommandLineArg).SnpNamesFileName)==0) PrintError(FnName, "SnpNamesFileName", "Command line argument", 3, 0);
}
if((*CommandLineArg).MeanNormalize!=0 &&
(*CommandLineArg).MeanNormalize!=1) {
PrintError(FnName, "MeanNormalize", "Command line argument", 3, 0);
}
if((*CommandLineArg).MeanPeopleListFileName!=0) {
fprintf(stderr, "Validating MeanPeopleListFileName path %s. \n", (*CommandLineArg).MeanPeopleListFileName);
if(ValidateFileName((*CommandLineArg).MeanPeopleListFileName)==0) PrintError(FnName, "MeanPeopleListFileName", "Command line argument", 3, 0);
}
if((*CommandLineArg).DistanceMeasure < MIN_DISTANCE_MEASURE || (*CommandLineArg).DistanceMeasure > MAX_DISTANCE_MEASURE) {
if((*CommandLineArg).TestStatistic == 1 ||
(*CommandLineArg).TestStatistic == 4) {
PrintError(FnName, "DistanceMeasure", "Command line argument", 2, 0);
}
}
if((*CommandLineArg).CorrelationXDistance < MIN_CORRELATION_DISTANCE ||
(*CommandLineArg).CorrelationXDistance > MAX_CORRELATION_DISTANCE) {
if((*CommandLineArg).TestStatistic == 2 ||
(*CommandLineArg).TestStatistic == 3) {
PrintError(FnName, "CorrelationXDistance", "Command line argument", 2, 0);
}
}

if((*CommandLineArg).CorrelationYDistance < MIN_CORRELATION_DISTANCE ||
(*CommandLineArg).CorrelationYDistance > MAX_CORRELATION_DISTANCE) {
if((*CommandLineArg).TestStatistic == 2 ||
(*CommandLineArg).TestStatistic == 3) {
PrintError(FnName, "CorrelationYDistance", "Command line argument", 2, 0);
}
}

if((*CommandLineArg).NormalizeChips!=0 &&
(*CommandLineArg).NormalizeChips!=1) {
PrintError(FnName, "NormalizeChips", "Command line argument", 2, 0);
}
if((*CommandLineArg).OutputFileName!=0) {
fprintf(stderr, "Validating OutputFileName path %s. \n", (*CommandLineArg).OutputFileName);
if(ValidateFileName((*CommandLineArg).OutputFileName)==0) PrintError(FnName, "OutputFileName", "Command line argument", 3, 0);
}
if((*CommandLineArg).PrintSummary != 0 &&
(*CommandLineArg).PrintSummary != 1) {
PrintError(FnName, "PrintSummary", "Command line argument", 3, 0);
}
return 1;
}

int ValidateFileName(char *Name) {
/*
Checking that strings are good: FileName = [a-zA-Z_0-9][a-zA-Z0-9-.]+
FileName can start with only [a-zA-Z_0-9]
*/
char *ptr=Name;
int counter=0;
/* fprintf(stderr, "Validating FileName %s with length %d\n", ptr, strlen(Name)); */

assert(ptr!=0);
while(*ptr) {
if((isalnum(*ptr) || (*ptr=='_') || (*ptr=='+') ||
((*ptr=='.') /* && (counter>0)*/) || /* FileNames can't start with . or - */
((*ptr=='/')) || /* Make sure that we can navigate through folders */
((*ptr=='-') && (counter>0)))) {
ptr++;
counter++;
}
else return 0;
}
return 1;
}
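
The character rules applied by ValidateFileName can be restated as a compact sketch. The function name is_valid_file_name is hypothetical, and the sketch follows the code as listed rather than the comment: a leading '.' is accepted because that check is commented out, a leading '-' is rejected, and an empty string passes.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical restatement of the checks in ValidateFileName.
   Accepts letters, digits, '_', '+', '.', and '/' anywhere,
   and '-' anywhere except the first position. Returns 1 if valid. */
int is_valid_file_name(const char *name)
{
    size_t i;

    if (name == NULL) return 0;       /* the listing asserts instead */
    for (i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];
        if (isalnum(c) || c == '_' || c == '+' || c == '.' || c == '/') continue;
        if (c == '-' && i > 0) continue;
        return 0;                     /* any other character is rejected */
    }
    return 1;
}
```

The cast to unsigned char keeps the isalnum call well defined for bytes outside the ASCII range. Note that the check is purely lexical: it does not test whether the file exists or can be opened.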

void AssignDefaultValues(struct arguments *args) {
/* Assign default values */
(*args).ProgramMode = ExecuteProgram;
(*args).MixtureOfInterestFileName=
(char*)malloc(sizeof(DEFAULT_FILENAME));
assert((*args).MixtureOfInterestFileName!=0);
strcpy((*args).MixtureOfInterestFileName, DEFAULT_FILENAME);
(*args). ReferencePopulationListFileName=
(char*)malloc(sizeof(DEFAULT_FILENAME));
assert((*args). ReferencePopulationListFileName!=0);

strcpy((*args).ReferencePopulationListFileName, DEFAULT_FILENAME);

(*args). PeopleOfInterestListFileName=
(char*)malloc(sizeof(DEFAULT_FILENAME));
assert((*args). PeopleOfInterestListFileName!=0);
strcpy((*args). PeopleOfInterestListFileName, DEFAULT_FILENAME);

(*args). SnpNamesFileName =
(char*)malloc(sizeof(DEFAULT_FILENAME));
assert((*args). SnpNamesFileName!=0);
strcpy((*args).SnpNamesFileName, DEFAULT_FILENAME);
(*args).MeanNormalize=0;
(*args). MeanPeopleListFileName=
(char*)malloc(sizeof(DEFAULT_FILENAME));
assert((*args).MeanPeopleListFileName!=0);
strcpy((*args). MeanPeopleListFileName, DEFAULT_FILENAME);
(*args).TestStatistic = 0;
(*args).DistanceMeasure = 0;
(*args).CorrelationXDistance = 0;
(*args).CorrelationYDistance = 0;
(*args).NormalizeChips = 0;
(*args).NormalizeTestStatistic = 0;
(*args).OutputFileName =
(char*)malloc(sizeof(DEFAULT_FILENAME));
assert((*args). OutputFileName!=0);
strcpy((*args).OutputFileName, DEFAULT_FILENAME);
(*args).PrintSummary = 0;

return;
}

void PrintProgramParameters(FILE* fp, struct arguments *args) {
char truefalse[2][16] = {"False", "True"};
char programmode[3][64] = {"ExecuteDisplayDistanceMeasures", "ExecuteGetOptHelp", "ExecuteProgram"};
fprintf(fp, BREAK_LINE);
fprintf(fp, "Printing Program Parameters:\n");
fprintf(fp, "ProgramMode:\t\t\t\t%d\t[%s]\n", (*args). ProgramMode, programmode[(*args). ProgramMode]);
fprintf(fp, "MixtureOfInterestFileName:\t\t%s\n", (*args).MixtureOfInterestFileName);
fprintf(fp, "ReferencePopulationListFileName:\t%s\n", (*args).ReferencePopulationListFileName);
fprintf(fp, "PeopleOfInterestListFileName:\t\t%s\n", (*args). PeopleOfInterestListFileName);
fprintf(fp, "SnpNamesFileName:\t\t\t%s\n", (*args). SnpNamesFileName);
fprintf(fp, "MeanPeopleListFileName:\t\t\t%s\t[Will Use: %s]\n", (*args).MeanPeopleListFileName, truefalse[(*args).MeanNormalize]);
fprintf(fp, "TestStatistic:\t\t\t\t%d\n", (*args).TestStatistic);
fprintf(fp, "DistanceMeasure:\t\t\t%d\n", (*args).DistanceMeasure);
fprintf(fp, "CorrelationXDistance:\t\t\t%d\n", (*args). CorrelationXDistance);
fprintf(fp, "CorrelationYDistance:\t\t\t%d\n", (*args). CorrelationYDistance);
fprintf(fp, "NormalizeChips:\t\t\t\t%d\n", (*args). NormalizeChips);
fprintf(fp, "NormalizeTestStatistic:\t\t\t%d\n", (*args). NormalizeTestStatistic);
fprintf(fp, "OutputFileName:\t\t\t\t%s\n", (*args).OutputFileName);
fprintf(fp, "PrintSummary:\t\t\t\t%d\n", (*args).PrintSummary);
fprintf(fp, BREAK_LINE);
return;
}
void PrintDistanceMeasures(FILE* fp) {
fprintf(fp, BREAK_LINE);
fprintf(fp, "Definitions:\n");
fprintf(fp, "Freq = Person of Interest RAS value\n");
fprintf(fp, "MixtureFreq = Mixture RAS value\n");
fprintf(fp, "PopulationMean = Population Mean RAS value\n");
fprintf(fp, "MixtureDiff = |Freq - MixtureFreq|\n");
fprintf(fp, "PopDiff = |Freq - PopulationMean|\n");
fprintf(fp, BREAK_LINE);

fprintf(fp, BREAK_LINE);
fprintf(fp, "Printing Distance Measures for -d (DistanceMeasures) \n");
fprintf(fp, "1: = PopDiff - MixtureDiff\n");
fprintf(fp, "2: signed version of option 1 (either one, zero or negative one)\n");
fprintf(fp, "3: = (PopDiff - MixtureDiff)/MixtureFreq (assuming MixtureFreq < 0.5) otherwise (PopDiff - MixtureDiff)/(1-MixtureFreq)\n");
fprintf(fp, "4: = (PopDiff - MixtureDiff)/(MixtureFreq*(1.0-MixtureFreq))\n");
fprintf(fp, "5: = Freq - MixtureFreq (note: no absolute values) \n");
fprintf(fp, "6: = |Freq - MixtureFreq|\n");
fprintf(fp, BREAK_LINE);

fprintf(fp, BREAK_LINE);
fprintf(fp, "Printing Distance Measures for -x (CorrelationXDistance) and -y (CorrelationYDistance)\n");
fprintf(fp, "1: MixtureFreq - PopulationMean\n");
fprintf(fp, "2: signed version of option 1 (either one, zero or negative one)\n");
fprintf(fp, "3: Freq - PopulationMean\n");
fprintf(fp, "4: signed version of option 3 (either one, zero or negative one)\n");
fprintf(fp, "5: PopulationMean - MixtureFreq\n");
fprintf(fp, "6: signed version of option 5 (either one, zero or negative one)\n");
fprintf(fp, "7: Freq - MixtureFreq\n");
fprintf(fp, "8: signed version of option 7 (either one, zero or negative one)\n");
fprintf(fp, "9: Freq\n");
fprintf(fp, "10: MixtureFreq\n");
fprintf(fp, "11: PopulationMean\n");
fprintf(fp, BREAK_LINE);
}
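
The six -d distance measures printed above can be written out as a short sketch for one SNP. The function name distance_measure and its parameter names are hypothetical; the formulas follow the help text only, and the sketch assumes MixtureFreq lies strictly between 0 and 1 so that options 3 and 4 do not divide by zero.

```c
/* Absolute value helper, to avoid depending on the math library. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical per-SNP evaluation of distance measures 1-6 as described in
   the help text. Assumption: 0 < mixture_freq < 1 for options 3 and 4.
   Unknown options return 0.0 in this sketch. */
double distance_measure(int option, double freq, double mixture_freq,
                        double population_mean)
{
    double mixture_diff = abs_d(freq - mixture_freq);     /* |Freq - MixtureFreq| */
    double pop_diff = abs_d(freq - population_mean);      /* |Freq - PopulationMean| */
    double d = pop_diff - mixture_diff;

    switch (option) {
    case 1: return d;
    case 2: return (double)((d > 0.0) - (d < 0.0));       /* sign of option 1 */
    case 3: return d / (mixture_freq < 0.5 ? mixture_freq : 1.0 - mixture_freq);
    case 4: return d / (mixture_freq * (1.0 - mixture_freq));
    case 5: return freq - mixture_freq;                   /* signed difference */
    case 6: return mixture_diff;
    default: return 0.0;
    }
}
```

As a worked case, with Freq = 1.0, MixtureFreq = 0.25 and PopulationMean = 0.5, MixtureDiff is 0.75 and PopDiff is 0.5, so option 1 gives -0.25, option 2 gives -1, and option 3 gives -0.25 / 0.25 = -1.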
void GetOptHelp() {

struct argp_option *a=options;
fprintf(stderr, "\nUsage: danalyze [options]\n");
while((*a).group>0) {
switch((*a).key) {
case 0:
fprintf(stderr, "\n%s\n", (*a).doc); break;
default:
fprintf(stderr, "-%c\t%12s\t%s\n", (*a).key, (*a).arg, (*a).doc); break;
}
a++;
}
return;
}

void PrintSysteminfo() {

fprintf(stderr, "Machine Endianess (0: little, 1: big) = %d\n", GetEndian());
#ifdef HAVE_RESOURCE_H
fprintf(stderr, "uname -a output is \n");
system("uname -a");
#endif
fprintf(stderr, "Size of integer=%d. If this is not 4, send email to %s.\n", (int)sizeof(int), argp_program_bug_address);
fprintf(stderr, "Size of unsigned short int=%d.\nIf this is not 2, send email to %s.\n", (int)sizeof(short int), argp_program_bug_address);
#ifdef HAVE_CONFIG_H
fprintf(stderr, "Integer sizes in bytes : \nlong int = %d\nint = %d \nshort int = %d\n", SIZEOF_LONG_INT, SIZEOF_INT, SIZEOF_SHORT_INT);
#endif
}
void PrintGetOptHelp() {

struct argp_option *a=options;
fprintf(stderr, "%s\n", argp_program_version);
fprintf(stderr, "\nUsage: danalyze [options]\n");
while((*a).group>0) {
switch((*a).key) {
case 0:
fprintf(stderr, "\n%s\n", (*a).doc); break;
default:
fprintf(stderr, "-%c\t%12s\t%s\n", (*a).key, (*a).arg, (*a).doc); break;
}
a++;
}
fprintf(stderr, "\n%s\n", argp_program_bug_address);
return;
}
#ifdef HAVE_ARGP_H
static error_t parse_opt (int key, char *arg, struct argp_state *state) {
struct arguments *arguments = state->input;
#else
int getopt_parse(int argc, char** argv, char OptionString[], struct arguments* arguments) {
int key; /* int rather than char, so that the comparison with -1 is reliable */
int OptErr=0;
while((OptErr==0) && ((key = getopt (argc, argv, OptionString)) != -1)) {
/*
fprintf(stderr, "Key is %c and OptErr = %d\n", key, OptErr);
*/
#endif
switch (key) {
case 'c':
arguments->NormalizeChips=1;break;
case 'd':
arguments->DistanceMeasure=atoi(OPTARG);break;
case 'h':
arguments->ProgramMode=ExecuteGetOptHelp; break;
case 'i':
if(arguments->MixtureOfInterestFileName) free(arguments->MixtureOfInterestFileName);
arguments->MixtureOfInterestFileName=OPTARG;break;
case 'm':
arguments->MeanNormalize=1;
if(arguments->MeanPeopleListFileName) free(arguments->MeanPeopleListFileName);
arguments->MeanPeopleListFileName=OPTARG; break;
case 'n':
if(arguments->SnpNamesFileName) free(arguments->SnpNamesFileName);
arguments->SnpNamesFileName=OPTARG; break;
case 'o':
if(arguments->OutputFileName) free(arguments->OutputFileName);
arguments->OutputFileName =
OPTARG;break;
case 'p':
if(arguments->PeopleOfInterestListFileName) free(arguments->PeopleOfInterestListFileName);
arguments->PeopleOfInterestListFileName=OPTARG; break;
case 'r':
if(arguments->ReferencePopulationListFileName) free(arguments->ReferencePopulationListFileName);
arguments->ReferencePopulationListFileName=OPTARG; break;
case 's':
arguments->PrintSummary=atoi(OPTARG);break;
case 't':
arguments->TestStatistic=atoi(OPTARG); break;
case 'x':
arguments->CorrelationXDistance=atoi(OPTARG);break;
case 'y':
arguments->CorrelationYDistance=atoi(OPTARG); break;
case 'z':
arguments->NormalizeTestStatistic=1; break;
case 'D':
arguments->ProgramMode=ExecuteDisplayDistanceMeasures;break;
default:
#ifdef HAVE_ARGP_H
return ARGP_ERR_UNKNOWN;
} /* switch */
return 0;
#else
OptErr=1;
} /* switch */
} /* while */
return OptErr;
#endif
}

#include "utils.h"

// Globals defined elsewhere
extern int VERBOSE;

void InitializeHeader(FILE * CurrentOutputFile, uint32_t Header, uint32_t SNPCount, int ChipType, bool ProcessMMFlag, bool SingleSnpMode, int Normalize) {
uint32_t tempChipType = htonl( ((uint32_t)ChipType) );
uint32_t tempProcessMMFlag = htonl( ((uint32_t)ProcessMMFlag) );
uint32_t tempSingleSnpMode = htonl( ((uint32_t)SingleSnpMode) );
SNPCount=htonl(SNPCount);
uint32_t tempNormalize = htonl( ((uint32_t)Normalize) );
uint32_t tempAverageChannel1 = htonl((uint32_t)0);
uint32_t tempAverageChannel2 = htonl((uint32_t)0);

if ( VERBOSE >= 1 ) cout<<" Single Snp Mode: " << SingleSnpMode << endl;
// Writing the first byte of the header
fwrite(&Header, sizeof(uint32_t), 1, CurrentOutputFile);
// Writing the second byte = zero for the time being. Once all SNPs are seen, this will store the number of SNPs
fwrite(&SNPCount, sizeof(uint32_t), 1, CurrentOutputFile);
// Writing the third byte. This will store the type of chip that this file corresponds to.
fwrite(&tempChipType, sizeof(uint32_t), 1, CurrentOutputFile);
// Writing the fourth byte. This will store whether Mismatch Values were included.
// This is relevant only for Affymetrix chips.
fwrite(&tempProcessMMFlag, sizeof(uint32_t), 1, CurrentOutputFile);
// Writing the fifth byte. This will store whether we processed only one snp.
fwrite(&tempSingleSnpMode, sizeof(uint32_t), 1, CurrentOutputFile);
// Writing the sixth byte. This will store whether normalization has been performed.
fwrite(&tempNormalize, sizeof(uint32_t), 1, CurrentOutputFile);
// Writing the seventh byte. This will store the average intensity of Channel 1.
fwrite(&tempAverageChannel1, sizeof(uint32_t), 1, CurrentOutputFile);
// Writing the eighth byte. This will store the average intensity of Channel 2.
fwrite(&tempAverageChannel2, sizeof(uint32_t), 1, CurrentOutputFile);
}
void WriteResultsToHeader(FILE * CurrentOutputFile, uint32_t Header, uint32_t SNPCount, uint32_t AverageChannel1, uint32_t AverageChannel2) {
if (VERBOSE >= 2) {
cout << " Size of uint16_t: "
<< sizeof(uint16_t) << " bytes\n";
cout << " Size of unsigned short int: "
<< sizeof(unsigned short int) << " bytes\n";
cout << " Size of uint32_t: "
<< sizeof(uint32_t)<<" bytes\n";
cout << " Size of unsigned int: "
<< sizeof(unsigned int) << " bytes\n";
}

// Before closing the file, fill in the second byte of the db file.
// Set pointer to the byte immediately after the Header variable
fseek(CurrentOutputFile, (long int)sizeof(Header), SEEK_SET);
if (VERBOSE >= 1) cout<<" Writing SnpCount to byte "<<ftell(CurrentOutputFile) <<" of binary file"<<endl;

// Now write the SNPCount here. So every db file has the number of SNPs stored in the second byte
if (VERBOSE >= 1) cout << " Number of SNPs seen: "<<SNPCount<<endl;
SNPCount=htonl(SNPCount);
fwrite(&SNPCount, sizeof(uint32_t), 1, CurrentOutputFile);
if (VERBOSE >= 1) cout << " Average Channel 1: " << AverageChannel1;
AverageChannel1=htonl(AverageChannel1);
fseek(CurrentOutputFile, (long int)(sizeof(uint32_t)*6), SEEK_SET);
fwrite(&AverageChannel1, sizeof(uint32_t), 1, CurrentOutputFile);
if (VERBOSE >= 1) cout << " Average Channel 2: " << AverageChannel2;
AverageChannel2=htonl(AverageChannel2);
fseek(CurrentOutputFile, (long int)(sizeof(uint32_t)*7), SEEK_SET);
fwrite(&AverageChannel2, sizeof(uint32_t), 1, CurrentOutputFile);
}
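
The header written by the two functions above consists of eight 32-bit fields stored in network byte order (the comments call each field a "byte"). A reader for that layout might look like the following sketch. The function names and the HEADER_WORDS constant are hypothetical, and unlike the listing the sketch converts every field at write time, whereas the listing converts Header earlier in main. It assumes a POSIX system providing htonl and ntohl in arpa/inet.h.

```c
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */

#define HEADER_WORDS 8   /* header, SNP count, chip type, mismatch flag,
                            single-SNP mode, normalize, avg ch. 1, avg ch. 2 */

/* Host-order header word: magic number in the upper two bytes and
   version number in the third byte, as in the listing's main(). */
uint32_t pack_header(uint32_t magic, uint32_t version)
{
    return magic * 256u * 256u + version * 256u;
}

/* Writes eight host-order fields in network byte order. Returns 1 on success. */
int write_header_words(FILE *fp, const uint32_t host[HEADER_WORDS])
{
    int i;
    for (i = 0; i < HEADER_WORDS; i++) {
        uint32_t be = htonl(host[i]);
        if (fwrite(&be, sizeof be, 1, fp) != 1) return 0;
    }
    return 1;
}

/* Reads the eight fields from the start of the file back into host order. */
int read_header_words(FILE *fp, uint32_t host[HEADER_WORDS])
{
    int i;
    if (fseek(fp, 0L, SEEK_SET) != 0) return 0;
    for (i = 0; i < HEADER_WORDS; i++) {
        uint32_t be;
        if (fread(&be, sizeof be, 1, fp) != 1) return 0;
        host[i] = ntohl(be);
    }
    return 1;
}
```

Storing the fields in network byte order means a file written on a little-endian machine can be read unchanged on a big-endian one. The fixed offsets also explain the fseek calls in WriteResultsToHeader: field k starts at byte k * sizeof(uint32_t), so the two channel averages sit at offsets 24 and 28.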
void sort(int * v, int low, int high, int Add) {
/* MergeSort! */
int mid = (low + high)/2;
int start_upper = mid + 1;
int end_upper = high;
int start_lower = low;
int end_lower = mid;
int ctr, i;
int * temp_entries;
char Rotate[]="II**////++--**\\++";
if(low >= high) {
if(low%1000==0) {
fprintf(stderr, "\b\b\b\b\b\b\b\b\b\b\b\b%c %10d", Rotate[(low+1)/100%16], Add+low+1);
}
return;
}

/* Partition the list into two lists and then sort them recursively */
sort(v, low, mid, Add);
sort(v, mid+1, high, Add);

temp_entries = (int*)malloc(sizeof(int)*(high-low+1));
/* Merge the two lists */
ctr = 0;
while( (start_lower<=end_lower) && (start_upper<=end_upper) ) {
if(v[start_lower] <= v[start_upper]) {
temp_entries[ctr] = v[start_lower];
start_lower++;
}
else {
temp_entries[ctr] = v[start_upper];
start_upper++;
}
ctr++;
}

if(start_lower<=end_lower) {
while(start_lower<=end_lower) {
temp_entries[ctr] = v[start_lower];
ctr++;
start_lower++;
}
}
else {
while(start_upper<=end_upper) {
temp_entries[ctr] = v[start_upper];
ctr++;
start_upper++;
}
}
for(i=low, ctr=0;i<=high;i++, ctr++) {
v[i] = temp_entries[ctr];
}
free(temp_entries);
}

/* -*- C -*-
 * Mathlib : A C Library of Special Functions
 * Copyright (C) 1998-2003 The R Development Core Team
 * Copyright (C) 2004 The R Foundation
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
 *
 * Rmath.h should contain ALL headers from R's C code in 'src/nmath'
 * ------- such that ''the Math library'' can be used by simply
 *
 * #include <Rmath.h>
 *
 * and nothing else.
 */
#ifndef RMATH_H
#define RMATH_H
#ifdef __cplusplus
extern "C" {
#endif
/*-- Mathlib as part of R -- define this for standalone : */
/* #undef MATHLIB_STANDALONE */
#define MATHLIB_STANDALONE 1
#define R_VERSION_STRING "2.4.1"
#ifndef HAVE_LOG1P
# define HAVE_LOG1P 1
#endif
#ifndef HAVE_EXPM1
# define HAVE_EXPM1 1
#endif
#ifndef HAVE_WORKING_LOG1P
# define HAVE_WORKING_LOG1P 1
#endif
#ifndef HAVE_WORKING_LOG
# define HAVE_WORKING_LOG 1
#endif
#include <errno.h>
#include <limits.h>
#include <float.h>
#include <math.h>

#if defined(HAVE_LOG1P) && !defined(HAVE_WORKING_LOG1P)
/* remap to avoid problems with getting the right entry point */
double Rlog1p(double);
#define log1p Rlog1p
#endif
#include <stdlib.h>

/* Undo SGI Madness */
#ifdef ftrunc
# undef ftrunc
#endif
#ifdef qexp
# undef qexp
#endif
#ifdef qgamma
# undef qgamma
#endif

/* ----- The following constants and entry points are part of the R API ---- */

/* 30 Decimal-place constants */
/* Computed with bc -l (scale=32; proper round) */

/* SVID & X/Open Constants */
/* Names from Solaris math.h */
#ifndef M_E
#define M_E 2.718281828459045235360287471353 /* e */
#endif
#ifndef M_LOG2E
#define M_LOG2E 1.442695040888963407359924681002 /* log2(e) */
#endif
#ifndef M_LOG10E
#define M_LOG10E 0.434294481903251827651128918917 /* log10(e) */
#endif
#ifndef M_LN2
#define M_LN2 0.693147180559945309417232121458 /* ln(2) */
#endif
#ifndef M_LN10
#define M_LN10 2.302585092994045684017991454684 /* ln(10) */
#endif
#ifndef M_PI
#define M_PI 3.141592653589793238462643383280 /* pi */
#endif
#ifndef M_2PI
#define M_2PI 6.283185307179586476925286766559 /* 2*pi */
#endif
#ifndef M_PI_2
#define M_PI_2 1.570796326794896619231321691640 /* pi/2 */
#endif
#ifndef M_PI_4
#define M_PI_4 0.785398163397448309615660845820 /* pi/4 */
#endif
#ifndef M_1_PI
#define M_1_PI 0.318309886183790671537767526745 /* 1/pi */
#endif
#ifndef M_2_PI
#define M_2_PI 0.636619772367581343075535053490 /* 2/pi */
#endif
#ifndef M_2_SQRTPI
#define M_2_SQRTPI 1.128379167095512573896158903122 /* 2/sqrt(pi) */
#endif
#ifndef M_SQRT2
#define M_SQRT2 1.414213562373095048801688724210 /* sqrt(2) */
#endif
#ifndef M_SQRT1_2
#define M_SQRT1_2 0.707106781186547524400844362105 /* 1/sqrt(2) */
#endif

/* R-Specific Constants */
#ifndef M_SQRT_3
#define M_SQRT_3 1.732050807568877293527446341506 /* sqrt(3) */
#endif
#ifndef M_SQRT_32
#define M_SQRT_32 5.656854249492380195206754896838 /* sqrt(32) */
#endif
#ifndef M_LOG10_2
#define M_LOG10_2 0.301029995663981195213738894724 /* log10(2) */
#endif
#ifndef M_SQRT_PI
#define M_SQRT_PI 1.772453850905516027298167483341 /* sqrt(pi) */
#endif
#ifndef M_1_SQRT_2PI
#define M_1_SQRT_2PI 0.398942280401432677939946059934 /* 1/sqrt(2pi) */
#endif
#ifndef M_SQRT_2dPI
#define M_SQRT_2dPI 0.797884560802865355879892119869 /* sqrt(2/pi) */
#endif
#ifndef M_LN_SQRT_PI
#define M_LN_SQRT_PI 0.572364942924700087071713675677 /* log(sqrt(pi)) */
#endif
#ifndef M_LN_SQRT_2PI
#define M_LN_SQRT_2PI 0.918938533204672741780329736406 /* log(sqrt(2*pi)) */
#endif
#ifndef M_LN_SQRT_PId2
#define M_LN_SQRT_PId2 0.225791352644727432363097614947 /* log(sqrt(pi/2)) */
#endif

#ifdef MATHLIB_STANDALONE
#undef FALSE
#undef TRUE
typedef enum { FALSE = 0, TRUE } Rboolean;
#else
# include <R_ext/Boolean.h>
#endif

#ifndef MATHLIB_STANDALONE
#define bessel_i Rf_bessel_i
#define bessel_j Rf_bessel_j
#define bessel_k Rf_bessel_k
#define bessel_y Rf_bessel_y
#define beta Rf_beta
#define choose Rf_choose
#define dbeta Rf_dbeta
#define dbinom Rf_dbinom
#define dcauchy Rf_dcauchy
#define dchisq Rf_dchisq
#define dexp Rf_dexp
#define df Rf_df
#define dgamma Rf_dgamma
#define dgeom Rf_dgeom
#define dhyper Rf_dhyper
#define digamma Rf_digamma
#define dlnorm Rf_dlnorm
#define dlogis Rf_dlogis
#define dnbeta Rf_dnbeta
#define dnbinom Rf_dnbinom
#define dnchisq Rf_dnchisq
#define dnf Rf_dnf
#define dnorm4 Rf_dnorm4
#define dnt Rf_dnt
#define dpois Rf_dpois
#define dpsifn Rf_dpsifn
#define dsignrank Rf_dsignrank
#define dt Rf_dt
#define dtukey Rf_dtukey
#define dunif Rf_dunif
#define dweibull Rf_dweibull
#define dwilcox Rf_dwilcox
#define fmax2 Rf_fmax2
#define fmin2 Rf_fmin2
#define fprec Rf_fprec
#define fround Rf_fround
#define ftrunc Rf_ftrunc
#define fsign Rf_fsign
#define gammafn Rf_gammafn
#define imax2 Rf_imax2
#define imin2 Rf_imin2
#define lbeta Rf_lbeta
#define lchoose Rf_lchoose
#define lgammafn Rf_lgammafn
#define lgamma1p Rf_lgamma1p
#define log1pmx Rf_log1pmx
#define logspace_add Rf_logspace_add
#define logspace_sub Rf_logspace_sub
#define pbeta Rf_pbeta
#define pbeta_raw Rf_pbeta_raw
#define pbinom Rf_pbinom
#define pcauchy Rf_pcauchy
#define pchisq Rf_pchisq
#define pentagamma Rf_pentagamma
#define pexp Rf_pexp
#define pf Rf_pf
#define pgamma Rf_pgamma
#define pgeom Rf_pgeom
#define phyper Rf_phyper
#define plnorm Rf_plnorm
#define plogis Rf_plogis
#define pnbeta Rf_pnbeta
#define pnbinom Rf_pnbinom
#define pnchisq Rf_pnchisq
#define pnf Rf_pnf
#define pnorm5 Rf_pnorm5
#define pnorm_both Rf_pnorm_both
#define pnt Rf_pnt
#define ppois Rf_ppois
#define psignrank Rf_psignrank
#define psigamma Rf_psigamma
#define pt Rf_pt
#define ptukey Rf_ptukey
#define punif Rf_punif
#define pythag Rf_pythag
#define pweibull Rf_pweibull
#define pwilcox Rf_pwilcox
#define qbeta Rf_qbeta
#define qbinom Rf_qbinom
#define qcauchy Rf_qcauchy
#define qchisq Rf_qchisq
#define qchisq_appr Rf_qchisq_appr
#define qexp Rf_qexp
#define qf Rf_qf
#define qgamma Rf_qgamma
#define qgeom Rf_qgeom
#define qhyper Rf_qhyper
#define qlnorm Rf_qlnorm
#define qlogis Rf_qlogis
#define qnbeta Rf_qnbeta
#define qnbinom Rf_qnbinom
#define qnchisq Rf_qnchisq
#define qnf Rf_qnf
#define qnorm5 Rf_qnorm5
#define qnt Rf_qnt
#define qpois Rf_qpois
#define qsignrank Rf_qsignrank
#define qt Rf_qt
#define qtukey Rf_qtukey
#define qunif Rf_qunif
#define qweibull Rf_qweibull
#define qwilcox Rf_qwilcox
#define rbeta Rf_rbeta
#define rbinom Rf_rbinom
#define rcauchy Rf_rcauchy
#define rchisq Rf_rchisq
#define rexp Rf_rexp
#define rf Rf_rf
#define rgamma Rf_rgamma
#define rgeom Rf_rgeom
#define rhyper Rf_rhyper
#define rlnorm Rf_rlnorm
#define rlogis Rf_rlogis
#define rnbeta Rf_rnbeta
#define rnbinom Rf_rnbinom
#define rnchisq Rf_rnchisq
#define rnf Rf_rnf
#define rnorm Rf_rnorm
#define rnt Rf_rnt
#define rpois Rf_rpois
#define rsignrank Rf_rsignrank
#define rt Rf_rt
#define rtukey Rf_rtukey
#define runif Rf_runif
#define rweibull Rf_rweibull
#define rwilcox Rf_rwilcox
#define sign Rf_sign
#define tetragamma Rf_tetragamma
#define trigamma Rf_trigamma
#endif

#define rround fround
#define prec fprec
#undef trunc
#define trunc ftrunc

/* log(1 - exp(x)) in stable form: */
#define R_Log1_Exp(x) ((x) > -M_LN2 ? log(-expm1(x)) : log1p(-exp(x)))

/* R's versions with !R_FINITE checks */

#if defined(MATHLIB_STANDALONE) && defined(HAVE_WORKING_LOG)
#define R_log log
#else
double R_log(double x);
#endif
double R_pow(double x, double y);
double R_pow_di(double, int);

/* Random Number Generators */
double norm_rand(void);

double unif_rand(void);
double exp_rand(void);
#ifdef MATHLIB_STANDALONE
void set_seed(unsigned int, unsigned int);
void get_seed(unsigned int *, unsigned int *);
#endif
/* Normal Distribution */
#define pnorm pnorm5
#define qnorm qnorm5
#define dnorm dnorm4
double dnorm(double, double, double, int);
double pnorm(double, double, double, int, int);
double qnorm(double, double, double, int, int);
double rnorm(double, double);
void pnorm_both(double, double *, double *, int, int);/* both tails */
/* Uniform Distribution */

double dunif(double, double, double, int);
double punif(double, double, double, int, int);
double qunif(double, double, double, int, int);
double runif(double, double);

/* Gamma Distribution */

double dgamma(double, double, double, int);
double pgamma(double, double, double, int, int);
double qgamma(double, double, double, int, int);
double rgamma(double, double);

double log1pmx(double);
double lgamma1p(double);
double logspace_add(double, double);
double logspace_sub(double, double);
/* Beta Distribution */

double dbeta(double, double, double, int);
double pbeta(double, double, double, int, int);
double qbeta(double, double, double, int, int);
double rbeta(double, double);

/* Lognormal Distribution */

double dlnorm(double, double, double, int);
double plnorm(double, double, double, int, int);
double qlnorm(double, double, double, int, int);
double rlnorm(double, double);

/* Chi-squared Distribution */
double dchisq(double, double, int);
double pchisq(double, double, int, int);
double qchisq(double, double, int, int);
double rchisq(double);

/* Non-central Chi-squared Distribution */
double dnchisq(double, double, double, int);
double pnchisq(double, double, double, int, int);
double qnchisq(double, double, double, int, int);
double rnchisq(double, double);

/* F Distribution */

double df(double, double, double, int);
double pf(double, double, double, int, int);
double qf(double, double, double, int, int);
double rf(double, double);

/* Student t Distribution */
double dt(double, double, int);
double pt(double, double, int, int);
double qt(double, double, int, int);
double rt(double);

/* Binomial Distribution */
double dbinom(double, double, double, int);
double pbinom(double, double, double, int, int);
double qbinom(double, double, double, int, int);
double rbinom(double, double);

/* Multinomial Distribution */
void rmultinom(int, double*, int, int*);

/* Cauchy Distribution */
double dcauchy(double, double, double, int);
double pcauchy(double, double, double, int, int);
double qcauchy(double, double, double, int, int);
double rcauchy(double, double);

/* Exponential Distribution */
double dexp(double, double, int);
double pexp(double, double, int, int);
double qexp(double, double, int, int);
double rexp(double);

/* Geometric Distribution */
double dgeom(double, double, int);
double pgeom(double, double, int, int);
double qgeom(double, double, int, int);
double rgeom(double);

/* Hypergeometric Distribution */
double dhyper(double, double, double, double, int);
double phyper(double, double, double, double, int, int);
double qhyper(double, double, double, double, int, int);
double rhyper(double, double, double);

/* Negative Binomial Distribution */
double dnbinom(double, double, double, int);
double pnbinom(double, double, double, int, int);
double qnbinom(double, double, double, int, int);
double rnbinom(double, double);

/* Poisson Distribution */
double dpois(double, double, int);
double ppois(double, double, int, int);
double qpois(double, double, int, int);
double rpois(double);

/* Weibull Distribution */
double dweibull(double, double, double, int);
double pweibull(double, double, double, int, int);
double qweibull(double, double, double, int, int);

double rweibull(double, double);

/* Logistic Distribution */

double dlogis(double, double, double, int);
double plogis(double, double, double, int, int);
double qlogis(double, double, double, int, int);
double rlogis(double, double);

/* Non-central Beta Distribution */

double dnbeta(double, double, double, double, int);
double pnbeta(double, double, double, double, int, int);
double qnbeta(double, double, double, double, int, int);
double rnbeta(double, double, double);

/* Non-central F Distribution */

double dnf(double, double, double, double, int);
double pnf(double, double, double, double, int, int);
double qnf(double, double, double, double, int, int);
/* Non-central Student t Distribution */

double dnt(double, double, double, int);
double pnt(double, double, double, int, int);
double qnt(double, double, double, int, int);

/* Studentized Range Distribution */

double ptukey(double, double, double, double, int, int);
double qtukey(double, double, double, double, int, int);
/* Wilcoxon Rank Sum Distribution */

double dwilcox(double, double, double, int);
double pwilcox(double, double, double, int, int);
double qwilcox(double, double, double, int, int);
double rwilcox(double, double);

/* Wilcoxon Signed Rank Distribution */
double dsignrank(double, double, int);
double psignrank(double, double, int, int);
double qsignrank(double, double, int, int);
double rsignrank(double);

/* Gamma and Related Functions */
double gammafn(double);
double lgammafn(double);
void dpsifn(double, int, int, int, double*, int*, int*);
double psigamma(double, double);
double digamma(double);
double trigamma(double);
double tetragamma(double);
double pentagamma(double);
double beta(double, double);
double lbeta(double, double);
double choose(double, double);
double lchoose(double, double);

/* Bessel Functions */

double bessel_i(double, double, double);
double bessel_j(double, double);
double bessel_k(double, double, double);
double bessel_y(double, double);

/* General Support Functions */
double pythag(double, double);
#ifndef HAVE_EXPM1
double expm1(double); /* = exp(x)-1 {care for small x} */
#endif
#ifndef HAVE_LOG1P
double log1p(double); /* = log(1+x) {care for small x} */
#endif
int imax2(int, int);
int imin2(int, int);
double fmax2(double, double);
double fmin2(double, double);
double sign(double);
double fprec(double, double);
double fround(double, double);
double fsign(double, double);
double ftrunc(double);

double log1pmx(double); /* Accurate log(1+x) - x, {care for small x} */
double lgamma1p(double);/* accurate log(gamma(x+1)), small x (0 < x < 0.5) */

/* Compute the log of a sum or difference from logs of terms, i.e.,
 *
 *     log (exp (logx) + exp (logy))
 * or  log (exp (logx) - exp (logy))
 *
 * without causing overflows or throwing away too much accuracy:
 */
double logspace_add(double logx, double logy);
double logspace_sub(double logx, double logy);

/* ----------------- Private part of the header file ------------------- */

/* old-R Compatibility */
#define snorm norm_rand
#define sunif unif_rand
#define sexp exp_rand

#ifdef MATHLIB_PRIVATE
#define d1mach Rf_d1mach
#define i1mach Rf_i1mach
#define gamma_cody Rf_gamma_cody
double gamma_cody(double); /* used in arithmetic.c */
#endif /* MATHLIB_PRIVATE */

double Rf_d1mach(int); /* used in port.c in package stats */
int Rf_i1mach(int); /* used in port.c in package stats */

#ifdef MATHLIB_STANDALONE
#ifndef MATHLIB_PRIVATE_H

/* If isnan is a macro, as C99 specifies, the C++
   math header will undefine it. This happens on OS X */
#ifdef __cplusplus
int R_isnancpp(double); /* in mlutils.c */
# define ISNAN(x) R_isnancpp(x)
#else
# define ISNAN(x) (isnan(x)!=0)
#endif

/* We don't have config information available to do anything else */
#define R_FINITE(x) R_finite(x)
int R_finite(double);

#ifdef WIN32 /* not Win32 as no config information */
# define NA_REAL (*_imp__NA_REAL)
# define R_NegInf (*_imp__R_NegInf)
# define R_PosInf (*_imp__R_PosInf)
# define N01_kind (*_imp__N01_kind)
# endif
#endif /* not MATHLIB_PRIVATE_H */
#endif /* MATHLIB_STANDALONE */

#ifndef R_EXT_PRINT_H_
void REprintf(char*, ...);
#endif

#ifdef __cplusplus
}
#endif

#endif /* RMATH_H */

Claims (44)

1. A method for determining a likelihood that a subject contributed genetic material to a test genetic material sample, said method comprising:
providing a test genetic material sample;
performing a single nucleotide polymorphism analysis on the test genetic material sample, whereby at least 50 different single nucleotide polymorphisms in said test genetic material sample are analyzed, thereby creating a sample SNP signature; and
comparing the sample SNP signature to a subject's SNP signature to determine a likelihood that the subject contributed genetic material to a test genetic material sample.
2. The method of Claim 1, wherein comparing the sample SNP signature to determine the likelihood that it matches a subject's SNP signature further comprises providing and employing a reference SNP signature.
3. The method of Claim 2, wherein the reference SNP signature has a similar ancestral make-up as that of the sample SNP signature.
4. The method of Claim 1, wherein the test genetic material sample is likely to be contaminated.
5. The method of Claim 4, wherein the contamination comprises bacterial genetic material.
6. The method of Claim 4, wherein the contamination comprises nonhuman genetic material.
7. A method of characterizing a test genetic material sample, said method comprising:
providing a first allele frequency for a SNP for a person of interest (POI);
providing a second allele frequency for the SNP from a reference population of genetic material;
providing a third allele frequency for the SNP for the test genetic material sample;
repeating the above processes for at least 10 different SNPs; and
analyzing the first, second, and third allele frequencies to characterize the test genetic material sample.
8. The method of Claim 7, wherein the processes are repeated for at least 50 different SNPs.
9. The method of Claim 8, wherein analyzing the first, second, and third allele frequencies is achieved by the following processes:
a) determining the absolute value of the difference in the allele frequencies of the person of interest and the reference population;
b) determining the absolute value of the difference in the allele frequencies of the person of interest and the test genetic material sample; and
c) subtracting b) from a) to obtain a distance value for the SNP.
10. The method of Claim 9, wherein when the distance value for the SNP is positive, it is more likely that the POI contributed genetic material to the test genetic material sample, when the distance value is negative, the POI's genetic material is more likely to be part of the reference sample, and when the distance value is 0, the POI's genetic material is equally likely to be in the test genetic material sample and the reference sample.
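
The per-SNP distance value of Claims 9 and 10 can be sketched in C as follows. This is an illustrative sketch only: it assumes the three allele frequencies are already available as plain numerical values (as in Claim 12), and the function names snp_distance and snp_distances are hypothetical rather than taken from Appendix A.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Distance value for one SNP, following steps a) to c) of Claim 9:
 *   a) |poi - ref|   POI versus reference population
 *   b) |poi - mix|   POI versus test genetic material sample
 *   c) a) - b)
 * A positive result means the POI is closer to the test sample than
 * to the reference; a negative result means the opposite. */
double snp_distance(double poi, double ref, double mix)
{
    return fabs(poi - ref) - fabs(poi - mix);
}

/* Fill out[j] with the distance value for each of n SNPs.
 * All arrays are assumed to be indexed by the same SNP order. */
void snp_distances(const double *poi, const double *ref,
                   const double *mix, double *out, size_t n)
{
    assert(poi != NULL && ref != NULL && mix != NULL && out != NULL);
    for (size_t j = 0; j < n; ++j)
        out[j] = snp_distance(poi[j], ref[j], mix[j]);
}
```

For a POI frequency of 1.0, a reference frequency of 0.25, and a test-sample frequency of 0.5, the distance is 0.75 - 0.5 = 0.25, which is positive and so points toward the POI having contributed to the test sample for that SNP.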
11. The method of Claim 10, wherein the above processes are repeated for at least 50,000 SNPs.
12. The method of Claim 10, wherein the frequencies are expressed as a numerical value.
13. The method of Claim 10, wherein the frequencies are expressed as fluorescence levels.
14. The method of Claim 10, wherein the frequencies are expressed as normalized values for the POI, reference population, and test genetic material sample.
15. The method of Claim 8, wherein the characterization allows one to determine if there is at least a 99% likelihood that the person of interest contributed to the sample.
16. The method of Claim 8, wherein the characterization determines that the test genetic material sample contains genetic material from a person other than the person of interest.
17. The method of Claim 8, wherein the characterization determines a likelihood that the test genetic material sample contains genetic material from the person of interest.
18. The method of Claim 8, wherein the test genetic material sample comprises degraded genetic material.
19. The method of Claim 8, wherein the test genetic material sample is collected from a crime scene and the characterization is performed to identify if the test genetic material sample includes DNA from the person of interest.
20. The method of Claim 8, further comprising the process of collecting a test genetic material sample, running the sample on a SNP detecting array, and monitoring what SNPs are present in the sample, thereby providing the third allele frequency for the SNP for the test genetic material sample.
21. The method of Claim 8, wherein providing a third allele frequency for the SNP for the test genetic material sample comprises having the frequency for the SNP for the test genetic material sample.
22. The method of Claim 8, wherein the characterization comprises the following analysis:

$$T(Y_i) = \frac{\operatorname{mean}\big(D(Y_{ij})\big) - \mu_0}{\operatorname{sd}\big(D(Y_{ij})\big) / \sqrt{s}}$$

wherein $\mu_0$ is the mean of $D(Y_k)$ over individuals $Y_k$ not in the mixture, $\operatorname{sd}(D(Y_{ij}))$ is the standard deviation of $D(Y_{ij})$ for all SNPs $j$ and individual $Y_i$, $\sqrt{s}$ is the square root of the number of SNPs, and $D(Y_{ij}) = |Y_{ij} - Pop_j| - |Y_{ij} - M_j|$, where $Y_{ij}$ = allele frequency of individual $i$ for SNP $j$, $Pop_j$ = allele frequency of reference population for SNP $j$, and $M_j$ = allele frequency of mixture for SNP $j$.
23. The method of Claim 22, wherein µ0 is zero.
24. The method of Claim 8, wherein the test genetic material sample comprises genetic material from at least two different organisms.
25. The method of Claim 8, wherein the test genetic material sample comprises genetic material from at least 10 different organisms.
26. The method of Claim 8, wherein the test genetic material sample comprises genetic material from at least two different humans.
27. The method of Claim 8, wherein the test genetic material sample comprises genetic material from at least 100 different organisms.
28. The method of Claim 8, wherein the characterization is achieved without knowing the number of individuals that contributed to the test genetic material sample.
29. The method of Claim 8, wherein the characterization is achieved without computationally considering the number of individuals that contributed to the test genetic material sample.
30. The method of Claim 8, wherein the method is performed on a computer and wherein the characterization is output to a user.
31. The method of Claim 30, wherein the computer comprises software for implementing the method.
32. The method of Claim 31, wherein the software comprises that attached in Appendix A.
33. A method of characterizing a test genetic material sample to determine if a person of interest's ("POI's") genetic material is within the test genetic material sample, said method comprising:
providing a SNP analysis of the test genetic material sample;
providing a SNP analysis of a reference genetic material sample;
providing a SNP analysis of a POI's genetic material;
in a first comparison, comparing the SNP analysis of the test genetic material sample to the SNP analysis of the POI's genetic material;
in a second comparison, comparing the SNP analysis of the reference genetic material to the SNP analysis of the POI's genetic material; and
comparing the first and second comparisons, thereby determining if the POI's genetic material is likely in the test genetic material sample.
34. The method of Claim 33, wherein the SNP analysis of the POI's genetic material comprises the SNP identities of at least 100 SNPs.
35. The method of Claim 33, wherein genomic DNA from the POI is present in the test genetic material sample in an amount of less than 1% of total genomic DNA in the test genetic material sample.
36. The method of Claim 33, wherein DNA from the POI is present in the test genetic material sample in an amount of less than 0.1% of the total genomic DNA in the test genetic material sample.
37. The method of Claim 33, wherein a probe is used to analyze the SNP of the test genetic material sample, and wherein the probe variance is less than 20%.
38. The method of Claim 33, wherein at least 1,000 SNPs are analyzed in the test genetic material sample.
39. A kit for analyzing a test genetic material sample, said kit comprising:
software on a computer readable format for implementing the method of Claim 33; and
a set of probes for binding to and detecting one or more SNPs.
40. A method for determining if a person of interest contributed genetic material to a test genetic material sample, said method comprising determining a bias of an allele frequency within SNPs of the test genetic material sample relative to a reference and a subject's SNP signature.
41. A system for determining if a subject contributed genetic material to a sample, the system comprising:
an input module configured to allow the input of one or more of a sample SNP signature, a reference SNP signature, and a subject SNP signature;
a module configured to determine a bias of an allele frequency within SNPs of the sample SNP signature relative to the reference SNP signature and the subject SNP signature; and
a module configured to output the bias, wherein one or more of the modules is executed on a computing device.
42. The system of Claim 41, further comprising a module configured to provide a sample SNP signature.
43. The system of Claim 41, further comprising a module configured to provide a reference SNP signature.
44. The system of Claim 41, further comprising a module configured to provide a subject SNP signature.

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US8291208P 2008-07-23 2008-07-23
US61/082,912 2008-07-23
PCT/US2009/051441 WO2010011776A1 (en) 2008-07-23 2009-07-22 Method of characterizing sequences from genetic material samples

Publications (1)

Publication Number Publication Date
CA2731830A1 true CA2731830A1 (en) 2010-01-28

Family

ID=41129339

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2731830A Abandoned CA2731830A1 (en) 2008-07-23 2009-07-22 Method of characterizing sequences from genetic material samples

Country Status (8)

Country Link
US (2) US20100086926A1 (en)
EP (1) EP2332082A1 (en)
CN (1) CN102165456B (en)
AU (1) AU2009274031A1 (en)
BR (1) BRPI0915619A2 (en)
CA (1) CA2731830A1 (en)
CO (1) CO6351830A2 (en)
WO (1) WO2010011776A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010126614A2 (en) 2009-04-30 2010-11-04 Good Start Genetics, Inc. Methods and compositions for evaluating genetic markers
WO2011067765A1 (en) * 2009-12-03 2011-06-09 Yissum Research Development Company Of The Hebrew University Of Jerusalem, Ltd. System and method for analyzing dna mixtures
US9163281B2 (en) 2010-12-23 2015-10-20 Good Start Genetics, Inc. Methods for maintaining the integrity and identification of a nucleic acid template in a multiplex sequencing reaction
US9228233B2 (en) 2011-10-17 2016-01-05 Good Start Genetics, Inc. Analysis methods
US8209130B1 (en) 2012-04-04 2012-06-26 Good Start Genetics, Inc. Sequence assembly
US10227635B2 (en) 2012-04-16 2019-03-12 Molecular Loop Biosolutions, Llc Capture reactions
CA2890441A1 (en) * 2012-11-07 2014-05-15 Good Start Genetics, Inc. Methods and systems for identifying contamination in samples
EP2971159B1 (en) 2013-03-14 2019-05-08 Molecular Loop Biosolutions, LLC Methods for analyzing nucleic acids
US10851414B2 (en) 2013-10-18 2020-12-01 Good Start Genetics, Inc. Methods for determining carrier status
WO2015175530A1 (en) 2014-05-12 2015-11-19 Gore Athurva Methods for detecting aneuploidy
US11328794B2 (en) 2014-06-18 2022-05-10 The Regents Of The University Of California Method for determining relatedness of genomic samples using partial sequence information
JP2016051461A (en) * 2014-08-29 2016-04-11 日本コントロールシステム株式会社 Clustering apparatus, clustering method, and program
WO2016040446A1 (en) 2014-09-10 2016-03-17 Good Start Genetics, Inc. Methods for selectively suppressing non-target sequences
US10429399B2 (en) 2014-09-24 2019-10-01 Good Start Genetics, Inc. Process control for increased robustness of genetic assays
EP4095261A1 (en) 2015-01-06 2022-11-30 Molecular Loop Biosciences, Inc. Screening for structural variants
US10854316B2 (en) * 2015-12-03 2020-12-01 Syracuse University Methods and systems for prediction of a DNA profile mixture ratio
CN105463116B (en) * 2016-01-15 2018-08-28 中南大学 A kind of Forensic medicine composite detection kit and detection method based on 20 triallelic SNP genetic markers
WO2018144135A1 (en) * 2017-01-31 2018-08-09 Counsyl, Inc. Systems and methods for inferring genetic ancestry from low-coverage genomic data
JP2020515978A (en) * 2017-03-29 2020-05-28 ナントミクス,エルエルシー Multi-sequence file signature hash
CN108823296B (en) * 2017-05-05 2021-12-21 深圳华大基因股份有限公司 Method and kit for detecting nucleic acid sample pollution and application
AU2018317875A1 (en) * 2017-08-17 2020-03-05 Tai Diagnostics, Inc. Methods of determining donor cell-free DNA without donor genotype
US11931674B2 (en) 2019-04-04 2024-03-19 Natera, Inc. Materials and methods for processing blood samples
CN111575386B (en) * 2020-05-27 2023-10-03 广州市刑事科学技术研究所 Fluorescent composite amplification kit for detecting human Y-SNP locus and application thereof
WO2021251834A1 (en) * 2020-06-10 2021-12-16 Institute Of Environmental Science And Research Limited Methods and systems for identifying nucleic acids

Also Published As

Publication number Publication date
BRPI0915619A2 (en) 2016-11-01
CO6351830A2 (en) 2011-12-20
US10679728B2 (en) 2020-06-09
WO2010011776A1 (en) 2010-01-28
EP2332082A1 (en) 2011-06-15
CN102165456A (en) 2011-08-24
US20100086926A1 (en) 2010-04-08
AU2009274031A1 (en) 2010-01-28
US20170206311A1 (en) 2017-07-20
CN102165456B (en) 2014-07-23


Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20140417

FZDE Discontinued

Effective date: 20170627