HK1151069B

HK1151069B - Method of pooling samples for performing a biological assay

Info

Publication number: HK1151069B
Application number: HK11105101.2A
Authority: HK
Inventors: 阿德里安乌斯‧拉姆贝图斯‧约翰纳斯‧韦雷吉肯; 安内米克‧波拉‧容格乌斯; 赫拉尔杜斯‧安东尼厄斯‧阿诺尔德斯‧阿尔贝斯
Original assignee: 亨德里克斯基因有限公司
Priority date: 2007-10-31
Filing date: 2008-10-31
Publication date: 2013-11-22

Description

Method of pooling samples for performing bioassays

Technical Field

The present invention relates to the field of assays with categorical outcomes performed on biological samples, and more particularly to a sample preparation method for bioassays with categorical outcomes. The present invention provides methods of pooling samples and the use of those methods in the genotyping of allelic variants. The invention also provides a method of analysing a plurality of samples, a pooling device for pooling a plurality of samples into a pooled sample, an analysing device comprising a processor for analysing a series of pooled samples, a computer program product for carrying out the method of pooling samples and a computer program product for carrying out the method of analysing a plurality of samples.

Background

Bioassays are methods of determining the identity, concentration or presence of a biological analyte in a sample. Bioassays are an inherent part of all scientific research, most notably in the field of life sciences, especially molecular biology.

One particular type of analysis in molecular biology involves genotyping and sequencing. Genotyping and sequencing refer to the process of determining the genotype of an individual using a bioassay. Current methods include PCR, DNA and RNA sequencing, and hybridization to DNA or RNA microarrays immobilized on various carriers (e.g., glass slides or beads). This technology is essential for paternity and maternity testing, for clinical studies of disease-related genes, and for other studies of the genetic control of any trait in any species, such as scanning the entire genome for QTL (quantitative trait loci).

With current technology, almost all genotyping is only partial. That is, only a small fraction of an individual's genotype is determined. In many cases, this is not a problem. For example, when testing for paternity or maternity, only 10 to 20 genomic regions are studied to determine whether individuals are related, and these 10 to 20 genomic regions are only a small part of the human genome.

Single Nucleotide Polymorphisms (SNPs) are the most abundant type of polymorphism in the genome. With the parallel development of high-density SNP marker maps and high-throughput SNP genotyping technology, SNPs have become the marker of choice for many genetic studies. A large number of samples are required in mapping and association studies and in genomic screening experiments.

In order to provide high-throughput genotyping capabilities, array technology has been developed. Such techniques are available from commercial suppliers, such as Affymetrix (microarray-based GeneChip Mapping arrays), Illumina (BeadArray™), Biotrove (OpenArray™) and Sequenom (MassARRAY™). A large number of SNPs are available today, or will be in the near future, for many species (humans, livestock, plants, bacteria and viruses). Innovative technologies have enabled genome-wide genotyping or association studies and related genome-wide screening programs for plant and animal breeding. However, the cost of such methods is still very high, requiring budgets of up to millions of dollars if the samples are genotyped individually. Thus, studies aimed at determining SNPs in any species currently involve analysis of only a limited number of individuals. The present invention is therefore very important, because it can significantly reduce the cost of genotyping.

In order to fully understand genetic diversity, the complete sequence of the genome (or of its relevant part) must be known. However, the cost of determining the complete sequence is even higher than that of genotyping as described in the preceding paragraph. Cost aside, it is also desirable that sequencing replace genotyping so as to provide an individual's genotype for the entire genome or for specific regions thereof. The invention also provides methods for reducing sequencing costs.

Sample pooling is often used in the study of categorical traits as a means of reducing assay costs. The presence of a property in a pool consisting of a mixture of several samples indicates that at least one sample in the pool has that property. For example, DNA pools are used for:

- estimating allele frequencies in a population.

By taking a suitable sample of individuals from the population, the raw allele frequency of allele 1 is calculated as the ratio of the result for allele 1 in the pool to the sum of the result for allele 1 and the result for allele 2.

- case-control association studies, wherein cases and controls are assigned to different pools, and

- reconstructing haplotypes for a few individuals and a few SNPs.

Based on the allele frequencies measured in the pool, the haplotypes can be estimated by different algorithms, such as maximum likelihood. The term haplotype frequency is synonymous with the term joint distribution of markers.
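The raw allele frequency estimate described for DNA pools above can be sketched in a few lines of Python. This is only an illustration: the function name and the example signal values are hypothetical, and real pooled data would normally need correction for unequal allele amplification, which is not modelled here.

```python
def raw_allele_frequency(signal_allele_1: float, signal_allele_2: float) -> float:
    """Raw frequency of allele 1 in a pool: signal_1 / (signal_1 + signal_2)."""
    total = signal_allele_1 + signal_allele_2
    if total <= 0:
        raise ValueError("the two allele signals must sum to a positive value")
    return signal_allele_1 / total


# Hypothetical pooled signals for the two alleles of one SNP.
frequency = raw_allele_frequency(30.0, 70.0)
print(f"raw frequency of allele 1: {frequency:.2f}")
```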

An important drawback of sample pooling is that the measured properties are determined only for the pool as a whole, and not for any individual sample in the pool. One exception is the DNA pool for genotyping three individuals (father, mother and child), when two pools of two individuals each (father + child and mother + child) are prepared. The allele frequencies observed in each pool reveal the genotypes of all 3 individuals. This type of sample pooling reduces the cost by 33%, but is only possible with such trios. In all other cases, the individuals in the pooled sample must be re-analysed in order to obtain the results for the individual samples.

It would therefore be advantageous to be able to pool samples other than those of such trios while still obtaining test results for the individual samples in the pool.

Disclosure of Invention

Now, the inventors of the present invention have found that random individuals can be pooled and that the individual genotypes can be obtained from the pool when the contribution of each sample in the pool is in a fixed proportion to the contribution of every other sample, i.e. when the samples are provided not in equimolar amounts but in a specific proportion. If the test involves a quantitative measurement of a categorical variable, i.e., the test involves a categorical or discrete trait that is measured quantitatively, the results for the individual samples can be inferred from the result of the pooled test.

Indeed, the inventors of the present invention have found that, for studies of the presence of a certain allele at a certain locus in diploid animals, mixing a DNA sample of a first diploid animal having 2 possible alleles (A or B) at a single locus with a DNA sample of a second diploid animal also having 2 possible alleles (A or B) at the same locus in a ratio of 1:3 results in 8 possibilities for the presence of either allele in the mixture ((2) + (2+2+2)), where the expected quantitative signal for a single allele (e.g. A) is 12.5% of the maximum sample signal intensity. This means that when the measured signal intensity is 37.5% of the maximum sample signal intensity, the sample contains 3 times (3×) the allele A, indicating that the signal cannot be derived from the first diploid animal but only from the second diploid animal, so that the first diploid animal has genotype BB and the second diploid animal has genotype AB. Likewise, all samples have genotype AB when the measured signal intensity is 50% of the maximum sample signal intensity. When the measured signal intensity is 0% of the maximum sample signal intensity, all samples have genotype BB. Two individuals in a pool have a total of 3 × 3 possible genotype combinations. If the accuracy of the measurements is at least 6.25%, each measurement may be assigned a value of one-eighth (1/8) of 100% or a multiple thereof. In general, each possible measurement may be assigned a value of 1/(y·((p+1)⁰ + (p+1)¹ + (p+1)² + ... + (p+1)^(n-1))) × 100% or a multiple thereof, where y = 2 (the two possible outcomes for allele A at one position, allele present or absent), p is the ploidy level, n is the number of samples, and 100% is the maximum sample signal intensity. Overall, there are (ploidy level + 1)^n possible genotype combinations.
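The spacing between result points given by the general formula above can be checked numerically. The following Python sketch is illustrative only; the function names are hypothetical, and the defaults (p = 2, y = 2) simply mirror the diploid, two-allele example in the text.

```python
from fractions import Fraction


def result_point_step(n_samples: int, ploidy: int = 2, y: int = 2) -> Fraction:
    """Step between result points, as a fraction of the maximum sample signal.

    Implements 1 / (y * ((p+1)^0 + (p+1)^1 + ... + (p+1)^(n-1))).
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    denominator = y * sum((ploidy + 1) ** i for i in range(n_samples))
    return Fraction(1, denominator)


def genotype_combinations(n_samples: int, ploidy: int = 2) -> int:
    """Number of possible genotype combinations: (ploidy + 1)^n."""
    return (ploidy + 1) ** n_samples


print(float(result_point_step(2)) * 100)            # 12.5 (% of maximum signal)
print(round(float(result_point_step(3)) * 100, 2))  # 3.85
print(genotype_combinations(2))                     # 9
```

Using exact fractions avoids rounding noise when the step is later compared with a measurement tolerance.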

Now, when samples of 3 animals (x, y and z) are pooled in a ratio of 1:3:9 (i.e., with a pooling factor of 3), there are theoretically a total of 26 possibilities for either allele in the mixture, with the expected quantitative signal for a single allele (e.g., A) being 3.85% of the maximum sample signal intensity. This means that when the measured signal intensity is 12% of the maximum sample signal intensity, the sample contains 3 times (3×) allele A, which shows that animal x has genotype BB, animal y has genotype AB, and animal z has genotype BB. Similarly, when the measured signal intensity is 96% of the maximum sample signal intensity, animal x has genotype AB, and animals y and z have genotype AA. If the accuracy of the measurements is at least 1.9%, each measurement may be assigned a value of one twenty-sixth (1/26) of 100% or a multiple thereof. (For an overview of the possible results of such a pooling experiment, see the examples below.)
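The decoding arithmetic in the 1:3 and 1:3:9 examples above amounts to rounding the signal to the nearest result point and reading the resulting allele count as a number in base (ploidy + 1). The Python sketch below illustrates this for the diploid, two-allele case only; the function and variable names are hypothetical, and an ideal, noise-free relationship between allele copies and signal is assumed.

```python
def decode_pool(signal_fraction: float, n_samples: int, ploidy: int = 2) -> list[int]:
    """Return the number of A alleles (0..ploidy) for each pooled sample.

    signal_fraction is the measured allele-A signal divided by the maximum
    sample signal intensity. Samples are assumed to be pooled in the ratio
    x^0 : x^1 : ... : x^(n-1) with x = ploidy + 1, first sample first.
    """
    x = ploidy + 1
    max_copies = ploidy * sum(x ** i for i in range(n_samples))  # 8 or 26 in the text
    copies = round(signal_fraction * max_copies)  # nearest result point
    if not 0 <= copies <= max_copies:
        raise ValueError("signal outside the range 0..100% of maximum")
    counts = []
    for _ in range(n_samples):
        copies, digit = divmod(copies, x)  # base-x digit = allele count of this sample
        counts.append(digit)
    return counts


LABELS = {0: "BB", 1: "AB", 2: "AA"}  # diploid genotype for each A-allele count

print([LABELS[c] for c in decode_pool(0.375, 2)])  # ['BB', 'AB']
print([LABELS[c] for c in decode_pool(0.12, 3)])   # ['BB', 'AB', 'BB']
print([LABELS[c] for c in decode_pool(0.96, 3)])   # ['AB', 'AA', 'AA']
```

The decoding is unambiguous because, with a common ratio of ploidy + 1, every total allele count between 0 and the maximum corresponds to exactly one combination of per-sample counts.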

The inventors of the present invention have shown that this method can then be used in a large number of assays involving quantitative measurement of analytes in a sample, wherein the results of the assays are classified according to the nature of the analytes in the sample.

In a first aspect, the present invention now provides a method of pooling samples for analysis of a categorical variable, wherein the analysis involves quantitative measurement of an analyte, said method of pooling samples comprising providing a pool of n samples, wherein the amounts of the individual samples in the pool are such that the analytes in the samples are present in a molar ratio of x⁰∶x¹∶x²∶x^(n-1), wherein x represents the number of classes of the categorical variable (or the pooling factor), x is an integer of 2 or more, such as 3, 4, 5, 6, 7 or 8, preferably 2 or 3, and n is the number of samples. The ratio x⁰∶x¹∶x²∶x^(n-1) should be understood to mean x⁰∶x¹∶x²∶...∶x^(n-1), or x⁰∶x¹∶x²∶xⁱ∶x^(n-1), where n is the number of samples and i is a progressively increasing integer whose value is between 2 and n.
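As a small illustration of the ratio defined above, the following Python sketch lists the relative analyte amounts x⁰∶x¹∶...∶x^(n-1) and the share of the pool contributed by each sample. The function names are hypothetical and serve only this example.

```python
from fractions import Fraction


def pooling_ratio(n_samples: int, x: int = 3) -> list[int]:
    """Relative analyte amounts x^0 : x^1 : ... : x^(n-1) for n samples."""
    if x < 2:
        raise ValueError("x must be an integer of 2 or more")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return [x ** i for i in range(n_samples)]


def pool_fractions(n_samples: int, x: int = 3) -> list[Fraction]:
    """Share of the total analyte in the pool contributed by each sample."""
    weights = pooling_ratio(n_samples, x)
    total = sum(weights)
    return [Fraction(w, total) for w in weights]


print(pooling_ratio(3, x=3))  # [1, 3, 9]
print(pooling_ratio(4, x=2))  # [1, 2, 4, 8]
print(pool_fractions(3))      # [Fraction(1, 13), Fraction(3, 13), Fraction(9, 13)]
```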

For pooled polyploid individuals, x is equal to (ploidy level +1), so x is 2 for haploids with two possible alleles at one position, 3 for diploids, 5 for tetraploids, and x is also equal to the number of possible genotypes.

Assuming three possible alleles, a haploid has 3 possible genotypes (X = 3), a diploid has 6 possible genotypes (X = 6), and a triploid has 10 possible genotypes (X = 10). In a diploid individual, the first allele may occur 0, 1 or 2 times, as may the second and third alleles. This makes pooling possible in the same ratio as with two alleles (x⁰∶x¹∶x²∶...∶x^(n-1), where x is again the ploidy level + 1). The signal intensities of the 3 alleles are rounded to the nearest result point (1/(y·((p+1)⁰ + (p+1)¹ + (p+1)² + ... + (p+1)^(n-1))) × 100%, where y is 2 (allele 1, 2 or 3 present or absent), p is the ploidy level, and n is the number of samples) to yield the number of copies of each allele in the pooled sample.

Thus, the ratio between two individual samples in the pool (as an example) is such that the analytes therein are present in a molar ratio of 1:x, where x is the maximum number of classes of the categorical trait.

The method in which the amounts of the individual samples in the pool form a geometric series with a common ratio of 3 is particularly suitable for genotyping allelic variants in diploid individuals, where each individual has three possible genotypes. This genotype is a categorical trait with three possible values (AA, AB and BB).

The method in which the amounts of the individual samples in the pool form a geometric series with a common ratio of 2 is particularly suitable for genotyping allelic variants in haploid individuals. For examples thereof, reference is made to the experimental section below.

In another aspect, the present invention relates to the use of the above-described method of the invention for genotyping allelic variants in haploid or polyploid individuals, wherein the number of classes of the categorical variable (x) is equal to p + 1, wherein p represents the ploidy level of said individual. For example, such use may serve to genotype an allelic variant in a diploid or haploid individual.

In another aspect, the present invention relates to a method of analysing a plurality of samples, comprising pooling said samples according to the method of the present invention as described above to provide a pooled sample, and performing said analysis on said pooled sample. The resulting quantitative result is then rounded to the nearest result point (determined by the theoretical number of intervals into which the maximum sample signal intensity is divided, one for each possible result; see below), and the rounded signal intensity is assigned to the total count of classes of the categorical variable in the pooled sample. The categorical variable of each individual sample in the pool is then determined by taking into account the proportions of the individual samples in the pool.

In a further aspect, the present invention provides a method of analysing a plurality of samples, comprising analysing a series (or set) of pooled samples obtained by the method of pooling samples identified herein above, wherein a categorical variable of the samples is analysed, and relates to a quantitative measurement of an analyte in the samples.

In a preferred embodiment of the method, the method of performing an analysis further comprises the step of deriving from the measurements the contributions of the individual samples in said sample pool.

In another aspect, the present invention provides a device for pooling a plurality of samples into a pooled sample, comprising a sample collector for providing the pooled sample, and a processor for performing the method of pooling samples as described above.

In another aspect, the present invention provides an assay device comprising a processor for analysing a series of pooled samples obtained by the method of pooling samples described above, wherein the device is arranged to analyse the categorical variable of the samples and perform a quantitative measurement of an analyte in the samples.

In a preferred embodiment of the assay device, the device further comprises a pooling device, most preferably a pooling device as disclosed above.

In a further aspect, the invention provides a computer program product, on its own or on a carrier, which, when loaded and executed in a computer, programmed computer network or other programmable device, carries out the method of pooling samples described above.

In a further aspect, the invention provides a computer program product, by itself or on a carrier, which when loaded and executed in a computer, programmed computer network or other programmable device, carries out the method of analysing a plurality of samples as described above, the method comprising analysing a series of pooled samples obtained by the method of pooling samples described above, wherein the sample is analysed for a categorical variable and relates to the quantitative measurement of an analyte in the sample.

In a preferred embodiment of the computer program product, the method further comprises the step of pooling according to the method of pooling samples described above.

By using the method of the present invention, the cost of analysis can be greatly reduced, i.e., typically by 50%, or even by 66% or more.

Drawings

FIG. 1 shows a graph of the correlation between allele frequencies based on pooled data (Y-axis) and allele frequencies based on individual measurements (X-axis).

FIG. 2 shows a graph of the relationship between the measured allele frequency (Y-axis) of an individual and the predicted allele frequency (X-axis) in a pool.

FIG. 3 shows a graph of the relationship between corrected allele frequency (Y-axis) in the pool and the measured allele frequency (X-axis) of an individual after typing of the individual.

Figure 4 shows a graph of the difference between the expected (based on individual typing) and predicted allele frequencies for pool 1 in experiment 1.

Figure 5 shows a graph of the relationship between expected (based on individual typing) and predicted allele frequencies for all pools in experiment 2.

Figure 6 shows a graph of the difference between the expected (based on individual typing) and predicted allele frequencies for all pools in experiment 2.

Detailed Description

The term "categorical variable" as used herein refers to a discrete variable such as a property or trait, e.g., whether the analyte or a property thereof is present, or whether an allelic trait is present in the analyte in homozygous or heterozygous form. "Discrete" has the same meaning as "categorical" and refers to non-linear or discontinuous. "Variable" generally refers to a (categorical) trait that measures a characteristic of a sample. The categorical variable may be binary (consisting of two classes). "Class" refers to a group or category to which a measurement can be assigned. Thus, a purely categorical variable is one to which a category can be assigned, the categorical variable taking the value of one of several possible categories (classes). In particular, the categorical variable may relate to the presence of a genetic marker, such as a Single Nucleotide Polymorphism (SNP) or any other genetic marker, an allele, an immune response, a disease, resistance, hair color, sex, disease infection status, genotype or any other trait or characteristic of the sample or organism. Although they can be measured quantitatively, for example as a resulting analyte signal that can be received, read and/or recorded by an analysis device, categorical variables have no quantitative meaning per se and their categories have no inherent ordering. For example, sex is a categorical variable having two categories (male or female, typically coded as 0 and 1), which preferably represent unordered categories. Genotype is also a categorical variable with multiple, preferably unordered, classes (AA, Aa and aa, sometimes coded as 2, 1 and 0).

In some aspects of the invention, the sample may be any sample for which a categorical variable is measured. The sample may be a biological sample, such as a tissue or body fluid sample of an animal (including a human) or plant, or an environmental sample, such as a soil, air or water sample. The sample may be a (partially) purified or an untreated (raw) sample. The sample is preferably a nucleic acid sample, such as a DNA sample.

The analyte whose presence or form is measured in a quantitative test can be a chemical substance or an organism. In a preferred embodiment, the analyte is a biomolecule and the categorical variable is a variant of said biomolecule. Preferably, the biomolecule is a nucleic acid, especially a polynucleotide, such as RNA or DNA, and the variant may for example be a nucleic acid polymorphism in said polynucleotide, such as an allelic variant, most preferably a SNP, or a base identity at a specific nucleotide position.

Thus, an analyte as defined herein may be a DNA molecule that exhibits a certain categorical variable (e.g., the identity of a base at a particular nucleotide position in the nucleic acid molecule, with a categorical value of A, T, C or G). The base identity of a particular nucleotide position can be measured using quantitative tests, for example based on the fluorescence of a cDNA copy from the nucleotide incorporating a fluorescent analogue, as is known in the art of DNA sequencing. A classification value is assigned to a nucleotide position, for example adenine, by a quantitative level of fluorescence emitted by the analogue in that particular position of the DNA and measured by the analysis means.

In determining the base identity of a particular nucleotide position, the present invention involves pooling individual samples for which the nucleotide sequence of a particular nucleic acid is to be determined. Once it is recognised that sequencing involves determining the signal of any one of four possible bases, where the presence or absence of a signal for any particular base at a particular position in, for example, a sequencing gel corresponds to the presence or absence of that base at a particular nucleotide position in the nucleic acid, it can be appreciated that the method of the invention is suitable for sequencing (analysis). Pooling two samples in the ratios described herein before running the sequencing gel enables the source of any particular signal to be determined, and thus the sequence of each individual nucleic acid.

An "analyte" may also be a polypeptide, such as a protein, peptide, or amino acid. The analyte may also be a nucleic acid, nucleic acid probe, antibody, antigen, receptor, hapten, or ligand for a receptor, or a fragment thereof, or a (fluorescent) label, chromogen or radioisotope. In fact, the analyte may be formed by any chemical or physical substance that can be measured quantitatively and used to determine the class of the categorical variable.

The term "nucleotide" as used herein refers to a compound comprising a purine (adenine or guanine) or pyrimidine (thymine, cytosine or uracil) base linked to the C-1 carbon of a sugar, typically ribose (RNA) or deoxyribose (DNA), and further comprising one or more phosphate groups linked to the C-5 carbon of the sugar. The term includes the individual building blocks of nucleic acids or polynucleotides, in which the sugar units of the individual nucleotides are linked by phosphodiester bridges to form a sugar-phosphate backbone with pendant purine or pyrimidine bases.

The term "nucleic acid" as used herein includes polymers of deoxyribonucleotides or ribonucleotides, i.e., polynucleotides, in either single- or double-stranded form, and unless otherwise limited, encompasses known analogs (e.g., peptide nucleic acids) having the essential properties of natural nucleotides in that they hybridize to single-stranded nucleic acids in a manner similar to naturally occurring nucleotides. The polynucleotide may be the full-length sequence or a subsequence of a native or heterologous structural or regulatory gene. Unless otherwise indicated, the term includes the specified sequence as well as its complement. Thus, a DNA or RNA having a backbone modified for stability or for other reasons is a "polynucleotide" as that term is intended herein. In addition, a DNA or RNA that comprises an unusual base, such as inosine, or a modified base, such as a tritylated base (to name just two examples), is a polynucleotide as the term is used herein.

The term "quantitative measurement" refers to determining the amount of an analyte in a sample. The term "quantitative" refers to the fact that the measurement may be expressed as a numerical value. The numerical value may relate to a measure, dimension, degree, quantity, volume, concentration, height, depth, width, extent, length, weight, or area. Quantitative measurements may involve measuring the intensity, peak height or peak area of a signal, such as a chromogenic or fluorogenic signal, or any other quantitative signal. Generally, when determining the presence or form of an analyte, the measurement will involve an instrument signal. For example, when determining the presence of a SNP, the measurement will involve a hybridization signal, which will typically provide a fluorescence intensity measured by a fluorometer. When determining the presence of an immune response, the measurement will involve measurement of antibody titer, which can also typically be provided as a fluorescence intensity. The measurements need not be continuous, but may involve discrete intervals or categories. The measurement may also be semi-quantitative. As long as the measurement can be resolved into 2ⁿ − 1, 3ⁿ − 1 or xⁿ − 1 parts, preferably proportional intervals of the maximum sample signal intensity (depending on whether the pool is provided as a geometric series with a common ratio of 2, 3 or x, respectively, where n is the number of samples in the pool), it is in principle suitable.

The term "pooling" as used herein refers to combining or grouping samples together to the greatest advantage of the user. In particular, the term "pooling" refers to preparing a collection of multiple samples that represents one sample with a weighted value. Multiple samples are typically pooled into a single sample by mixing the samples. In the present invention, mixing requires careful weighting of the amounts of the individual samples, such that the amount of analyte contributed by each sample is well defined. When the amount of analyte in sample A is 2 g/L and the amount of analyte in sample B is 1 g/L, the samples are combined in a 1:6 volume ratio to provide a 1:3 ratio of analytes.
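The volume calculation in the example above (2 g/L and 1 g/L combined 1:6 by volume to give a 1:3 analyte ratio) can be sketched as follows. This is an illustrative helper with a hypothetical name; it assumes that the analyte concentrations are known and that all concentrations are expressed in the same unit.

```python
from fractions import Fraction


def pooling_volumes(concentrations, analyte_ratio):
    """Relative volumes that give the requested analyte ratio.

    Each volume is proportional to (target analyte share) / (concentration);
    the result is scaled so that the smallest volume equals 1.
    """
    if len(concentrations) != len(analyte_ratio):
        raise ValueError("one concentration is needed per sample")
    raw = [Fraction(r) / Fraction(c) for r, c in zip(analyte_ratio, concentrations)]
    smallest = min(raw)
    return [v / smallest for v in raw]


# Sample A at 2 g/L, sample B at 1 g/L, target analyte ratio 1:3.
volumes = pooling_volumes([2, 1], [1, 3])
print([int(v) for v in volumes])  # [1, 6]
```

Integer or `Fraction` inputs keep the arithmetic exact; decimal concentrations can be passed as strings such as `"0.5"`.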

When two samples are pooled, for example, at a ratio of 1:3, or when three samples are pooled at a ratio of 1:3:9 as described in embodiments of the present invention, the possible frequencies of the variants in the pool are set by intervals of 12.5% and 3.85%, respectively. The endpoints of these intervals are referred to herein as "result points" and correspond to step increments of the quantitative measurement up to the maximum sample signal intensity.

The term "geometric series" refers to a series in which the ratio between any two consecutive terms is the same. In other words, the next term in the series is obtained by multiplying the previous term by the same number each time. This fixed number is called the common ratio of the series. In the geometric series of the present invention, the first term is 1, and the common ratio is 2 or 3 according to the type of the sample.

The term "maximum sample signal intensity" refers to the signal from the pool when all pooled samples give a positive signal (i.e., when 100% of the individual samples are positive for the analyte tested). The maximum sample signal intensity may be determined by any suitable method. For example, 50 individual samples may be measured separately to determine the number of discrete events present in each of them, and these samples may then be measured in a pooling experiment, in which the measured signal intensity of the pooled sample is compared with the value obtained by summing the signal intensities of all individual samples in the same proportions.

The method of the invention can be performed with any number n of samples. In practice, however, the maximum number n is set according to the accuracy of the measuring method, that is to say the accuracy with which a reasonable statistical difference between two successive result points can be determined. The accuracy (standard deviation) of the method must be compatible with this.
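How the measurement accuracy limits the number of samples per pool can be illustrated with a short Python sketch. It rests on an assumption drawn from the worked examples in this description (6.25% for two samples, 1.9% for three): two adjacent result points are treated as distinguishable when the measurement error does not exceed half the distance between them. The function name is hypothetical, and a real assay would need a proper statistical criterion based on its standard deviation.

```python
from fractions import Fraction


def max_pool_size(max_error, ploidy: int = 2) -> int:
    """Largest n for which half the result-point step is >= max_error.

    max_error is the measurement error as a fraction of the maximum sample
    signal intensity (e.g. 0.0625 for 6.25%). For a common ratio
    x = ploidy + 1 the signal range is divided into x^n - 1 intervals.
    """
    if max_error <= 0:
        raise ValueError("max_error must be positive")
    x = ploidy + 1
    n = 0
    while True:
        intervals = x ** (n + 1) - 1          # intervals for a pool of n + 1 samples
        if Fraction(1, 2 * intervals) < max_error:
            break                             # n + 1 samples can no longer be resolved
        n += 1
    return n


print(max_pool_size(0.0625))  # 2
print(max_pool_size(0.019))   # 3
```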

Applications of the methods of the invention include, but are not limited to, genotyping methods. Genotyping based on pooled DNA has a variety of applications. Genotyping can be used for mapping, association studies and diagnostics in any species. Examples of specific genotyping applications include a) human genotyping, such as medical diagnostics, and follow-up individual genotyping after pooling in case-control studies; b) genotyping of livestock in candidate-gene approaches and genome-wide screening applications, such as QTL studies; and c) genotyping of plants, for example for profiling and association studies.

Pooling may also be used when sequencing humans, livestock, plants, bacteria, viruses. More specifically, when it is desired to compare sequences of two or more individuals, it is appropriate to pool individual samples for sequencing.

The method of pooling samples according to the present invention comprises taking a sub-sample from at least one first sample and taking a sub-sample from at least one second sample, wherein said first and second sub-samples are mixed in the same container to provide a mixture of the two sub-samples in the form of a pooled sample, wherein the ratio of said first sub-sample to said second sub-sample in said pooled sample is 1:3 or 3:1, depending on the concentration of the analyte as described herein. Similarly, when three samples are pooled (which term refers to the fact that three sub-samples are mixed), the resulting pooled sample has a ratio of the first, second and third sub-samples (in any order) of 1:3:9 as described herein. The possible frequencies of the variants in the pool are set by intervals of 12.5% and 3.85%, respectively. The endpoints of these intervals are referred to herein as "result points" and correspond to step increments up to the maximum sample signal intensity.

The pooling method defined herein may be performed by (using) a pooling device. Such a device should contain a sample collector for collecting and delivering a determined amount of sample, e.g. in the form of a determined (but variable) volume. A suitable sample collector is a pipettor, such as the automated sample delivery and handling systems commonly used in laboratories. Such automated systems are typically bench-top devices that contain one or more of a microplate processor station, a reagent station, a filter plate aspirator, and a pneumatics-based automated pipette module with disposable tips. These automated sample systems are well suited for carrying out the methods of the present invention, as they are fundamentally designed for combining different liquid volumes from different samples into one or more reaction tubes. It is therefore within the skill of the skilled person to apply such automated pipette systems to the task of combining different liquid volumes from different samples into one single pooled sample. However, such an automatic pipette system is just one suitable embodiment of a sample pooling device for pooling a plurality of samples into a pooled sample, said device comprising a sample collector for collecting samples from a plurality of sample vials and for delivering the samples to a single pooling vial to provide a pooled sample, and further comprising a processor for performing the method of pooling samples as defined herein. The term "processor" as used herein is meant to include any computing device in which one or more execution units (e.g., components including a pipetting device and a robotic arm that moves the pipetting device between a sample vial and a pooling vial of an automated pipetting system) are used to execute stored instructions and instructions retrieved from a memory or other storage device. The term "vial" shall be generic and may include reference to an analysis point on an array.
Thus, the processors of the present invention may include, for example, personal computers, mainframe computers, network computers, workstations, servers, microprocessors, DSPs, Application Specific Integrated Circuits (ASICs), portions or combinations thereof, and other types of data processors. The processor is arranged for receiving instructions of a computer program implementing the method of pooling samples of the present invention on a pooling device as defined above.

The method of pooling samples for analysis of a categorical variable, wherein the analysis involves quantitative measurement of an analyte, comprises providing a pool of n samples, the amounts of the individual samples in the pool being such that the analytes in the samples are present in a molar ratio of x⁰∶x¹∶x²∶...∶x^(n-1), wherein x is an integer of 2 or more which represents the number of classes of the categorical variable.

Although the merging method is very straightforward and can be expressed in relatively simple formulas, the method of analyzing the merged samples described herein is complex.

As described herein, a categorical variable (e.g., genotype) may take the value of one of several possible categories (BB, AB, AA). These categories correspond to the result intervals of the analysis. The category can be determined by performing quantitative measurements on a parameter (e.g., fluorescence) of the analyte (DNA) and assigning classes to the measured values based on a classification of the analysis results, each class representing one variant of the categorical variable (see fig. 7).

Overall, the total number of possible analysis results (outcomes) depends on the nature of the categorical variable. For example, in the case of diploid organisms, the ploidy level determines the number of possible analysis results. In general, the nature of the categorical variable may include the presence of a different number of variants or series of analytes in the sample (see also fig. 7). The total number of possible analysis results also depends on the number of different classification values that a single repeat can take. Table 1 provides an example of the number of possible analysis results.

TABLE 1 Total number of possible analysis results (outcomes) when the measurements are made of repetitions of the same event

n represents the number of possible classification values or variants for a repeat and k is the number of repeats within a sample. The values provided in the table are calculated according to the formula C(n+k-1, k) = (n+k-1)! / (k! (n-1)!), i.e. the binomial coefficient "n+k-1 choose k".

For example, diploid individuals (2 repeats of one allele in a sample) can have 3 genotypes (AA, AB and BB), because one allele can only have two different variants (A or B). Triploids (3 repeats of one allele) can have 4 different genotypes (AAA, AAB, ABB and BBB).

An individual's blood type is a single repeat with four different variants (A, B, AB or O).

The formula of Table 1 holds for the case where the order of the repeats of the measured variable is not important. For genotypes, for example, there is no difference between genotype AB and genotype BA. However, in the case where the identity of each repeat is important, the formula for calculating the total number of possible analysis results is n^k. This formula then replaces the formula C(n+k-1, k) of Table 1, and all values in the table change accordingly. For 2 repeats and 2 possible outcomes per repeat, there are 4 outcomes. For 3 repeats and 3 possible outcomes per repeat, there would be 27 different outcomes.
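
The two counting rules can be checked with a short sketch. It assumes that the Table 1 formula is the standard count of combinations with repetition, C(n+k-1, k), which reproduces the values 3, 4 and 4 quoted above for diploids, triploids and blood type. The function names are illustrative only and are not part of the described method.

```python
from math import comb

def outcomes_unordered(n, k):
    """Outcomes when the order of the k repeats is irrelevant: C(n+k-1, k)."""
    return comb(n + k - 1, k)

def outcomes_ordered(n, k):
    """Outcomes when the identity of each repeat matters: n to the power k."""
    return n ** k

# Diploid SNP: 2 variants (A, B), 2 repeats -> AA, AB, BB
print(outcomes_unordered(2, 2))  # 3
# Triploid: 2 variants, 3 repeats -> AAA, AAB, ABB, BBB
print(outcomes_unordered(2, 3))  # 4
# Blood type: 4 variants, 1 repeat
print(outcomes_unordered(4, 1))  # 4
# AB and BA counted separately
print(outcomes_ordered(2, 2))    # 4
```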

The total number of possible analysis results determines the pooling ratio (e.g., 1:3:9) and is referred to herein as the "pooling factor" (3 in the case of 1:3:9). For example, when haploid individuals are pooled for genotyping, there is one repeat, with 2 possible variants per repeat. In this case, the pooling factor is equal to 2 (being the number of results in Table 1).

Pooling 4 such individuals needs to be carried out in a ratio of 2^0 : 2^1 : 2^2 : 2^3.

When diploid individuals are pooled, the pooling factor is 3. Pooling 3 such individuals needs to be carried out in a ratio of 3^0 : 3^1 : 3^2.
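
As an illustration, the pooling ratio can be turned into pipetting volumes as sketched below. The sketch assumes that all samples have first been brought to the same analyte concentration (as is done for the DNA samples in Example 1), so that volume is proportional to the amount of analyte. The function names and the total volume are hypothetical.

```python
def pooling_ratio(factor, n_samples):
    """Relative amounts factor^0 : factor^1 : ... : factor^(n-1)."""
    if factor < 2 or n_samples < 1:
        raise ValueError("factor must be >= 2 and n_samples >= 1")
    return [factor ** i for i in range(n_samples)]

def pooling_volumes(factor, n_samples, total_volume):
    """Volume of each sample so that the pool adds up to total_volume."""
    parts = pooling_ratio(factor, n_samples)
    unit = total_volume / sum(parts)
    return [p * unit for p in parts]

print(pooling_ratio(2, 4))          # [1, 2, 4, 8]  (four haploid individuals)
print(pooling_ratio(3, 3))          # [1, 3, 9]     (three diploid individuals)
print(pooling_volumes(3, 2, 40.0))  # [10.0, 30.0]  (e.g. microlitres)
```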

The total number of results in the pool is then given by the following equation:

total number of pool results = (pooling factor)^(number of samples).

The increase in signal strength (increment) between successive results is equal to:

increment = 1/((pooling factor)^(number of samples) - 1) × 100%

or

increment = 1/(y × ((pooling factor)^0 + (pooling factor)^1 + (pooling factor)^2 + ... + (pooling factor)^(n-1))) × 100%,

where n is the number of samples and y is the pooling factor - 1.
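
The two expressions for the increment are the closed form and the expanded form of the same geometric series, since (f - 1) × (f^0 + f^1 + ... + f^(n-1)) = f^n - 1 for a pooling factor f. A minimal sketch with illustrative function names:

```python
def total_pool_results(factor, n_samples):
    """Number of distinguishable pool results: factor^n."""
    return factor ** n_samples

def increment_percent(factor, n_samples):
    """Signal step between successive results, closed form."""
    return 100.0 / (factor ** n_samples - 1)

def increment_percent_series(factor, n_samples):
    """Same step, written as y * (f^0 + f^1 + ... + f^(n-1)) with y = f - 1."""
    y = factor - 1
    return 100.0 / (y * sum(factor ** i for i in range(n_samples)))

print(total_pool_results(3, 2))        # 9
print(increment_percent(3, 2))         # 12.5
print(increment_percent_series(3, 4))  # 1.25
```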

If intensities are measured for all variants of a repeat (or for all variants but one, since the remaining one can then be calculated as 1 minus the other intensities), the first row of Table 1 may be followed, since each variant of the repeat can then be treated as present or absent, which corresponds to 2 possible outcomes per repeat. Consider the above example, but with 3 possible alleles instead of 2, so that 3 different light intensities can be measured instead of 2 (red and green).

If only a single measurement is made, table 1 may be followed.

As described herein, the methods of the invention for analyzing pooled samples comprise performing a measurement of a desired analyte on the pooled sample. After recording the measurement results (e.g. instrument signals), the analysis comprises a series of steps, which are explained in detail in the examples provided below.

In the analysis of a series of pooled samples obtained by the method of the invention, the analysis of the categorical variable of the samples involves a quantitative measurement of the analyte in the samples. The analyte is a chemical or physical substance or entity, a parameter of which indicates the presence or absence of at least one variant of said categorical variable. For example, when the genotype of an organism with variant alleles A or B is determined as the categorical variable, the analyte is the DNA, DNA probe or genetic marker of the organism, and the absolute value of the parameter of the analyte is directly related to the presence (or absence) of the variant. Quantitative measurements of analytes typically include fluorescence intensity, radioisotope intensity, or any other quantitative measurement of the value of an analyte parameter. A measurement value that exceeds a certain threshold or classification value generally indicates the presence of a variant. Thus, the quantitative measurement of an analyte in a sample is a signal by which the analyte indicates the presence or absence of a variant of the categorical variable analyzed in the sample.

Basically, in the method of analyzing a pooled sample obtained by the method of pooling samples described herein, the proportion of individual samples in the pool (i.e. the results of individual samples in the pool) is determined as follows.

The maximum sample signal intensity for a particular analysis "a" performed on a pool of n samples is first determined and set to 100% signal. The maximum sample signal intensity is the signal intensity attained when 100% of the n samples in the pool are positive for the categorical variable. It can be determined by providing a test pool of n positive reference samples and determining the measurement signal, wherein the positive reference samples are positive for the categorical variable, and wherein n is the number of samples in the pool on which the analysis "a" is performed. The maximum sample signal intensity for analysis "a" is recorded or stored in computer memory for later use. Next, by analysis "a", the analyte of interest is measured in the pooled sample obtained by the method of the present invention, thereby determining the pooled sample signal intensity of the analyte. The pooled sample signal intensity is recorded, stored as appropriate, compared to the maximum sample signal intensity, and rounded to the nearest result point. The comparison is suitably made as follows. In general, each possible measurement result is assigned the value 1/(y × (3^0 + 3^1 + 3^2 + ... + 3^(n-1))) × 100%, or a multiple thereof, where n is the number of pooled samples, y is an integer 2 representing the presence or absence of "a", and 100% is the maximum sample signal intensity. The expression y × (3^0 + 3^1 + 3^2 + ... + 3^(n-1)) is to be understood as y × (3^0 + 3^1 + 3^2 + ... + 3^i + ... + 3^(n-1)), where n is the number of samples and i is an increasing integer with a value between 2 and n-1. For example, for a categorical variable of 2 categories (marker absent or present) and a pool of 4 samples, with the maximum sample signal intensity set to 100% using 4 positive reference samples, there is a total of 2 × (3^0 + 3^1 + 3^2 + 3^3) = 2 + 6 + 18 + 54 = 80 result points, so that each possible measurement result can be assigned a value of 1/80 × 100% = 1.25% or a multiple thereof.

The result for each sample in the pool can be read from a simple results table (stored in computer-readable form in computer memory) in which, for every incremental step of 1/(y × (3^0 + 3^1 + 3^2 + ... + 3^(n-1))) × 100% between 0% and 100% of the maximum sample signal intensity, a value is assigned to each individual sample in the pool. An example of such a results table is provided in Table 2 below.

The analysis is done by assigning a categorical variable to each subsample in the merged sample.
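
Instead of a stored results table, the same assignment can be sketched arithmetically: a result point is an integer written in base "pooling factor", and each digit is the category of one sample. The sketch below is an illustration under the assumption that the categories are coded 0, 1, ..., factor-1 (for a diploid SNP, the number of copies of the measured allele) and that the first sample has the smallest share in the pool. The function name is hypothetical.

```python
def decode_pool(signal_percent, factor, n_samples):
    """Return one category code per sample, smallest pool share first.

    signal_percent is the pooled signal as a percentage of the maximum
    sample signal intensity (0..100).
    """
    steps = factor ** n_samples - 1               # number of increments
    point = round(signal_percent / 100.0 * steps)  # nearest result point
    point = max(0, min(steps, point))              # clamp noisy readings
    codes = []
    for _ in range(n_samples):
        point, code = divmod(point, factor)        # peel off base-factor digits
        codes.append(code)
    return codes

# Two diploid individuals pooled 1:3; 0 = CC, 1 = AC, 2 = AA
print(decode_pool(37.5, 3, 2))  # [0, 1]
print(decode_pool(62.5, 3, 2))  # [2, 1]
```

A reading that does not fall exactly on a result point, such as 36%, is assigned to the nearest one (here 37.5%).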

The method of analysing pooled samples as defined herein may be carried out by an analysis device. The analysis device of the present invention comprises a processor for analyzing a series of pooled samples obtained by the method of pooling samples described above, wherein the device is used to analyze the categorical variable of the samples and to perform a quantitative measurement of the analyte in the samples. As mentioned above, the unique property of this analysis device is that it determines the categorical variable for each individual sample in the pool from a quantitative measurement of the analyte in the pooled sample. Basically, the analysis device is arranged to measure the pooled samples, to analyze the measurement results obtained, and to derive from these results the categorical variable for each individual sample in the pool. Such a device should comprise a signal reading unit for measuring the analyte signal in the pooled sample. The analysis device should also contain a memory for storing the measurement results and the results table described above. The analysis device should further comprise a processor for retrieving information from the memory and/or the reading unit, for performing the calculations, and for performing an iterative process in which the measurement result of the pooled sample is compared, using the results table mentioned above, with the corresponding results of the individual samples in the pool and assigned to the respective individual samples; an input/output interface for entering sample information into the memory or processor; and a display coupled to the processor. The processor is arranged to receive the instructions of a computer program that implements the method of analyzing samples of the present invention on the analysis device described above.
The term "processor" as used herein refers to any computing device that executes instructions retrieved from a memory or other storage means using one or more execution units, such as a signal reading unit that receives a sample or pooled sample and measures an analyte therein by determining a signal for the analyte.

The analysis device of the present invention may further comprise a combining device of the present invention.

The present invention also provides a computer program product on its own or on a carrier, which when loaded and executed on a computer, a programmed computer network or other programmable apparatus, can carry out the method of merging samples as described above. Basically, the computer program product may be stored in a memory of the merging device of the invention and the processor may execute the program by providing the processor of the device with a series of instructions corresponding to the processing steps of the merging method.

The invention also provides a computer program product on its own or on a carrier, which when loaded and executed on a computer, a programmed computer network or other programmable device, can carry out a method of analysing a plurality of samples, the method comprising analysing a series of pooled samples obtained by the method of pooling samples described hereinabove, wherein a categorical variable of the samples is analysed, and involving carrying out a quantitative measurement of an analyte in the samples. Basically, the computer program product may be stored in a memory of the merging device of the present invention and the processor may execute the program by providing the processor of the device with a series of instructions corresponding to the individual processing steps of the analysis method. In a computer program product for performing an analysis, the method embedded in software instructions may further comprise the step of combining samples as described above.

The invention will now be illustrated by the following non-limiting examples.

Examples

Example 1

Example of genotyping a diploid individual sample for the presence of SNPs Using a standardized pool of 50 individuals

Step 1) separately detecting 50 individuals

For each SNP and each individual, we used two different fluorescent dyes in microarray format and obtained the intensity of red fluorescence (presence of the allele) and green fluorescence (absence of the allele). The ratio between the red and green intensities is not always exactly 1 (or 0) for homozygous animals or 0.5 for heterozygous animals.

The data of individual typing was used to calculate the correction factors for the signal intensity of all typed SNPs.

To obtain the most important correction factor (K), which is usually used to correct data representing any unequal efficiencies in alleles, we use signals from heterozygous genotypes. If no heterozygous genotype exists, we assume that the SNP being studied is not segregating in the population being studied, and therefore the results of that SNP in the pool should be ignored.

Ignoring SNPs because of the absence of heterozygotes among the 50 individual samples results in a loss of information on SNPs with a low MAF (minor allele frequency). For many applications (e.g. genome-wide selection) this has no impact, since SNPs with a very low minor allele frequency do not have a large effect on accuracy; it can therefore be decided either not to use the data on these SNPs or not to apply correction factors to them.

The first correction factor (K) we use is:

K = avg(Xraw/Yraw)

where Xraw is the measured intensity of red and Yraw is the measured intensity of green. This value was determined from the individually genotyped samples with genotype AB.

Instead of using the average results for all microbeads of one genotype, we can also use the results for all individual microbeads. Thus, we used the average results of Xraw and Yraw or X and Y from one sample, or we used the results of all individual microbeads of that sample.

Other correction factors are AAavg and BBavg. AAavg is the average of uncorrected allele frequencies for AA genotypes. This value is expected to be close to 1. BBavg is the average of uncorrected allele frequencies for the BB genotype. This value is expected to be close to 0. AAavg and BBavg were calculated using the following formulas:

AAavg＝(avg(Xraw/(Xraw+Yraw)))

and

BBavg＝(avg(Xraw/(Xraw+Yraw)))
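
The three correction factors can be computed per SNP as sketched below. The data layout (a list of genotype and raw-intensity tuples) and the function name are assumptions made for illustration, and the intensity values are invented. The sketch returns None for a factor whose genotype class is absent and leaves the choice of a default value to the caller.

```python
from statistics import mean

def correction_factors(samples):
    """samples: list of (genotype, x_raw, y_raw), genotype in {'AA', 'AB', 'BB'}."""
    def values(genotype, fn):
        return [fn(x, y) for g, x, y in samples if g == genotype]

    k_values = values("AB", lambda x, y: x / y)         # Xraw/Yraw, heterozygotes
    aa_values = values("AA", lambda x, y: x / (x + y))  # expected close to 1
    bb_values = values("BB", lambda x, y: x / (x + y))  # expected close to 0
    return {
        "K": mean(k_values) if k_values else None,
        "AAavg": mean(aa_values) if aa_values else None,
        "BBavg": mean(bb_values) if bb_values else None,
    }

# Invented intensities for one SNP
individuals = [
    ("AB", 1.0, 2.0),
    ("AB", 3.0, 4.0),
    ("AA", 9.0, 1.0),
    ("BB", 1.0, 9.0),
]
print(correction_factors(individuals))
# {'K': 0.625, 'AAavg': 0.9, 'BBavg': 0.1}
```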

Step 2) A test pool was constructed comprising all 50 individuals of step 1) above. For this purpose, the DNA concentration (ng/µl) in each individual sample was measured using a NanoDrop spectrophotometer (NanoDrop Technologies, USA). All DNA samples were then diluted to a standard concentration of 50 ng/µl before being combined into a single sample. In the test pool thus obtained, we estimated the allele frequencies either uncorrected or corrected with the correction factors obtained in the first step.

The uncorrected allele frequency for allele A was calculated as the red intensity divided by the sum of the two intensities, as follows:

Uncorrected allele frequency = Xraw/(Xraw + Yraw)

The first correction we applied to the allele frequency is:

Corrected allele frequency = Xraw/(Xraw + K × Yraw)

The second correction we applied is normalization:

Normalized allele frequency = (corrected allele frequency - BBavg)/AAavg
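
The three estimates translate directly into code. The following is a minimal sketch of the formulas above, with illustrative function names and invented intensities:

```python
def uncorrected_frequency(x_raw, y_raw):
    """Xraw / (Xraw + Yraw)"""
    return x_raw / (x_raw + y_raw)

def corrected_frequency(x_raw, y_raw, k):
    """Xraw / (Xraw + K * Yraw)"""
    return x_raw / (x_raw + k * y_raw)

def normalized_frequency(corrected, aa_avg, bb_avg):
    """(corrected allele frequency - BBavg) / AAavg"""
    return (corrected - bb_avg) / aa_avg

x_raw, y_raw = 2.0, 4.0                               # invented pool intensities
print(uncorrected_frequency(x_raw, y_raw))
corrected = corrected_frequency(x_raw, y_raw, k=0.5)
print(corrected)                                      # 0.5
print(normalized_frequency(corrected, aa_avg=1.0, bb_avg=0.0))  # 0.5
```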

For both calibration and normalization, we used all 3 genotypes for each SNP, separately from individual samples.

The order of accuracy of the estimated allele frequencies was: normalized (most accurate), corrected (intermediate) and uncorrected (least accurate).

This means that the correction factor K is set to 0.5 if there are no heterozygous individuals in step 1, and the correction factors AAavg and BBavg are set to 1 and 0, respectively, if there are no homozygous individuals.

Step 3) We compared the allele frequencies calculated from the individual typing with those based on the results in the test pool. From this we estimated a fourth-order polynomial with the actual result on the X-axis. Figure 1 shows the genotyping results for individuals tested individually and in the pool for nearly 18,000 SNPs. Genotyping was performed with the 18K Chicken SNP iSelect Infinium assay (Illumina Inc, USA), using SNPs evenly distributed throughout the chicken genome (van As et al, 2007). Detailed information on the assay, workflow and chips can be found on the Illumina website (http://www.illumina.com/pages.ilmnID=12).

Using this polynomial, we calculated the expected allele frequencies in the test pool for known individual frequencies of 0, 0.05, 0.1, 0.15, ..., 0.9, 0.95 and 1.

Referring to fig. 2, putting these results in the second graph together with the actual frequency on the Y-axis, we get the correction factor for the third correction step.

Referring to fig. 3, after applying these correction factors, the allele frequencies in the test pool showed a linear relationship with the actual frequencies.

In this experiment with about 18,000 SNPs, the allele frequencies measured in the test pool of 50 individuals (and corrected as above) were within ±6.25% of the results of individual typing.

For the application of the present invention, the first 3 steps are preferably performed before the actual analysis as "calibration", thereby improving the accuracy of the analysis. But these steps are not required to be performed every time. Then, the measurement calibration (if performed) is followed by the following steps:

Step 4) Pools of 2, 3 or n individuals were constructed at a ratio of 1:3, 1:3:9 or 3^0 : 3^1 : 3^2 : ... : 3^(n-1), respectively, and the pools were genotyped, with the signal intensities for red and green determined on the chip using the 18K Chicken SNP iSelect Infinium assay (see above).

Step 5) the allele frequency can be calculated from the signal intensities in the resulting pools by the correction factors obtained in step 1 and step 3.

For pools with two individuals, the predicted corrected frequencies should yield result points of 0%, 12.5%, 25.0%, 37.5%, 50.0%, 62.5%, 75.0%, 87.5% and 100%. Each measured frequency is rounded to the nearest result point. The genotypes of both individuals can then be obtained from the results shown in Table 2.

For pools with 3 individuals, frequencies are likewise rounded to the nearest result point, where the interval between result points is 3.85% (100/(3^3 - 1)), and so on for larger pools.

The smaller the spacing between successive result points, the higher the accuracy required for reading the intensity, so that a certain result is reasonably assigned to one of the result points. With the further development of genotyping technology, more accurate reads will become feasible.

For the case of 2 individuals in a pool, it may be decided to use only those SNPs for which the estimated and corrected allele frequency in the pool falls within ±6.25% of the actual frequency of the individuals (see red line in FIG. 3).

TABLE 2 Result points of the allele frequency for pooled samples and deduced genotypes of the two individuals in the pool, for SNPs with A and C alleles

Frequency of allele A in pooled sample (%)	Deduced genotype of individual 1 (1 part in the pool)	Deduced genotype of individual 2 (3 parts in the pool)
0	CC	CC
12.5	AC	CC
25	AA	CC
37.5	CC	AC
50	AC	AC
62.5	AA	AC
75	CC	AA
87.5	AC	AA
100	AA	AA
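
A results table of this kind can be generated for any number of diploid individuals, as sketched below. The sketch assumes a SNP with alleles A and C and a 1:3:9... pool; the names GENOTYPES, diploid_result_table and deduce_genotypes are hypothetical.

```python
from itertools import product

# Copies of allele A carried by a diploid individual -> genotype label
GENOTYPES = {0: "CC", 1: "AC", 2: "AA"}

def diploid_result_table(n_samples):
    """Map each result point (in %) to the genotypes of the pooled individuals."""
    weights = [3 ** i for i in range(n_samples)]   # 1 : 3 : 9 : ...
    max_score = 2 * sum(weights)                   # every individual is AA
    table = {}
    for copies in product(range(3), repeat=n_samples):
        score = sum(c * w for c, w in zip(copies, weights))
        table[100.0 * score / max_score] = [GENOTYPES[c] for c in copies]
    return table

def deduce_genotypes(freq_percent, n_samples):
    """Round a corrected pool frequency to the nearest result point."""
    table = diploid_result_table(n_samples)
    nearest = min(table, key=lambda point: abs(point - freq_percent))
    return table[nearest]

for point, genotypes in sorted(diploid_result_table(2).items()):
    print(point, genotypes)                # reproduces the nine rows of Table 2

print(deduce_genotypes(40.0, 2))           # ['CC', 'AC'] (nearest point 37.5)
```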

SNPs showing a difference of more than 6.25% between the pooled results and the individual results should be omitted if there is no other information to deduce the individual genotype (step 3).

Other information to deduce the genotype of an individual may come from the pedigree of the individual or from haplotype information for the family or population to which the individual belongs.

Depending on the repeatability of the calibration factor, a new assay with the same known detection conditions can skip steps 1, 2 and 3 entirely.

When following the method of example 1, significant savings are achieved by reducing the total number of samples that need to be analyzed, while still obtaining reliable results for the original individual samples. The general reduced total number of samples analyzed is exemplarily shown in table 3.

TABLE 3 saved number of samples analyzed when 2 or 3 individuals were pooled according to the method of the invention

Example 2

Example of genotyping diploid individual samples using 25 pools of standardized 2 individuals

Step 1) 50 individuals were tested individually as in step 1 of example 1.

Step 2) 25 pools were constructed with a 1: 3 ratio, 2 samples in each pool, which included all 50 individuals of step 1) above. In these pools, the allele frequencies are estimated uncorrected or based on the correction factors obtained in the first step.

Step 3) the sum of the allele frequencies of the 2 individual genotypes was compared to the estimated frequency in the pool with 2 individual samples. From these 25 points, a regression line was calculated. The regression coefficients and the intercept can then be used to correct the estimated frequencies of the other pools.
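
The regression step can be sketched with an ordinary least-squares line. Two assumptions are made here that the text does not spell out: the "sum" of the two individual frequencies is taken as the 1:3 weighted mean, and the expected frequency is regressed on the observed pool frequency so that the fitted line can be applied directly as a correction. All names and data are illustrative.

```python
def expected_pool_frequency(freq_1, freq_3):
    """Weighted allele frequency of a 1:3 pool (assumed weighting)."""
    return (1 * freq_1 + 3 * freq_3) / 4

def fit_line(observed, expected):
    """Least-squares slope and intercept of expected on observed."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(expected) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, expected))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def correct_frequency(observed_freq, slope, intercept):
    """Apply the regression line to the estimated frequency of another pool."""
    return slope * observed_freq + intercept

# Individual 1 is AA (1.0), individual 2 is AC (0.5)
print(expected_pool_frequency(1.0, 0.5))  # 0.625

# Invented calibration points (observed pool frequency, expected frequency)
slope, intercept = fit_line([0.1, 0.5, 0.9], [0.0, 0.5, 1.0])
print(correct_frequency(0.5, slope, intercept))
```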

Step 4) DNA pools of 2, 3 or n samples were then constructed at a ratio of 1:3, 1:3:9 or 3^0 : 3^1 : 3^2 : ... : 3^(n-1), respectively.

Step 5) The allele frequencies were calculated from the signal intensities obtained in the pools using the correction factors obtained in step 1 and step 3.

The reduced number of samples is consistent with the reduced number mentioned in table 8 for sequencing of diploid individuals.

Example 3

Examples of genotyping haploid Individual samples

When two haploid samples are pooled at a ratio of 1:2 and the presence of allele A at a certain position of the genome is measured, the expected ratios in the measurement (peak height, surface area, intensity) are as follows:

TABLE 4 Result points of the allele frequency for pooled samples and deduced genotypes of the two individuals in the pool, for SNPs with A and C alleles

Frequency of allele A in pooled sample	Deduced genotype of individual 1 (1 part in the pool)	Deduced genotype of individual 2 (2 parts in the pool)
0.00	C	C
0.33	A	C
0.67	C	A
1.00	A	A
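
For haploid samples the result points are simply the weighted share of allele A in a 1:2:4... pool. A minimal sketch, with a hypothetical function name and assuming alleles A and C:

```python
from itertools import product

def haploid_result_points(n_samples, alleles=("C", "A")):
    """List (frequency of allele A, alleles per individual), sorted by frequency."""
    weights = [2 ** i for i in range(n_samples)]   # 1 : 2 : 4 : ...
    total = sum(weights)
    rows = []
    for combo in product(range(2), repeat=n_samples):   # 1 means allele A
        freq = sum(c * w for c, w in zip(combo, weights)) / total
        rows.append((round(freq, 3), tuple(alleles[c] for c in combo)))
    return sorted(rows)

for freq, genotypes in haploid_result_points(2):
    print(freq, genotypes)
# 0.0 ('C', 'C')
# 0.333 ('A', 'C')
# 0.667 ('C', 'A')
# 1.0 ('A', 'A')
```

The same function can be called with a larger number of samples; the number of result points then doubles with every sample added.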

If only pools of two samples are used, no correction factor may be needed. When more samples are combined, a correction factor may be needed. It can be calculated from pools of 2 samples with equal amounts of analyte, which mimic heterozygous and homozygous diploid individuals.

When 3 samples are combined at a ratio of 1:2:4, the expected ratios in the measurement are as follows:

TABLE 5 Result points of the allele frequency for pooled samples and deduced genotypes of the three individuals in the pool, for SNPs with A and C alleles

Frequency of allele A in pooled sample	Deduced genotype of individual 1 (1 part in the pool)	Deduced genotype of individual 2 (2 parts in the pool)	Deduced genotype of individual 3 (4 parts in the pool)
0.000	C	C	C
0.143	A	C	C
0.286	C	A	C
0.429	A	A	C
0.571	C	C	A
0.714	A	C	A
0.857	C	A	A
1.000	A	A	A

Example 4

Application of the invention in sequencing test protocols

The merging method described in the present invention can be applied to a case where the sequence of 2 or more individuals needs to be determined.

Sequencing pooled individuals, templates or PCR products is not routine, because an important problem when analyzing a double-peak trace is that there are two bases at each position, and it is not possible to tell from which template each base came merely by examining the trace.

Besides deliberate pooling of templates, several biological or biotechnological situations are known to produce double-peak traces. These are observed in alternatively spliced regions of transcripts amplified by RT-PCR, in direct sequencing (without cloning), and in random insertional mutagenesis experiments.

Several methods have been described to trace back the haplotypes from pooled sequences or double-peak traces. Flot et al (2006) describe several molecular approaches proposed for finding individual haplotypes, for example sequencing of cloned PCR products (e.g., Muir et al, 2001), SSCP (single-strand conformation polymorphism) (Sunnucks et al, 2000), denaturing gradient gel electrophoresis (DGGE) (Knapp 2005), extreme dilution of DNA to the single-molecule level (Ding & Cantor 2003), and the use of allele-specific PCR primers (Petterson et al, 2003). Several computational methods for haplotype reconstruction from sequence mixtures have also been proposed.

However, all of the methods described are very expensive and time consuming and are only suitable for specific purposes (e.g. resequencing, alternative splicing, template or PCR amplification mixtures of two products of different sequence lengths, availability of reference genomic sequences) and not for standard direct sequencing of haploid or diploid samples or resequencing of completely unknown sequences.

Pooling of sequencing templates as described in the present invention is applicable to situations where identical sequence fragments can be obtained in both the individual and the pooled samples. This means that, for example, shotgun sequencing (random fragmentation) is not suitable for pooling.

In all the applications mentioned above, if templates are pooled on purpose, equal amounts of template (sample, DNA, RNA or PCR product) are pooled.

Herein, we describe pooling unequal amounts of templates. For this example, only the case where the pool is composed of 2 templates is described, but the invention can equally be used to construct pools of DNA (or post-PCR products) of 2, 3 or n individuals, in a ratio of 1:3, 1:3:9 or 3^0 : 3^1 : 3^2 : ... : 3^(n-1) for diploid organisms, and in a ratio of 1:2, 1:2:4 or 2^0 : 2^1 : 2^2 : ... : 2^(n-1) for haploid organisms.

The general condition to be met is that the sequencing equipment scans the template (e.g., fluorescence) and the resulting chromatogram represents the sequence of the DNA template as a regularly spaced, highly similar series of peaks.

Step 1) sequencing reactions of 50 individuals individually

The data of the individual sequencing reactions are used to calculate correction factors from the peak areas and peak heights for all base (or nucleotide) positions.

Step 2) sequencing reaction of 25 pools of 2 pooled individuals

The peak area ratio is used to distinguish the primary and secondary peaks at a base position from noise peaks. The secondary peak is a fraction of the primary peak, and a threshold is used to distinguish it from a noise peak.

The data of the pooled sequencing reactions were used to calculate correction factors from peak areas and peak heights at all base (nucleotide) positions.

Step 3) plotting the results of steps 1 and 2 and establishing a regression line (calculating regression coefficients and intercept).

Step 4) construction of a pool of DNA (or post-PCR products)

A pool of DNA from 2, 3 or n individuals was constructed at a ratio of 1:3, 1:3:9 or 3^0 : 3^1 : 3^2 : ... : 3^(n-1) for diploid organisms, or at a ratio of 1:2, 1:2:4 or 2^0 : 2^1 : 2^2 : ... : 2^(n-1) for haploid organisms.

Step 5) the base call (basecalling) can be calculated from the signal intensities obtained in the pool using the calibration factors obtained in step 1, step 2 and step 3.

In this example, only 2 possible nucleotides (A and C) at each base position are shown, but the same principles can be applied to other combinations of 2 of the 4 available nucleotides that underlie the genetic code. The average peak height of "A" nucleotides was set to 100, while the average peak height of "C" nucleotides was set to 75. Based on these peak heights, the relative peak heights for each possible combination of nucleotides in pools of two haploid templates are listed in table 6. The relative peak heights of pools consisting of two diploid templates are provided in table 7.
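
How relative peak heights of this kind might be derived is sketched below. The model is an assumption made for illustration and is not taken from the tables: each template is assumed to contribute to the peak of its nucleotide in proportion to its share of the pool, scaled by the average peak height of that nucleotide (100 for A, 75 for C), with a diploid template splitting its share equally over its two alleles. The names are hypothetical.

```python
# Average single-template peak heights stated in the example
AVG_PEAK_HEIGHT = {"A": 100.0, "C": 75.0}

def expected_peaks(templates, weights):
    """Expected peak height per nucleotide at one base position.

    templates: one string per pooled template, e.g. "A" (haploid) or "AC" (diploid)
    weights:   pool shares, e.g. [1, 2] for haploid or [1, 3] for diploid pools
    """
    total = sum(weights)
    peaks = {base: 0.0 for base in AVG_PEAK_HEIGHT}
    for template, weight in zip(templates, weights):
        for base in template:
            # assumed model: height scales linearly with the template share
            peaks[base] += AVG_PEAK_HEIGHT[base] * weight / (total * len(template))
    return peaks

print(expected_peaks(["A", "C"], [1, 2]))    # haploid pool 1:2
print(expected_peaks(["AC", "AA"], [1, 3]))  # diploid pool 1:3
```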

TABLE 6 deduced genotypes of the result points of allele frequencies and random positions in the nucleotide sequences of pooled and pooled haploid individuals

TABLE 7 deduced genotypes of the result points of allele frequencies and random positions in the nucleotide sequences of pooled and pooled diploid individuals

Comparing the present pooling method to the non-pooling case, table 8 shows the reduced number of sequencing reactions.

TABLE 8 reduced number of samples or sequencing reactions when 2 individuals were pooled following the method of the invention

Example 5

Example of genotyping diploid individual samples using an alternative calibration method, with one standard pool of 50 individuals and 25 pools of 2 individuals. This example describes several experiments.

Step 1) 50 individuals were tested individually.

As in step 1 of example 1, but with a different calibration method: normalized intensities X and Y were used instead of Xraw and Yraw.

A first correction factor (K) is calculated using X and Y:

K = avg(X/Y)

where X is the normalized intensity of allele A (red) and Y is the normalized intensity of allele B (green). This value was determined from the individually genotyped samples with genotype AB.

Other correction coefficients AAavg and BBavg are also based on X and Y. AAavg is the average of uncorrected allele frequencies for AA genotypes. This value is expected to be close to 1. BBavg is the average of uncorrected allele frequencies for the BB genotype. This value is expected to be close to 0. AAavg and BBavg were calculated using the following formulas:

AAavg＝(avg(X/(X+Y)))

and

BBavg＝(avg(X/(X+Y)))

All correction factors K, AAavg and BBavg can also be calculated from Xraw and Yraw, as in step 1 of example 1.

If there is no genotype AA in 50 individuals, AAavg is set to 1. Likewise, if there is no genotype BB, BBavg is set to 0.

The next step is to calculate the allele frequencies based on the individual typing of those SNPs in which all 50 individuals had a result.

Step 2) a pool of all 50 individuals from step 1 was constructed as in step 2 of example 1.

The uncorrected allele frequency for allele A was calculated as the normalized red intensity (X) divided by the sum of the two normalized intensities (X + Y):

Uncorrected allele frequency = X/(X + Y) (referred to as Raf)

The first correction we apply to the allele frequency is:

Corrected allele frequency = X/(X + K × Y) (referred to as Rafk)

If there is no heterozygous genotype, K cannot be calculated. In this case, the following rules can be applied:

If Raf < 0.1, Rafk is set to 0.

If Raf > 0.9, Rafk is set to 1.

In all other cases where K is missing, Rafk is set equal to Raf.
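
The fallback rules for a missing K can be written as a small function. This is a sketch with illustrative names, in which a missing K is represented by None:

```python
def raf(x, y):
    """Uncorrected allele frequency X / (X + Y)."""
    return x / (x + y)

def rafk(x, y, k=None):
    """Corrected allele frequency, with the fallback rules for a missing K."""
    if k is not None:
        return x / (x + k * y)
    uncorrected = raf(x, y)
    if uncorrected < 0.1:
        return 0.0
    if uncorrected > 0.9:
        return 1.0
    return uncorrected

print(rafk(2.0, 4.0, k=0.5))  # 0.5
print(rafk(5, 95))            # 0.0 (Raf = 0.05, no K available)
print(rafk(95, 5))            # 1.0 (Raf = 0.95, no K available)
print(rafk(1.0, 1.0))         # 0.5 (Rafk equals Raf)
```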

When starting with normalized intensities X and Y, it is not always necessary to use AAavg and BBavg for normalization correction. If starting with Xraw and Yraw, normalization using AAavg and BBavg can be applied as in step 2 of example 1.

If normalization is applied, the following formula is used:

Normalized allele frequency = (corrected allele frequency - BBavg)/AAavg (referred to as Rafn)

Step 3) We compared the expected allele frequencies calculated from the individual typing in step 1 with the observed (corrected or uncorrected) frequencies from the results in the pool of 50 individuals in step 2. We calculated the regression coefficients using the following model:

Expected allele frequency = b1 × observed frequency + b2 × (observed frequency)^2 + b3 × (observed frequency)^3 + b4 × (observed frequency)^4, without intercept.

Corrected frequencies (Rafk and Rafn) or uncorrected frequencies (Raf) are used as the observed frequencies in the above equations.
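
One way to estimate the coefficients b1 to b4 of the fourth-order model without intercept is ordinary least squares via the normal equations, as sketched below using only the standard library. The text does not state which fitting procedure was used, so this choice is an assumption; the function names and test data are illustrative. At least four distinct observed frequencies are needed, otherwise the system is singular.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (m[i][n] - tail) / m[i][i]
    return solution

def fit_quartic_no_intercept(observed, expected):
    """Least-squares b1..b4 for expected = b1*f + b2*f^2 + b3*f^3 + b4*f^4."""
    rows = [[f ** p for p in range(1, 5)] for f in observed]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    rhs = [sum(r[i] * y for r, y in zip(rows, expected)) for i in range(4)]
    return solve_linear_system(gram, rhs)

def predict(coefficients, observed_frequency):
    """Apply the fitted model to an observed frequency (Raf, Rafk or Rafn)."""
    return sum(b * observed_frequency ** p
               for p, b in enumerate(coefficients, start=1))

# Invented calibration data generated from known coefficients
known = [1.2, -0.5, 0.2, 0.1]
observed = [0.05 * i for i in range(1, 21)]
expected = [predict(known, f) for f in observed]
coefficients = fit_quartic_no_intercept(observed, expected)
print([round(b, 4) for b in coefficients])
```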

By comparing the expected allele frequencies with those predicted from the model, the best correction method (Rafk, Rafn or Raf) can be obtained.

Thereafter, the regression coefficients of the best correction method will be used to correct the allele frequencies of the pool of 2 individuals in step 5 a.

Step 4) 25 DNA pools of 2 individuals each were constructed from the 50 individuals in a 1:3 ratio. It should be recorded which individual in each pool was used once and which individual was used 3 times.

Step 5a) Correction of the results based on a pool of 50 individuals.

With the correction factors obtained in step 1 (K, AAavg and BBavg) and step 3 (regression coefficients b1, b2, b3 and b4), allele frequencies can be calculated from the signal intensities obtained in the pools constructed in step 4. Raf, Rafk or Rafn (whichever was the best correction method in step 3) is first calculated using the correction factors K, AAavg and BBavg of step 1.

Rafc, Rafkc or Rafnc is then calculated as

Expected allele frequency = b1 * observed frequency + b2 * observed frequency² + b3 * observed frequency³ + b4 * observed frequency⁴

where the observed frequency is Raf, Rafk or Rafn.

For two individuals in a pool, the predicted corrected frequency should fall on one of the result points 0%, 12.5%, 25.0%, 37.5%, 50.0%, 62.5%, 75.0%, 87.5% and 100%, and is rounded to the nearest result point. The genotypes of the two individuals can then be derived from the result using table 2 of example 1.
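Table 2 of example 1 is not reproduced in this section. The sketch below assumes that it follows directly from the 1:3 pooling ratio: the individual used once contributes 1/8 of the pool per copy of allele A, the individual used three times contributes 3/8 per copy, and each diploid individual carries 0, 1 or 2 copies. Under that assumption the nine result points map one-to-one onto genotype pairs. The function names are hypothetical.

```python
import math

GENOTYPE_BY_A_COUNT = {0: "BB", 1: "AB", 2: "AA"}


def nearest_result_point(frequency, n_alleles=8):
    """Round a frequency to the nearest multiple of 1/n_alleles (as a count)."""
    clipped = min(max(frequency, 0.0), 1.0)
    return math.floor(clipped * n_alleles + 0.5)


def decode_pair(frequency):
    """Return A-allele counts (used_once, used_three_times) for a 1:3 pool."""
    eighths = nearest_result_point(frequency)
    # eighths = count_once + 3 * count_thrice, with each count in {0, 1, 2}
    count_thrice, count_once = divmod(eighths, 3)
    return count_once, count_thrice


once, thrice = decode_pair(0.61)  # nearest result point is 62.5% (5/8)
print(GENOTYPE_BY_A_COUNT[once], GENOTYPE_BY_A_COUNT[thrice])  # AA AB
```

The decoding is a base-3 expansion of the number of eighths, which is why it matters to record which individual was used once and which three times.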

Step 5b) Correction of the results based on pools of 2 individuals.

Raf, Rafk and Rafn are calculated from the signal intensities of the pools constructed in step 4 and the correction factors K, AAavg and BBavg obtained in step 1.

Polynomial regression coefficients can be calculated from 20 of the pools using the same model as in step 3. The model can be applied to each SNP individually or to all SNPs together.

The allele frequencies in the other 5 pools are then predicted from these regression coefficients as:

Rafkc = b1 * Rafk + b2 * Rafk² + b3 * Rafk³ + b4 * Rafk⁴ (regression model from Rafk)

Rafnc = b1 * Rafn + b2 * Rafn² + b3 * Rafn³ + b4 * Rafn⁴ (regression model from Rafn)

Rafc = b1 * Raf + b2 * Raf² + b3 * Raf³ + b4 * Raf⁴ (regression model from Raf).

This can be repeated 5 times in such a way that all samples are used for prediction once. The expected allele frequencies in these pools were then compared to the predicted allele frequencies to find the best correction.
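The repeated 20/5 split can be sketched as a simple fold generator. The text does not specify how predicted and expected frequencies are compared; the mean absolute error used here is an assumption, as are the function names.

```python
def cross_validation_folds(pool_ids, n_folds=5):
    """Yield (training, held_out) splits; each pool is held out exactly once."""
    fold_size, remainder = divmod(len(pool_ids), n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + fold_size + (1 if fold < remainder else 0)
        held_out = pool_ids[start:stop]
        training = pool_ids[:start] + pool_ids[stop:]
        yield training, held_out
        start = stop


def mean_absolute_error(expected, predicted):
    """Average absolute deviation between expected and predicted frequencies."""
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)


pools = list(range(1, 26))  # 25 pools of 2 individuals
for training, held_out in cross_validation_folds(pools):
    # Fit the regression on `training` (20 pools), predict `held_out` (5 pools).
    print(len(training), held_out)
```

Running the loop once per correction method (Raf, Rafk, Rafn) and comparing the resulting errors is one way to select the best correction.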

In a pool with two individuals, the predicted corrected frequency should fall on one of the result points 0%, 12.5%, 25.0%, 37.5%, 50.0%, 62.5%, 75.0%, 87.5% and 100%, and is rounded to the nearest result point. The genotypes of the two individuals can then be derived from the result using table 2 of example 1.

Step 5c) Correction of the results based on pools of 2 individuals.

Another way of predicting the allele frequency is to use multiple linear regression coefficients per SNP for the signal intensities (X and Y, or Xraw and Yraw), based on the following model:

Expected allele frequency = b1 * X + b2 * Y

or

Expected allele frequency = b1 * Xraw + b2 * Yraw.

These multiple linear regression factors can be used to predict allele frequencies using the following formula:

Predicted allele frequency = intercept + b1 * X + b2 * Y

or

Predicted allele frequency = intercept + b1 * Xraw + b2 * Yraw.

As in step 5b, the multiple linear regression coefficients were calculated from 20 of the pools.

Allele frequencies of the other 5 pools were then predicted from these regression coefficients. This can be repeated 5 times in such a way that all samples are used for prediction once. The expected allele frequencies in these pools were then compared to the predicted allele frequencies to find the best correction.
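A minimal sketch of the multiple linear regression on the two intensities, including the intercept that appears in the prediction formula. It solves the normal equations with the standard library only; the function names and the example intensities are hypothetical and not taken from the experiments.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def fit_intensity_model(xs, ys, expected):
    """Least-squares fit of expected = intercept + b1 * X + b2 * Y."""
    design = [[1.0, x, y] for x, y in zip(xs, ys)]
    ata = [[sum(r[i] * r[j] for r in design) for j in range(3)]
           for i in range(3)]
    atb = [sum(r[i] * e for r, e in zip(design, expected)) for i in range(3)]
    return solve3(ata, atb)  # [intercept, b1, b2]


def predict_frequency(coefficients, x, y):
    """Predicted allele frequency for one pool and one SNP."""
    intercept, b1, b2 = coefficients
    return intercept + b1 * x + b2 * y


# Made-up training intensities and expected frequencies for one SNP.
xs = [0.9, 0.7, 0.5, 0.3, 0.1, 0.6]
ys = [0.2, 0.35, 0.6, 0.8, 1.0, 0.3]
expected = [0.1 + 0.8 * x - 0.1 * y for x, y in zip(xs, ys)]
coefficients = fit_intensity_model(xs, ys, expected)
print(predict_frequency(coefficients, 0.4, 0.7))
```

If X and Y are normalized so that they always sum to the same constant, the intercept and the two slopes are not separately identifiable and the system becomes singular; the sketch does not handle that case.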

As in steps 5a and 5b, the genotypes of the two individuals can be derived from the results using table 2 of example 1.

Step 6) DNA pools of 2 individuals were constructed from the other individual samples in a 1:3 ratio. As in step 4, it should be recorded which individual in each pool was used once and which individual was used 3 times.

We were able to derive genotypes from these pools using the best correction method for predicting allele frequencies as described and using table 2 of example 1.

Experiment 1

The method described in example 5 was applied to genome-wide SNP analysis using the Infinium assay BeadChip technology (Illumina Inc., USA).

50 individuals were genotyped using the 18K Chicken SNP iSelect Infinium assay (Illumina Inc, USA), in which the SNPs are evenly distributed throughout the chicken genome (van As et al, 2007). Detailed information on the assay, workflow and chip can be found on Illumina's website (http://www.illumina.com/pages.ilmnID=12).

To check whether the frequencies were estimated accurately, 8 alleles were combined into one pool (4 different animals out of the 50 individually genotyped individuals). Steps 1 through 3 and 5 of example 5 were performed, except that the predicted allele frequencies were not translated into genotypes using table 2.

In step 4, equimolar amounts of DNA from 4 individuals were pooled, rather than 2 individuals at a 1:3 ratio.

A 1:3 ratio of 2 different animals can be regarded as a combination of 8 alleles in one pool. Using equimolar amounts of 4 individuals also combines 8 alleles.

Thus, 12 pools were made, plus one pool of 50 animals as in step 1 (the same samples plus 2 additional samples were used in 4 pools). The 13 pools were then genotyped using a second batch of Infinium chips.

K, AAavg and BBavg were calculated for each SNP as in step 1 of example 5.

Uncorrected and corrected allele frequencies were then calculated for pools of 50 animals as in step 2 of example 5.

The polynomial regression coefficients were also calculated as in step 3 of example 5.

Further, as described in steps 5b and 5c, polynomial regression coefficients and multiple linear regression coefficients were calculated. This was done on the basis of 11 pools, and the regression factors were then used to predict the allele frequencies in the remaining pool.

In this experiment, multiple linear regression on X and Y (red and green intensities) yielded the best results. The final results are shown in FIG. 4 and Table 9.

A total of 4.6% of the allele frequencies fell into the wrong class.

For pools of 2 individuals pooled at a 1:3 ratio, a genotyping error of 3.0% occurred.

TABLE 9. Number of predicted allele frequencies per class compared with the expected allele frequencies. Numbers on the diagonal yield the correct genotypes. Off-diagonal allele frequencies inside the frame produce one genotype error. Other results produce 2 genotype errors.

Experiment 2

The method described in example 5 was applied to SNP analysis using the VeraCode assay technology (Illumina Inc.).

50 individuals were genotyped using a 96 chicken SNP VeraCode GoldenGate assay (Illumina Inc, USA), in which the SNPs were evenly distributed throughout the chicken genome (step 1). Detailed information on the assay, workflow and chip can be found on Illumina's website (http://www.illumina.com/pages.ilmnID=6).

A pool of all samples (step 2) and 24 pools of 2 individuals in a 1:3 ratio (step 4) were also constructed. The 25 pools were genotyped with a second batch of chemicals.

All corrections were made as described in step 1 to step 3 of example 5.

The correction of step 5a was applied to all 24 pools of 2 individuals using the polynomial regression factors obtained in step 3.

For steps 5b and 5c, we used 23 pools at a time to calculate the regression factors (polynomial regression factors in step 5b and multiple linear regression factors in step 5c) and predicted the allele frequencies of the remaining pool. This was performed 24 times in total, so that every pool was used once for prediction.

The best results were obtained using Rafk (calculated from the normalized values X and Y), followed by correction with the polynomial regression factors of step 5b, which resulted in Rafkc.

A total of 84 SNPs were called in the individuals, although certain SNPs were not called in certain individuals. In total there were 1906 complete pool-by-SNP combinations.

TABLE 10. Number of predicted allele frequencies per class compared with the expected allele frequencies. Numbers on the diagonal yield the correct genotypes. Off-diagonal allele frequencies inside the frame produce one genotype error. Other results produce 2 genotype errors.

There were a total of 138 mismatches (138/1906 × 100 = 7.2%; table 10). Since each observation consisted of 2 individual samples, this resulted in 174 genotype errors (170/(1906 × 2) × 100 = 4.46%); see table 11, fig. 5 and fig. 6.

The procedure used to determine the best correction method in this example (as performed in step 3 and steps 5a, 5b or 5c of example 5) also provides information on the number of mismatches per SNP. This allows error-prone SNPs to be removed from the series, reducing the risk of error at the expense of a reduced call rate.

TABLE 11. Number of correctly predicted genotypes

Experiment 3

The method of example 5 was applied to SNP analysis using other genotyping methods.

In addition to the methods described in experiment 1 and experiment 2, the method described in example 5 can also be used with any other genotyping method, such as Affymetrix GeneChips (Affymetrix Inc, USA) or the platforms of Agilent Technologies.

Example 6

The invention is applied to a sequencing protocol as in example 4, but using other calibration methods

Step 1) 50 individuals were each subjected to an individual sequencing reaction.

The peak heights of allele 1 and allele 2 were used as the Xraw and Yraw values, or the relative peak heights were used as X and Y.

The relative peak height of allele 1 is X = Xraw/(Xraw + Yraw), and the relative peak height of allele 2 is Y = Yraw/(Xraw + Yraw).
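The conversion from raw peak heights to relative peak heights can be sketched as below. The function name and the example peak heights are hypothetical; raising an error when both peaks are zero is an assumption about how a failed read might be handled.

```python
def relative_peak_heights(x_raw, y_raw):
    """Return (X, Y): each allele's peak height as a fraction of the total."""
    total = x_raw + y_raw
    if total == 0:
        raise ValueError("both peak heights are zero; no signal at this position")
    return x_raw / total, y_raw / total


x, y = relative_peak_heights(300.0, 100.0)  # illustrative peak heights
print(x, y)  # 0.75 0.25
```

By construction X + Y = 1, so for relative peak heights the uncorrected frequency Raf = X/(X + Y) equals X.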

K, AAavg and BBavg were then calculated in the same way as for genotyping in step 1 of example 5.

Step 2) A sequencing reaction was performed on a pool of all 50 individuals.

Uncorrected and corrected allele frequencies were calculated as in step 2 of example 5.

Step 3) The frequencies from the individual sequencing were compared with the frequencies calculated from the pool.

The same model as in step 3 of example 5 was used to obtain the polynomial regression coefficients.

Step 4) 25 sequencing reactions were performed on pools of 2 individuals.

Step 5a) The corrected frequencies were compared with the expected frequencies based on the pool of all 50 individuals, yielding the best method.

Step 5b) Rafnc, Rafkc and Rafc in 5 pools of 2 individuals were calculated using the polynomial regression factors obtained from the other 20 pools, using the model of step 5b of example 5.

Step 5c) The predicted allele frequencies of 5 pools of 2 individuals were calculated using the multiple linear regression coefficients obtained from the other 20 pools, using the following model:

Predicted allele frequency = intercept + b1 * X + b2 * Y

or

Predicted allele frequency = intercept + b1 * Xraw + b2 * Yraw

The best correction method is determined from step 3 and step 5 by repeating steps 5b and 5c several times, in such a way that all pools are used once for prediction of allele frequencies (validation).

Other numbers of pools may be used for validation if desired. For example, 24 pools can be used to obtain the regression factors, which are then used to predict the remaining pool.

In that case a total of 25 repetitions is required.

With the best correction method and the required correction and regression factors, the allele frequencies of new pools can be predicted and the genotypes read from table 2.

Claims

1. A method of pooling samples for analysis of a categorical variable, wherein the analysis involves quantitative measurement of an analyte, the method of pooling samples comprising providing a pool of n samples, wherein the amount of individual samples in the pool is such that the analyte in the samples is present in a ratio of x⁰:x¹:x²:x^(n-1), wherein x represents the number of classes of the categorical variable, which is an integer of 2 or more, and wherein the analyte is a biomolecule and the categorical variable is a variable of the biomolecule.

2. The method of claim 1, wherein the biomolecule is a nucleic acid.

3. The method of claim 2, wherein the variable is a nucleotide polymorphism of the nucleic acid.

4. The method of claim 3, wherein the nucleotide polymorphism is an SNP.

5. The method of claim 2, wherein the variable is a base signature of a particular nucleotide position.

6. The method of any one of the preceding claims, wherein the quantitative measurement comprises a measurement of intensity, peak height, or peak area of an instrument signal.

7. The method of claim 6, wherein the instrument signal is a fluorescent signal.

8. Use of the method according to any one of claims 1 to 7 for genotyping an allelic variable of a haploid or polyploid individual, wherein the number of classes (x) of the categorical variable is equal to p +1, wherein p represents a ploidy level.

9. Use according to claim 8, wherein x is 3 for genotyping of allelic variables in diploid individuals.

10. A method of performing an analysis on a plurality of samples, comprising pooling the samples according to the method of any one of claims 1-7, thereby providing pooled samples and performing the analysis on the pooled samples.

11. A method of analyzing a plurality of samples, comprising analyzing a series of pooled samples obtained by the method of any one of claims 1-7, wherein a categorical variable of the samples is analyzed and relates to a quantitative measurement of an analyte in the samples.

12. The method of claim 11, further comprising inferring, from the measurements, a contribution of the individual samples in the sample pool.