WO2005075676A1 - Sequence-specific dna analysis - Google Patents

Publication number
WO2005075676A1
Authority
WO
WIPO (PCT)
Prior art keywords
pooled
dna
nucleic acid
pcr
sequence
Prior art date
Application number
PCT/GB2005/000176
Other languages
French (fr)
Inventor
Jiahui Raw Drew Zhu
Original Assignee
Selecgen Limited
Application filed by Selecgen Limited

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6827: Hybridisation assays for detection of mutation or polymorphism
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/6858: Allele-specific amplification

Definitions

  • the invention relates to a method of analysing DNA sequence variation. Particularly, it relates to methods using pooled PCR products and is particularly applicable as a sequencing procedure for screening DNA variation in specific genes or genomic regions or at controlled density in the whole genome.
  • the invention has application in the discovery of SNPs (single nucleotide polymorphisms) in organisms including humans and subsequent screening for those SNPs.
  • SNPs: single nucleotide polymorphisms
  • the first step of SNP discovery there are four methods: Shotgun Sequencing (see e.g. J. Craig Venter, et al. 2001. Science 291 :1304-1315), Reduced Representation Shotgun Sequencing (RRS) (see e.g. D. Altshuler, et al. 2000. Nature 407: 513-516), Genome Partitioning (GeP) and sequencing of individual PCR (polymerase chain reaction) products.
  • the first three methods comprise high throughput sequencing approaches, and involve cloning and routine sequencing.
  • the drawback of these methods is that the chromosomal location of the search for SNPs has a random basis. The methods cannot focus on any particular group of genes or a series of particular chromosome regions.
  • the fourth method, individual PCR, can focus on particular regions of the chromosome or genes based on sequences. However, it loses the high throughput feature for three reasons: (i) the sequencing of individual PCR products has to be carried out individually; (ii) the quality of sequence data obtained is generally poor; (iii) the PCR reaction has to be optimised to produce a single PCR product. This is difficult and time-consuming, and failure rates are high. At the present time, however, this method of sequencing individual PCR products with a common sequencing primer addition is the only method available to find new SNPs in a specific region or within specific genes.
  • For the second step, SNP validation, re-sequencing is the most common method employed. The procedure is similar to that described above for the individual PCR sequencing and is subject to the same constraints and disadvantages.
  • For SNP genotyping or screening, many methods are available in the art, including MassARRAY, Pyrosequencing, TaqMan and SNaPshot.
  • the invention provides a method for sequence-targeted analysis of one or more desired regions of interest within one or more nucleic acid samples comprising the steps of:
  • the amplification step, (b), is carried out using a plurality of primer pairs within the same PCR reaction.
  • the analysis step, (f) comprises the determination of the sequence of DNA, in a plurality of cells isolated from the nucleotide library, derived from the initial nucleic acid sample.
  • a common tail sequence of bases is added to the 5' end of each primer defining the region or regions of interest. In this way, the common sequence will be present on each end of the amplified fragments; this improves the cloning efficiency of the amplified sequences when creating the library.
  • the common tail sequences comprise fewer than six bases. In this way, the possibility of introducing unintentional restriction sites is reduced.
  • the common tail sequence comprises 5' GAT, 5' GAC or 5' AGA; it has been found that these sequences are particularly effective at improving cloning efficiency, especially when using the vector sold under the Registered Trade Mark "pGem", by the Promega Corporation, USA, and other "blunt end" vector systems. Using these tail sequences the uniformity of the amplified DNA fragments was improved; the number of dominant sequences and the number of missing sequences was reduced threefold.
  • the nucleic acid samples that form the starting point for the method may be genomic DNA samples (e.g. taken from animal, bacterial or plant cells) or may be cDNA samples.
  • the new methodology can find and use lindels that other methods cannot and these lindels potentially cause more severe clinical consequences than SNPs.
  • the new methodology can target both unknown and published SNPs while other methods only use the published markers, which means that they will overlook more than 60% of gene variations: In each population more than 60% of gene variations are new.
  • the new methodology allows the targeting of a particular region suspected of disease association whereas other methods must scan the whole genome. Thus, the new methodology improves efficiency.
  • the new methodology is very flexible in project size, from ten genes to hundreds of genes, while the widely used method of genome scanning needs to scan the whole genome requiring orders of magnitude greater experimental effort.
  • Step 1 Collect DNA samples from each individual concerned, within the or each population.
  • Step 2 Combine the DNA samples within each population to obtain m pooled population DNA samples.
  • Example A For the case of example B in Step 1, the 50 samples of DNA from the first population would be pooled together to form a first pooled population DNA sample, and the 65 samples from the second population would similarly be pooled together to form a second pooled population DNA sample.
  • Step 3 For each of R regions of interest within the genome, construct a pair of PCR primers, each pair of primers defining the extremities of the regions of interest.
  • primer pairs, i.e. one for each strand of the double-stranded DNA
  • Methods of constructing primer pairs for amplification of specific regions of DNA will be readily apparent to the skilled addressee.
  • a sequence of around 20 nucleotide bases will be used for each primer.
  • Step 4 Amplify the regions of interest in each of the m pooled DNA samples using the R pairs of primers, by either:
  • Example A In the embodiment to be described in more detail below (Embodiment I), a routine PCR procedure was employed, using a proprietary kit (High Fidelity Taq, Norvatis). In this case, the initial 15 PCR reactions were carried out independently.
  • Example C If all the sequences of interest require similar PCR reaction conditions, then the procedure in Step 4(c) may be used, combining all the primer pairs and running just one PCR reaction.
  • Example D If it is known that there is likely to be a high degree of similarity, or homology, between DNA sequences in different, but perhaps related species, then a single primer may be designed for use across a number of species. For example, a primer may be designed using knowledge of the rice genome, and used to probe other cereals.
  • Step 5 If more than one PCR reaction was performed on any of the pooled DNA samples, then either:
  • Example A The quantity of amplified sequence produced by the PCR reaction is often dependent on experimental conditions such as the quality and quantity of DNA polymerase used, the affinity of the primers for the target sequences, the length of the target sequence and the concentration of nucleotide bases. Thus, in order to optimise the following step of library creation, it is preferable to ensure that similar amounts of PCR-produced DNA are combined. PCR products may readily be quantified (e.g. by the intensity of bands appearing on an electrophoresis gel), and this may be used to normalise the concentrations. This is the rationale for Step 5(b).
  • Step 6 Create a library from the pooled PCR product for each of the populations.
  • Methods for creating "libraries" from mixed populations of DNA fragments are well-known to those skilled in the art of molecular biology.
  • a number of variants of library-creation are known, each of which may have particular benefits in the application of this overall methodology.
  • Example A In the embodiment to be described in more detail below (Embodiment I), the DNA fragments were cloned into a well-known vector (sold under the Registered Trade Mark "pGem" by the Promega Corporation, USA) and used to transform a culture of E. coli.
  • the pGem vector contains a LacZα coding region allowing selection of recombinant organisms by detection of the blue/white colour change in the presence of X-Gal (5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside).
  • the vector also contains T7 and SP6 promoters that serve as sequencing primer binding sites. Other vectors are available (and will doubtless be developed) for construction of DNA libraries, and alternative flanking sites may be designed to facilitate subsequent processing and analysis of the library.
  • the library created by Step 6 has a number of advantageous features:
  • Each positive integrant clone in the library contains one copy of the genetic sequence from one of the regions of interest.
  • Each of these sequences may be optionally flanked by a universal sequencing primer, introduced by suitable choice of vector for the library creation. This leads to much improved sequencing quality, when the library is subsequently used for sequencing studies.
  • the library is free from background DNA that would normally be present in the direct PCR product.
  • the library is free from excess primer pairs that would normally be present in the direct PCR product.
  • the library is free from unpolymerised nucleotide bases that would normally be present in the direct PCR product.
  • the library contains a spectrum of genetic sequences in the same ratio as that in the original population.
  • the library may be used to determine allelic frequencies within its source population.
  • comparison of allelic frequencies between libraries from different populations provides a measure of the differing allelic frequencies between the populations themselves.
  • determination of the genetic source of population-dependent phenotypic traits is facilitated.
  • Each gene sequence from the region(s) of interest within the library is present in the same genetic background, and within the same construct. As a result, optimal analysis methodologies (e.g. for sequencing) will be substantially the same for each sequence and so such a library provides an ideal starting point for automated analysis.
  • Step 7 Analyse clones taken from the library.
  • Example A In the embodiment to be described in more detail below (Embodiment I), from each of the three libraries constructed, 200 white colonies (i.e. positive transformants) were selected and sequenced, using the flanking regions as sequencing primers. Methods for sequencing DNA from such libraries are known in the art. The sequences from each library were assembled using the GAP4 program from the Staden software suite (MRC Laboratory of Molecular Biology, Cambridge, UK).
  • the method has particular advantages in revealing the association between sequence variation and phenotypic variation. This is particularly useful in assessment of treatment efficacy. Additionally, the method has application in the discovery of SNPs, their validation and association with phenotypic traits.
  • This embodiment describes the use of the general methodology for the discovery of DNA sequence variations and allele frequency in a Crohn's Disease patient population and an ulcerative colitis patient population by comparison with a control population of healthy individuals.
  • the DNA concentration in each sample was 0.75 g/litre.
  • the first pooled sample contained DNA from patients with Crohn's Disease; the second contained DNA from patients with ulcerative colitis; the third sample contained DNA from healthy volunteers.
  • Each pair of primers was designed to amplify 800 to 1200 base pair fragments.
  • a routine PCR procedure (high fidelity Taq, Norvatis) was carried out separately for each pair of primers. Accordingly, 15 PCR reaction products were obtained from each of the three pooled DNA samples.
  • PCR products from each of these reactions were examined using gel electrophoresis. Two pairs of primers prepared from the STAT6 gene failed to produce PCR products, and so were excluded from the study (i.e. R was reduced to 13). Two pairs of primers produced multiple fragments (two bands and three bands) and were treated as normal PCR products. Inspection of the density of bands on the electrophoresis gel allowed the reactions to be divided into three categories, corresponding to the approximate concentration of the PCR products. This allowed normalisation of the quantity of DNA fragments from each reaction.
  • Each of these libraries would therefore contain at least 16 fragments (i.e. 13 reactions, 11 with one band, one with two bands and one with three bands) from its corresponding pooled DNA sample, this sample itself containing all the DNA variation from the population used to construct it.
  • each library could contain more than 16 fragments, as additional fragments with sequence lengths similar to those expected would not have been readily apparent on the electrophoresis gel.
  • 50 pairs of primers were designed representing 50 locations to cover a region of chromosome 12, between microsatellite markers D12S368 and D12S1632. These markers are at positions 50947746 and 66834816 on chromosome 12. This region also contains the microsatellite markers D12S1632, D12S910 and D12S83.
  • the number of amplified unique DNA fragments is subject to uncertainty, for two possible reasons. Firstly, some genes may have been present in more than one copy, and each copy may have a unique sequence, i.e. the same primer pair would result in more than one amplified fragment. In this case, the number of unique DNA fragments will depend on a number of factors, such as how conserved the primer sequence is, and how stringent the PCR reaction is. Secondly, some primers may have been purposely designed to be relatively non-specific. This may be the case when using sequence information from one species to probe the genome of a different, but closely related, species. In this instance, therefore, the complexity of the amplified DNA pool (and therefore the DNA library itself) will be difficult to predict. Thus, in both these cases, there is a requirement to test or measure the complexity of the DNA pool from the sequencing results. The benefits of this include aiding the decision on how many colonies should be sequenced from the library to ensure good coverage of the genes to be sequenced.

Abstract

A method is provided for sequence-targeted analysis of one or more desired regions of interest within one or more nucleic acid samples comprising the steps of: combining the plurality of nucleic acid samples, if there is more than one, to form a pooled population nucleic acid sample; amplifying the desired region(s) of interest from within the nucleic acid sample by use of the polymerase chain reaction using one or more pairs of primers; optionally normalising the concentration of resultant PCR products; combining them to form a pooled PCR product; creating a nucleotide library from the pooled PCR product; and analysing individual clones isolated from the library.

Description

SEQUENCE-SPECIFIC DNA ANALYSIS
Field of the Invention
The invention relates to a method of analysing DNA sequence variation. Particularly, it relates to methods using pooled PCR products and is particularly applicable as a sequencing procedure for screening DNA variation in specific genes or genomic regions or at controlled density in the whole genome. The invention has application in the discovery of SNPs (single nucleotide polymorphisms) in organisms including humans and subsequent screening for those SNPs.
Background and Review of the Art known to the Applicant
DNA variation, particularly in the form of single nucleotide polymorphisms (SNPs), has been widely used as an important molecular marker in medical and agricultural applications. The current methodology for use of SNPs comprises three steps. Firstly, the discovery of SNPs within a population. Secondly, the validation of the discovered SNP candidates in relation to their genuine ability to identify phenotypic variations. Thirdly, the development of SNP assays and the use of these to test samples of interest.
For the first step of SNP discovery, there are four methods: Shotgun Sequencing (see e.g. J. Craig Venter, et al. 2001. Science 291 :1304-1315), Reduced Representation Shotgun Sequencing (RRS) (see e.g. D. Altshuler, et al. 2000. Nature 407: 513-516), Genome Partitioning (GeP) and sequencing of individual PCR (polymerase chain reaction) products. The first three methods comprise high throughput sequencing approaches, and involve cloning and routine sequencing. The drawback of these methods is that the chromosomal location of the search for SNPs has a random basis. The methods cannot focus on any particular group of genes or a series of particular chromosome regions. The fourth method, individual PCR, can focus on particular regions of the chromosome or genes based on sequences. However, it loses the high throughput feature for three reasons: (i) the sequencing of individual PCR products has to be carried out individually; (ii) the quality of sequence data obtained is generally poor; (iii) the PCR reaction has to be optimised to produce a single PCR product. This is difficult, time-consuming and failure rates are high. At the present time, however, this method of sequencing individual PCR products with a common sequencing primer addition is the only method available to find new SNPs in a specific region or within specific genes.
For the second step, SNP validation, re-sequencing is the most common method employed. The procedure is similar to that described above for the individual PCR sequencing and is subject to the same constraints and disadvantages.
For the third step, SNP genotyping or screening, many methods are available in the art, including MassARRAY, Pyrosequencing, TaqMan and SNaPshot.
The problem remaining is that there is currently no high throughput method for finding DNA variation in particular genes or regions. Although a few million SNPs have been found in the human genome, and a SNP haplotype map (HapMap) has been completed for the human genome, 60% of SNPs discovered in new populations are unknown. Thus, using just known SNPs, and assays developed from these, will result in failure to discover entirely novel SNPs. Thus, the process of finding new SNPs in both medical and agricultural studies continues because these new SNPs are crucial in finding the function of genes.
Furthermore, existing methods cannot find large insertions or deletions ("lindels") in genes, nor previously unknown SNPs in particular patient groups. They can only use the published SNP markers identified earlier by studies in random healthy populations. However, such large insertions/deletions (lindels) and unknown SNPs are more relevant to disease than the published SNP markers found in healthy populations. The current methods overlook these important gene variations.
It is an object of the current invention to provide a method of discovery and analysis of DNA sequence variation that combines high throughput and gene specificity.
Summary of the Invention
In its broadest aspect, the invention provides a method for sequence-targeted analysis of one or more desired regions of interest within one or more nucleic acid samples comprising the steps of:
(a) combining the plurality of nucleic acid samples - if there is more than one - to form a pooled population nucleic acid sample;
(b) amplifying the desired region(s) of interest from within the one - or the pooled - nucleic acid sample by use of the polymerase chain reaction (PCR), said PCR reaction using one or more pairs of primers, said pair or pairs of primers defining said region or regions of interest;
(c) optionally normalising the concentration of PCR products resulting from said amplification step;
(d) combining said PCR products to form a pooled PCR product;
(e) creating a nucleotide library from said pooled PCR product;
(f) analysing individual clones isolated from said nucleotide library.
Preferably, the amplification step, (b), is carried out using a plurality of primer pairs within the same PCR reaction. More preferably, and in any aspect of the invention, the analysis step, (f), comprises the determination of the sequence of DNA, in a plurality of cells isolated from the nucleotide library, derived from the initial nucleic acid sample. Preferably, in any aspect of the invention, a common tail sequence of bases is added to the 5' end of each primer defining the region or regions of interest. In this way, the common sequence will be present on each end of the amplified fragments; this improves the cloning efficiency of the amplified sequences when creating the library. More preferably, the common tail sequences comprise fewer than six bases. In this way, the possibility of introducing unintentional restriction sites is reduced. In a particularly preferred embodiment, the common tail sequence comprises 5' GAT, 5' GAC or 5' AGA; it has been found that these sequences are particularly effective at improving cloning efficiency, especially when using the vector sold under the Registered Trade Mark "pGem", by the Promega Corporation, USA, and other "blunt end" vector systems. Using these tail sequences, the uniformity of the amplified DNA fragments was improved; the number of dominant sequences and the number of missing sequences were each reduced threefold.
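The tail design described above can be illustrated with a minimal Python sketch. It prepends one of the disclosed tails (GAT, GAC or AGA) to a primer and reports whether doing so creates a restriction site that the bare primer did not contain. The primer sequence, the function names and the three-enzyme screening panel are assumptions chosen for illustration; none of them is taken from the specification.

```python
# Illustrative only: the example primer and the enzyme panel are hypothetical.
COMMON_TAILS = ("GAT", "GAC", "AGA")  # tails named in the text

# Well-known recognition sites, used here as a small example screening panel.
RESTRICTION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def add_tail(primer: str, tail: str) -> str:
    """Return the primer (written 5'->3') with the tail added at its 5' end."""
    if len(tail) >= 6:
        raise ValueError("the text prefers tails of fewer than six bases")
    return tail.upper() + primer.upper()


def sites_introduced(primer: str, tail: str) -> list[str]:
    """Enzymes whose site occurs in the tailed primer but not in the bare primer."""
    tailed = add_tail(primer, tail)
    bare = primer.upper()
    return [
        name
        for name, site in RESTRICTION_SITES.items()
        if site in tailed and site not in bare
    ]


if __name__ == "__main__":
    primer = "ATTCGGCTAGCTAGGCTAAC"  # hypothetical 20-base primer
    for tail in COMMON_TAILS:
        print(tail, add_tail(primer, tail), sites_introduced(primer, tail))
```

With this hypothetical primer, the AGA tail happens to complete an EcoRI site across the tail/primer junction, which is the kind of unintended site a short tail is meant to make less likely; a real design would screen against whichever enzymes matter for the chosen vector.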
The nucleic acid samples that form the starting point for the method may be genomic DNA samples (e.g. taken from animal, bacterial or plant cells) or may be cDNA samples.
Whilst each individual step of the current invention is known in itself, the unique combination of steps disclosed herein provides unexpected benefits for DNA analysis, which are discussed in more detail below. In summary, however:
(i) The new methodology can find and use lindels that other methods cannot, and these lindels potentially cause more severe clinical consequences than SNPs.
(ii) The new methodology can target both unknown and published SNPs while other methods only use the published markers, which means that they will overlook more than 60% of gene variations: in each population more than 60% of gene variations are new.
(iii) The new methodology allows the targeting of a particular region suspected of disease association whereas other methods must scan the whole genome. Thus, the new methodology improves efficiency.
(iv) The new methodology is very flexible in project size, from ten genes to hundreds of genes, while the widely used method of genome scanning needs to scan the whole genome, requiring orders of magnitude greater experimental effort.
General scheme for sequence-targeted analysis
The following scheme outlines the generalised methodology comprising the current invention. The scheme relates to a general situation where it is required to perform a genetic analysis of DNA comprising particular regions of interest in a number, m, (where m = 1, 2, 3 etc.) of populations of individuals, each population containing N_i individuals (i = 1, 2, ..., m).
Key features and benefits of the invention are identified at each step, together with examples to illustrate the scope of application of the invention.
Step 1: Collect DNA samples from each individual concerned, within the or each population.
Example A: It may be desired to determine the frequency of occurrence of a particular allele within a population of plant species and varieties within a particular genus. In this case, there would be one population (i.e. m=1) and the "individuals" would comprise each of the distinct species/varieties available.
Example B: It may be desired to investigate the genetic basis of a particular medical condition of humans. In this case, there may be two populations (i.e. m=2) comprising, say, 50 people with the condition, and 65 people without the condition (i.e. N_1=50 and N_2=65). DNA samples would be obtained from each of these 115 individuals by any number of means that will be apparent to the skilled addressee, and purified to the extent required for selective amplification (e.g. by PCR).
Step 2: Combine the DNA samples within each population to obtain m pooled population DNA samples.
Example A: For the case of Example B in Step 1, the 50 samples of DNA from the first population would be pooled together to form a first pooled population DNA sample, and the 65 samples from the second population would similarly be pooled together to form a second pooled population DNA sample.
Step 3: For each of R regions of interest within the genome, construct a pair of PCR primers, each pair of primers defining the extremities of the regions of interest.
Example A: In the embodiment to be described more fully below (Embodiment I), 15 regions of interest were identified within the genome (i.e. R=15), comprising three regions in each of 5 genes. Thus, 15 pairs of primers were constructed. In this example, routine quality control of the following PCR reactions (see Step 4) revealed that two of the primer pairs failed to give appropriate PCR products, and so those PCR products were discarded. For the ensuing analysis, therefore, the number of primer pairs was reduced accordingly (i.e. R=13). Small departures from the overall protocol, such as this, will be routinely expected, and easily overcome by the skilled but uninventive addressee.
Methods of constructing primer pairs (i.e. one for each strand of the double-stranded DNA) for amplification of specific regions of DNA will be readily apparent to the skilled addressee. Typically, a sequence of around 20 nucleotide bases will be used for each primer.
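As a rough illustration of what "one primer for each strand" means, the following Python sketch derives a naive primer pair from the two ends of a region of interest: the forward primer copies the first bases of the region, and the reverse primer is the reverse complement of its last bases. This is a simplification made for illustration, not the applicant's design procedure; practical primer design also weighs melting temperature, GC content and specificity, none of which is modelled here. The function names and the toy sequence are hypothetical.

```python
# Naive end-of-region primer picking; real design tools consider far more.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_pair(region: str, length: int = 20) -> tuple[str, str]:
    """Return (forward, reverse) primers marking the extremities of the region.

    The default of 20 bases follows the 'around 20 nucleotide bases' in the text.
    """
    region = region.upper()
    if len(region) < 2 * length:
        raise ValueError("region is too short for two non-overlapping primers")
    forward = region[:length]
    reverse = reverse_complement(region[-length:])
    return forward, reverse


if __name__ == "__main__":
    toy_region = "AAAACCCCGGGGTTTTACGTACGTAAAACCCCGGGGTTTTACGT"  # hypothetical
    print(primer_pair(toy_region))
```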
Step 4: Amplify the regions of interest in each of the m pooled DNA samples using the R pairs of primers, by either:
(a) performing R PCR reactions on the or each pooled DNA sample, each PCR reaction using one of the R primer pairs; or
(b) combining some of the primer pairs to form primer pair groups and performing a number (<R) of PCR reactions on the or each pooled DNA sample, such that each primer pair is used at least once; or
(c) combining all the primer pairs, and performing one PCR reaction on the or each pooled DNA sample.
Example A: In the embodiment to be described in more detail below (Embodiment I), a routine PCR procedure was employed, using a proprietary kit (High Fidelity Taq, Norvatis). In this case, the initial 15 PCR reactions were carried out independently.
Example B: Situations may arise when, say, three primer pairs are to be used (i.e. R=3). Two of these may define particularly long DNA sequences, and the third one may define a short sequence. In instances such as this, it may be more convenient (especially if many populations are to be studied) to combine the two primer pairs defining the long sequences and, following the procedure in Step 4(b), perform one PCR reaction using the combined primer pairs, and one PCR reaction using the remaining primer pair. In this way, the PCR reaction conditions (such as melting temperature and annealing time) may be optimised in the first instance for the long sequences and in the second instance for the short sequence.
Example C: If all the sequences of interest require similar PCR reaction conditions, then the procedure in Step 4(c) may be used, combining all the primer pairs and running just one PCR reaction.
Example D: If it is known that there is likely to be a high degree of similarity, or homology, between DNA sequences in different, but perhaps related species, then a single primer may be designed for use across a number of species. For example, a primer may be designed using knowledge of the rice genome, and used to probe other cereals.
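The grouping option in Step 4(b) can be sketched in Python as follows. The sketch sorts primer pairs by expected amplicon length and starts a new reaction group whenever a length exceeds the shortest member of the current group by more than a chosen spread. Using amplicon length alone as the grouping criterion, the 500-base default spread and all names are assumptions for illustration; in practice other reaction conditions would also influence which primer pairs can share a reaction.

```python
# Hypothetical grouping rule: amplicon length is the only criterion considered.
def group_primer_pairs(
    amplicon_lengths: dict[str, int], max_spread: int = 500
) -> list[list[str]]:
    """Group primer pairs whose expected amplicon lengths lie within max_spread bases."""
    groups: list[list[str]] = []
    group_start = None  # shortest amplicon length in the current group
    for name, length in sorted(amplicon_lengths.items(), key=lambda kv: kv[1]):
        if group_start is None or length - group_start > max_spread:
            groups.append([name])  # start a new PCR reaction group
            group_start = length
        else:
            groups[-1].append(name)  # similar length: share the reaction
    return groups


if __name__ == "__main__":
    # Mirrors Example B: two long targets and one short target (lengths invented).
    lengths = {"pairA": 3000, "pairB": 3200, "pairC": 400}
    print(group_primer_pairs(lengths))
```

For the invented lengths above, the two long targets end up in one reaction and the short target in another, matching the arrangement described in Example B.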
Step 5: If more than one PCR reaction was performed on any of the pooled DNA samples, then either:
(a) combine the PCR products resulting from each pooled DNA sample; or
(b) assess the relative concentration of PCR product resulting from each pooled DNA sample, and combine the PCR products in appropriate ratios so as to produce approximately equal concentrations of PCR products from each primer pair in each pooled DNA sample.
Example A: The quantity of amplified sequence produced by the PCR reaction is often dependent on experimental conditions such as the quality and quantity of DNA polymerase used, the affinity of the primers for the target sequences, the length of the target sequence and the concentration of nucleotide bases. Thus, in order to optimise the following step of library creation, it is preferable to ensure that similar amounts of PCR-produced DNA are combined. PCR products may readily be quantified (e.g. by the intensity of bands appearing on an electrophoresis gel), and this may be used to normalise the concentrations. This is the rationale for Step 5(b).
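The arithmetic behind the normalisation in Step 5(b) is simple: to contribute the same amount of DNA from every reaction, the volume taken from each reaction is the target amount divided by that reaction's concentration. The Python sketch below shows this; the concentration values, units (ng/µl and ng) and names are hypothetical, and in practice the concentrations might only be rough estimates from gel band intensity.

```python
# Volume to take from each PCR reaction so that every product contributes
# the same amount of DNA to the pool. All numbers below are invented.
def pooling_volumes(
    concentrations_ng_per_ul: dict[str, float], amount_per_product_ng: float
) -> dict[str, float]:
    """Return the volume (in microlitres) to pipette from each reaction."""
    volumes = {}
    for name, conc in concentrations_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"reaction {name!r} has no measurable product")
        volumes[name] = amount_per_product_ng / conc
    return volumes


if __name__ == "__main__":
    estimated = {"reaction1": 10.0, "reaction2": 20.0, "reaction3": 40.0}
    print(pooling_volumes(estimated, amount_per_product_ng=100.0))
```

A reaction with a weaker band therefore contributes a proportionally larger volume, so that no single product dominates the pooled PCR product.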
Step 6: Create a library from the pooled PCR product for each of the populations. Methods for creating "libraries" from mixed populations of DNA fragments are well-known to those skilled in the art of molecular biology. A number of variants of library-creation are known, each of which may have particular benefits in the application of this overall methodology.
Example A: In the embodiment to be described in more detail below (Embodiment I), the DNA fragments were cloned into a well-known vector (sold under the Registered Trade Mark "pGem" by the Promega Corporation, USA) and used to transform a culture of E. coli. The pGem vector contains a LacZα coding region allowing selection of recombinant organisms by detection of the blue/white colour change in the presence of X-Gal (5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside). The vector also contains T7 and SP6 promoters that serve as sequencing primer binding sites. Other vectors are available (and will doubtless be developed) for construction of DNA libraries, and alternative flanking sites may be designed to facilitate subsequent processing and analysis of the library.
The library created by Step 6 has a number of advantageous features:
(i) Each positive integrant clone in the library contains one copy of the genetic sequence from one of the regions of interest.
(ii) Each of these sequences may optionally be flanked by a universal sequencing primer, introduced by suitable choice of vector for the library creation. This leads to much improved sequencing quality when the library is subsequently used for sequencing studies.
(iii) The library is free from background DNA that would normally be present in the direct PCR product.
(iv) The library is free from excess primer pairs that would normally be present in the direct PCR product.
(v) The library is free from unpolymerised nucleotide bases that would normally be present in the direct PCR product.
(vi) For a given region of interest, the library contains a spectrum of genetic sequences in the same ratio as that in the original population. As a result, the library may be used to determine allelic frequencies within its source population. Also, comparison of allelic frequencies between libraries from different populations provides a measure of the differing allelic frequencies between the populations themselves. Thus, determination of the genetic source of population-dependent phenotypic traits is facilitated.
(vii) Each gene sequence from the region(s) of interest within the library is present in the same genetic background, and within the same construct. As a result, optimal analysis methodologies (e.g. for sequencing) will be substantially the same for each sequence, and so such a library provides an ideal starting point for automated analysis.
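Feature (vi) can be illustrated with a short sketch of how allelic frequencies might be estimated from sequenced clones and compared between two libraries. This is an illustration under stated assumptions, not the procedure of the specification: the function name and the allele calls are hypothetical, and each clone is assumed to have been reduced to a single allele call at one variable site.

```python
from collections import Counter


def allele_frequencies(clone_alleles):
    """Estimate allele frequencies from the alleles observed in sequenced clones."""
    counts = Counter(clone_alleles)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical allele calls at one site, one entry per sequenced clone.
patient_clones = ["A"] * 12 + ["G"] * 8
control_clones = ["A"] * 5 + ["G"] * 15

patient_freq = allele_frequencies(patient_clones)
control_freq = allele_frequencies(control_clones)

# Difference in estimated frequency between the two libraries, per allele.
alleles = sorted(set(patient_freq) | set(control_freq))
difference = {
    a: patient_freq.get(a, 0.0) - control_freq.get(a, 0.0) for a in alleles
}
print(difference)
```

Because each clone carries one copy of one fragment, the proportion of clones carrying an allele serves as the estimate of that allele's frequency in the pooled population.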
Step 7: Analyse clones taken from the library. Example A: In the embodiment to be described in more detail below (Embodiment I), from each of the three libraries constructed, 200 white colonies (i.e. positive transformants) were selected and sequenced, using the flanking regions as sequencing primers. Methods for sequencing DNA from such libraries are known in the art. The sequences from each library were assembled using the GAP4 program from the Staden software suite (MRC Laboratory of Molecular Biology, Cambridge, UK).
The method has particular advantages in revealing the association between sequence variation and phenotypic variation. This is particularly useful in assessment of treatment efficacy. Additionally, the method has application in the discovery of SNPs, their validation and association with phenotypic traits.
Preferred Embodiments of the Invention-Embodiment I
This embodiment describes the use of the general methodology for the discovery of DNA sequence variations and allele frequency in a Crohn's Disease patient population and an ulcerative colitis patient population by comparison with a control population of healthy individuals.
DNA samples were collected, with consent, from individuals belonging to each of the three populations (i.e. m=3). Each of the three populations contained DNA from 20 individuals (i.e. N_i=20, for i=1,2,3). The DNA concentration in each sample was 0.75 g/litre. The first pooled sample contained DNA from patients with Crohn's Disease; the second contained DNA from patients with ulcerative colitis; the third sample contained DNA from healthy volunteers.
Three pairs of primers were designed from the genomic sequence of each of five genes: HEM1, ERBB3, Interleukin 23A, STAT6 and Interferon Gamma. Thus, the methodology used a total of 15 primer pairs (i.e. R=15).
Each pair of primers was designed to amplify 800 to 1200 base pair fragments. A routine PCR procedure (high fidelity Taq, Norvatis) was carried out separately for each pair of primers. Accordingly, 15 PCR reaction products were obtained from each of the three pooled DNA samples.
The PCR products from each of these reactions were examined using gel electrophoresis. Two pairs of primers prepared from the STAT6 gene failed to produce PCR products, and so were excluded from the study (i.e. R was reduced to 13). Two pairs of primers produced multiple fragments (two bands and three bands) and were treated as normal PCR products. Inspection of the density of bands on the electrophoresis gel allowed the reactions to be divided into three categories, corresponding to the approximate concentration of the PCR products. This allowed normalisation of the quantity of DNA fragments from each reaction. Following normalisation, the 13 reactions from each pool of DNA sample were pooled into one tube (the 'pooled PCR product') and cloned into a pGem (Promega Corporation, USA) vector using routine procedures, and transfected into E. coli. This resulted in the construction of three DNA libraries.
Each of these libraries would therefore contain at least 16 fragments (i.e. 13 reactions, 11 with one band, one with two bands and one with three bands) from its corresponding pooled DNA sample, this sample itself containing all the DNA variation from the population used to construct it.
It will be evident that each library could contain more than 16 fragments, as additional fragments with sequence lengths similar to those expected would not have been readily apparent on the electrophoresis gel.
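The fragment count above follows from simple arithmetic, which can be checked with a short sketch. The variable names are illustrative; the numbers are those given in this embodiment.

```python
# Primer design: three primer pairs for each of five genes.
genes = 5
primer_pairs_per_gene = 3
designed_pairs = genes * primer_pairs_per_gene  # R = 15

# Two STAT6 primer pairs gave no product and were excluded.
failed_pairs = 2
usable_reactions = designed_pairs - failed_pairs  # R = 13

# Bands seen on the gel: 11 reactions with one band, one with two, one with three.
bands_per_reaction = [1] * 11 + [2] + [3]
assert len(bands_per_reaction) == usable_reactions

# Each visible band is at least one distinct fragment, so this is a lower bound.
minimum_fragments = sum(bands_per_reaction)
print(designed_pairs, usable_reactions, minimum_fragments)
```

The total is a lower bound because co-migrating fragments of similar length would appear as a single band.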
From each of the three DNA libraries, 200 transformed colonies were selected (using the white/blue reaction supplied by the pGem vector) and sequenced by routine methodology.
The sequences from each library were assembled separately using the GAP4 programme in the Staden software suite (Medical Research Council Laboratory of Molecular Biology, Cambridge, UK).
The DNA sequence variations were revealed by this GAP4 assembly. The analysis revealed 22 variation sites, of which 20 variations were within groups. 45 haplotypes were noted. Six of these haplotypes showed a skewed frequency distribution between the three populations. These are thus ideal candidates for further genotyping studies to identify the genetic association with the two diseases.
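The specification does not state how a skewed frequency distribution was judged. One conventional approach, assumed here purely for illustration, is a Pearson chi-square test on a table of clone counts (clones carrying a given haplotype versus all other clones covering the same region, in each of the three libraries). The counts below are hypothetical and are not results from this embodiment.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts. Columns: Crohn's Disease, ulcerative colitis, control.
table = [
    [30, 10, 10],     # clones carrying the haplotype of interest
    [170, 190, 190],  # clones carrying any other haplotype at that region
]
statistic = chi_square_statistic(table)

# Critical value of chi-square at the 5% level for (2-1)*(3-1) = 2 degrees of freedom.
CRITICAL_5_PERCENT_DF2 = 5.991
skewed = statistic > CRITICAL_5_PERCENT_DF2
print(round(statistic, 3), skewed)
```

A haplotype flagged in this way would still require confirmation by genotyping individual samples, as the text notes.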
Preferred Embodiments-Embodiment II
In this example, the same DNA samples as described in Embodiment I were used, i.e. three pooled DNA samples representing individuals with Crohn's Disease, ulcerative colitis, and individuals from a control population.
In this embodiment, 50 pairs of primers were designed representing 50 locations to cover a region of chromosome 12, between microsatellite markers D12S368 and D12S1632. These markers are at positions 50947746 and 66834816 on chromosome 12. This region also contains the microsatellite markers D12S1632, D12S910 and D12S83.
Within these 50 locations or loci, 11 were from the experiment of Embodiment I (the ones giving the single band gel pattern), 26 were from known genes in addition to those 5 genes used in Embodiment I, six were from around the microsatellite markers D12S83, D12S85 and D12S90. A further eight were randomly chosen to fill the gaps. On average, the gap among the 50 selected sequences was 255kb. The largest gap was around 88kb and the smallest gap about 50kb.
Using a methodology similar to that described in Embodiment I, three libraries were constructed after the PCR products were checked and selected. From the PCR products, 48 PCR reactions were selected for use in library creation, and 57 DNA fragments were expected to be represented.
From each of these three libraries, 600 transformant colonies were selected and sequenced. The sequences were analysed in the same way as in Embodiment I. In total, 231 haplotypes were noted, 20 of these showing a skewed frequency distribution between the three population groups, including the 6 haplotypes identified in Embodiment I.
Evaluation of the Complexity of the Library
For a given set of primers, the number of amplified unique DNA fragments is subject to uncertainty, for two possible reasons. Firstly, some genes may have been present in more than one copy, and each copy may have a unique sequence, i.e. the same primer pair would result in more than one amplified fragment. In this case, the number of unique DNA fragments will depend on a number of factors, such as how conserved the primer sequence is, and how stringent the PCR reaction is. Secondly, some primers may have been purposely designed to be relatively non-specific. This may be the case when using sequence information from one species to probe the genome of a different, but closely related, species. In this instance, therefore, the complexity of the amplified DNA pool (and therefore the DNA library itself) will be difficult to predict. Thus, in both these cases, there is a requirement to test or measure the complexity of the DNA pool from the sequencing results. The benefits of this include aiding the decision on how many colonies should be sequenced from the library to ensure good coverage of the genes to be sequenced.
A statistical analysis shows that the complexity of the pool can be estimated from the formula:
[Formula: image imgf000014_0001 in the published application]
where:
E is the size, or complexity, of the pool (i.e. the number of unique sequences it contains);
N is the total number of sequences that have been determined, there being k unique such sequences;
f_i is the number of occurrences of the i-th unique sequence (i = 1, 2, ..., k).
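The quantities entering the estimate (the total number of sequences N, the number of unique sequences k, and the occurrence count of each unique sequence, written f_i here) can be tallied directly from the sequencing results, as in the sketch below. Because the formula itself is available only as an image, the estimator shown is not the formula of this specification: the bias-corrected Chao1 richness estimator is swapped in as a well-known stand-in that uses the same kind of input. The function names and the read labels are hypothetical.

```python
from collections import Counter


def tally(sequences):
    """Return N (total reads), k (unique sequences) and the per-sequence counts."""
    counts = Counter(sequences)
    return len(sequences), len(counts), counts


def chao1_bias_corrected(counts):
    """Bias-corrected Chao1 estimate of the number of distinct sequences in the pool.

    Stand-in estimator only; it is not the formula referred to in the text.
    """
    k = len(counts)
    singletons = sum(1 for n in counts.values() if n == 1)
    doubletons = sum(1 for n in counts.values() if n == 2)
    return k + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical sequencing results, each read already assigned to a unique sequence.
reads = ["frag_A", "frag_A", "frag_B", "frag_C", "frag_C", "frag_C", "frag_D"]

N, k, f = tally(reads)
estimate = chao1_bias_corrected(f)
print(N, k, dict(f), estimate)
```

An estimate well above k suggests that further colonies should be sequenced before the library can be considered adequately covered.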

Claims

1. A method for sequence-targeted analysis of one or more desired regions of interest within one or more nucleic acid samples comprising the steps of:
(a) combining the plurality of nucleic acid samples - if there are more than one - to form a pooled population nucleic acid sample;
(b) amplifying the desired region(s) of interest from within the one - or the pooled - nucleic acid sample by use of the polymerase chain reaction (PCR), said PCR reaction using one or more pairs of primers, said pair or pairs of primers defining said region or regions of interest;
(c) optionally normalising the concentration of PCR products resulting from said amplification step;
(d) combining said PCR products to form a pooled PCR product;
(e) creating a nucleotide library from said pooled PCR product;
(f) analysing individual clones isolated from said nucleotide library.
2. The method of Claim 1 wherein the amplification step, (b), is carried out using a plurality of primer pairs within the same PCR reaction.
3. The method of any preceding claim wherein the analysis step, (f), comprises the determination of the sequence of DNA, in a plurality of cells isolated from the nucleotide library, derived from the initial nucleic acid sample.
4. The method of any preceding claim wherein a common tail sequence of bases is added to the 5' end of each primer defining the region or regions of interest.
5. The method of claim 4 wherein the common tail sequence comprises fewer than six bases.
6. The method of claim 5 wherein the common tail sequence comprises 5' GAT, 5' GAC or 5' AGA.
PCT/GB2005/000176 2004-02-05 2005-01-17 Sequence-specific dna analysis WO2005075676A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0402530.0 2004-02-05
GB0402530A GB2410796A (en) 2004-02-05 2004-02-05 Sequence specific DNA analysis

Publications (1)

Publication Number Publication Date
WO2005075676A1 true WO2005075676A1 (en) 2005-08-18

Family

ID=31985697

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2005/000176 WO2005075676A1 (en) 2004-02-05 2005-01-17 Sequence-specific dna analysis

Country Status (2)

Country Link
GB (1) GB2410796A (en)
WO (1) WO2005075676A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5849492A (en) * 1994-02-28 1998-12-15 Phylogenetix Laboratories, Inc. Method for rapid identification of prokaryotic and eukaryotic organisms
WO1999051774A2 (en) * 1998-04-02 1999-10-14 Tellus Genetic Resources, Inc. A method for obtaining a plant with a genetic lesion in a gene sequence
WO2002101090A2 (en) * 2001-06-13 2002-12-19 Centre National De La Recherche Scientifique Method for determining the existence of animal or vegetable mixtures in organic substrates

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5795722A (en) * 1997-03-18 1998-08-18 Visible Genetics Inc. Method and kit for quantitation and nucleic acid sequencing of nucleic acid analytes in a sample
US6830887B2 (en) * 1997-03-18 2004-12-14 Bayer Healthcare Llc Method and kit for quantitation and nucleic acid sequencing of nucleic acid analytes in a sample
WO2001055454A1 (en) * 2000-01-28 2001-08-02 Althea Technologies, Inc. Methods for analysis of gene expression

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHAPMAN DEMIAN D ET AL: "A streamlined, bi-organelle, multiplex PCR approach to species identification: Application to global conservation and trade monitoring of the great white shark, Carcharodon carcharias.", CONSERVATION GENETICS, vol. 4, no. 4, 2003, pages 415 - 425, XP002327285, ISSN: 1566-0621 *
MALEHORN DAVID E ET AL: "Detection of cystic fibrosis mutations by peptide mass signature genotyping.", CLINICAL CHEMISTRY. AUG 2003, vol. 49, no. 8, August 2003 (2003-08-01), pages 1318 - 1330, XP002327287, ISSN: 0009-9147 *
SHAW S H ET AL: "ALLELE FREQUENCY DISTRIBUTIONS IN POOLED DNA SAMPLES: APPLICATIONS TO MAPPING COMPLEX DISEASE GENES", GENOME RESEARCH, COLD SPRING HARBOR LABORATORY PRESS, US, vol. 8, no. 2, February 1998 (1998-02-01), pages 111 - 123, XP001154417, ISSN: 1088-9051 *
ZIMMERMANN K ET AL: "Digestion of terminal restriction endonuclease recognition sites on PCR products.", BIOTECHNIQUES. APR 1998, vol. 24, no. 4, April 1998 (1998-04-01), pages 582 - 584, XP002327286, ISSN: 0736-6205 *

Also Published As

Publication number Publication date
GB2410796A (en) 2005-08-10
GB0402530D0 (en) 2004-03-10

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase