WO2000065088A2 - Primers for identifying typing or classifying nucleic acids - Google Patents

Primers for identifying typing or classifying nucleic acids

Publication number
WO2000065088A2
Authority
WO
WIPO (PCT)
Prior art keywords
primers
primer
extendible
hla
nucleic acid
Prior art date
Application number
PCT/EP2000/003636
Other languages
French (fr)
Other versions
WO2000065088A3 (en
Inventor
Per-Johan Ulfendahl
Kin-Chun Wong
Original Assignee
Amersham Pharmacia Biotech Ab
Priority date
Filing date
Publication date
Application filed by Amersham Pharmacia Biotech Ab filed Critical Amersham Pharmacia Biotech Ab
Priority to AU50625/00A priority Critical patent/AU5062500A/en
Publication of WO2000065088A2 publication Critical patent/WO2000065088A2/en
Publication of WO2000065088A3 publication Critical patent/WO2000065088A3/en

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C12Q1/6874 Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6834 Enzymatic or biochemical coupling of nucleic acids to a solid phase
    • C12Q1/6837 Enzymatic or biochemical coupling of nucleic acids to a solid phase using probe arrays or probe chips
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6881 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for tissue or cell typing, e.g. human leukocyte antigen [HLA] probes
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/172 Haplotypes
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • DNA-sequence analysis is rapidly becoming a standard tool in modern molecular biology research. Examples of applications include sequencing of unknown DNA sequences, identifying novel genes in stretches of sequenced DNA, predicting protein sequence and structure from DNA sequence alone, and identifying known gene variations (sometimes called "typing a gene").
  • HLA Human Leucocyte Antigen
  • MHC Major Histocompatibility Complex
  • Another application where a rapid and accurate identification of a gene is desired is when trying to identify unknown bacteria.
  • a rapid identification of the bacteria causing the illness of a patient makes it possible to administer the correct medication early in the treatment of the disease, thus reducing the discomfort for the patient.
  • Since every self-replicating organism so far studied uses ribosomes when translating mRNA to proteins, analysis of one of the genes coding for the ribosome, for instance the 16S rRNA gene in the case of prokaryotes, could be used to identify the organism in question.
  • APEX Arrayed Primer Extension
  • the arrayed primer extension method (APEX) for resequencing would need more than 16,000 primers if all DQB alleles were to be sequenced from a 500 bp PCR fragment. If all pairs of DQB alleles had to be covered, as is the case for the heterozygotes found in most individuals, the number of primers might be even higher. This might not be necessary, however, if some variations always or never occur together. This needs to be studied, and a way found to determine the least number of primers (and their sequences) required for unambiguously identifying those genes.
  • An object of this invention is to find and implement an efficient algorithm capable of doing just that.
  • the algorithm should preferably also take into account the melting points of the primers, so that the extension reaction can take place under optimal conditions for all of the primers on the chip. It should also minimise the number of "self-extended" primers, i.e. primers that can extend themselves without any sample DNA.
  • This algorithm is then to be tested and evaluated on the HLA and 16S rRNA genes.
  • HLA is chosen partly because of the importance of rapid typing of these genes, leading to the fact that there are many other methods to which APEX can be compared. It is also because the HLA genes are "easy" to work with, since they rarely contain any insertions or deletions. These kinds of variations in the gene could potentially create problems when designing primers for APEX.
  • the 16S rRNA contains insertions and deletions and can thus be used to see if the algorithm can handle such variations.
  • the invention provides a method of identifying a set of extendible primers for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms wherein: i) all possible nucleotide sequences of a chosen length of the nucleic acid are identified and their corresponding extendible primers, ii) at least one extendible primer is removed from the set wherein the at least one primer removed identifies a segment of the nucleic acid identified by at least one other primer.
  • the method includes between step i) and ii): ia) potential extensions for each primer are identified with respect to each nucleotide sequence, ib) for each extendible primer the identified potential extensions are compared to determine which pairs of sequences can be discriminated by the primer.
  • a matrix of primers and pairs of primer extensions is prepared in binary form and is subjected to analysis by a set covering problem (SCP) algorithm as described in more detail below.
  • SCP set covering problem
  • the invention also includes a set of extendible primers, for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms, identified by the method as defined.
  • the primers are attached by 5'-ends to a surface of a support on which they are presented in the form of an array.
  • the invention provides a set of extendible primers, for use in the identification, typing or classification of a human leucocyte antigen (HLA) gene as indicated, the set comprising about the number of primers indicated and being capable of distinguishing about the number of alleles indicated:
  • the invention provides a set of extendible primers, for use in the identification, typing or classification of 16S rRNA, wherein the set comprises about 210 primers and is capable of distinguishing at least about 1207 different sequences.
  • the approximate number of primers is indicated. As indicated below, it may be possible by the use of the algorithms exemplified or other algorithms to generate slightly smaller sets of primers capable of distinguishing the number of alleles or sequences indicated, and these sets are envisaged according to the invention. Of course, other primers may be present in addition to those indicated as essential, and may be useful for checking purposes.
  • the number of alleles or sequences indicated represents the approximate known number of polymorphisms or different sequences, and these will surely increase with time.
  • the invention provides a method of identification, typing or classification of a nucleic acid of known sequence having known polymorphisms, by the use of the set of extendible primers as defined, which method comprises applying the nucleic acid or fragments thereof to the set of extendible primers under hybridisation conditions and effecting template-directed chain extension of extendible primers that have formed hybrids.
  • template-directed chain extension is effected using four different fluorescently labelled chain-terminating nucleotide analogues, and results are analysed by an imaging system such as total internal reflection fluorescence (TIRF) or scanning confocal microscopy.
  • TIRF total internal reflection fluorescence
  • the various steps of the method may be performed as described in the literature for the known APEX technique.
  • the invention provides a kit for use in the identification, typing or characterisation of a nucleic acid of known sequence having known polymorphisms, comprising the set of extendible primers as defined.
  • the invention provides an array of sets of extendible primers as defined, for the simultaneous identification, typing or classification of two or more different HLA genes.
  • In the present invention it has been realised that, where a number of different alleles are to be identified, the total number of primers required to distinguish each of the alleles could be reduced, as some primers would, for example, be common to all of the alleles.
  • complete sets of primers for identification of each allele are identified and then the total number of primers in the combined sets is reduced using predetermined rules.
  • the present invention is based on the premise that as the primers are used to identify the presence or absence of a particular nucleotide sequence in any allele, the specific nucleotide that extends any particular primer is of less relevance than simply whether the primer has been extended.
  • Figure 1 is a diagram of a signal matrix in accordance with the present invention;
  • Figure 2 is a diagram of the corresponding binary matrix for the signal matrix of Figure 1;
  • Figure 3 is a flow diagram of the steps for reducing the primer set in accordance with the present invention.
  • The following is an explanation to assist in an understanding of the principles underlying the manner in which the number of primers used in the identification of a plurality of sequences may be reduced.
  • the number of primers required to identify k sequences grows as O(k·l), where l is the length of the sequences, as each sequence requires l primers.
  • the less the sequences differ from one another the fewer primers are required as many of the primers required for identification of a first sequence may also be of use in identification of another sequence. This effect becomes more pronounced the greater the number of sequences to be identified and the greater the similarities.
  • a signal matrix of k × n (k sequences by n primers) can be constructed. Each element in the matrix represents the signal, if any, that is generated by a particular primer with respect to a particular sequence.
  • the signal will either be one of the four nucleotides 'A', 'C', 'G' or 'T', or no signal, '-'.
  • Figure 1 is an example of such a signal matrix where, for example, the signal generated by primer 2 with respect to sequence 3 is 'T'.
  • the signal matrix is then converted into a binary matrix that represents whether the signals for any particular primer differ with respect to different sequences.
  • the same signal 'G' is generated for both sequences 1 and 2 but a different signal 'T' is generated with respect to sequence 3.
  • the binary matrix is constructed by considering each column (each primer) of the signal matrix and comparing each signal in that column in turn.
  • the first row of the matrix represents a comparison of the signals for the first and second sequences
  • the second row represents a comparison of the signals for the first and third sequences
  • the third row represents a comparison of the signals for the second and third sequences.
  • Binary '0' represents a comparison revealing the same signal
  • binary '1' represents a comparison revealing different signals.
  • Conversion to the binary matrix renders the data suitable for mathematical analysis.
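The conversion from signal matrix to binary matrix described above can be sketched as follows. This is a minimal illustration only: the function name `binary_matrix` and the three-by-three signal matrix are hypothetical and are not taken from Figure 1 or Figure 2. Rows of the result correspond to pairs of sequences in the order (1,2), (1,3), (2,3), and columns correspond to primers.

```python
from itertools import combinations


def binary_matrix(signal):
    """Build the pairwise-difference matrix from a signal matrix.

    signal[s][p] is the signal of primer p for sequence s:
    one of 'A', 'C', 'G', 'T', or '-' for no signal.
    """
    n_primers = len(signal[0])
    rows = []
    # One row per unordered pair of sequences: (0,1), (0,2), (1,2), ...
    for a, b in combinations(range(len(signal)), 2):
        rows.append(
            [1 if signal[a][p] != signal[b][p] else 0 for p in range(n_primers)]
        )
    return rows


# Hypothetical example: three sequences (rows) and three primers (columns).
signal = [
    ["A", "G", "-"],  # sequence 1
    ["A", "G", "C"],  # sequence 2
    ["C", "T", "C"],  # sequence 3
]
for row in binary_matrix(signal):
    print(row)
```

In this example the second primer gives 'G' for sequences 1 and 2 and 'T' for sequence 3, so it separates the pairs (1,3) and (2,3) but not the pair (1,2).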
  • the simplest heuristic is the greedy algorithm, where columns are added one at a time.
  • the column to be added in each step is chosen so as to cover as many uncovered rows as possible (a row is covered if it has at least one non-zero element in the columns selected so far).
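  • $S_r$ is the set of columns already included in the solution at iteration $r$
  • $R_r$ is the set of rows not yet covered (rows with no non-zero elements in the selected columns) at iteration $r$
  • column $j'$ is selected according to: $j' = \arg\min_{j \notin S_r} c_j / P_j$, where $c_j$ is the cost of column $j$ and $P_j$ is the number of rows in $R_r$ that column $j$ covers.
  • Instead of $c_j / P_j$, other terms can be used.
  • Example terms are $c_j$, $c_j / \log_2 P_j$ or $c_j / (P_j)^2$.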
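A greedy selection of this kind can be sketched as below. This is an illustrative sketch under stated assumptions, not the program described in the text: the function name `greedy_cover` is hypothetical, unit costs are assumed when none are given, each row only needs to be covered once, and the selection term is cost divided by the number of still-uncovered rows a column covers.

```python
def greedy_cover(matrix, costs=None):
    """Pick columns one at a time until every row is covered.

    matrix[i][j] is 1 if column j covers row i, otherwise 0.
    costs[j] is the cost of column j (unit costs by default, an assumption).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if costs is None:
        costs = [1.0] * n_cols
    uncovered = set(range(n_rows))
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for j in range(n_cols):
            if j in chosen:
                continue
            # Number of still-uncovered rows that column j would cover.
            gain = sum(1 for i in uncovered if matrix[i][j])
            if gain == 0:
                continue
            ratio = costs[j] / gain
            if best is None or ratio < best_ratio:
                best, best_ratio = j, ratio
        if best is None:
            raise ValueError("some rows cannot be covered by any column")
        chosen.append(best)
        uncovered -= {i for i in uncovered if matrix[i][best]}
    return chosen


# Hypothetical binary matrix: rows are sequence pairs, columns are primers.
example = [[0, 0, 1], [1, 1, 1], [1, 1, 0]]
print(greedy_cover(example))
```

Ties are broken in favour of the lower column index, which is an arbitrary choice made for this sketch.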
  • Greedy algorithms of this type are described in "An Efficient Heuristic for Large Set Covering Problems", Vasko, Wilson, Naval Research Logistics Quarterly 1984, 31:163-171, the contents of which are incorporated herein by reference. The difference between the terms is in how much emphasis is placed on the cost of the column versus how many rows the column covers. It is shown, however, that this entire class of heuristics shares the same worst case behaviour. If we denote the set of columns in the solution as S and the solution value as Z, then the worst case behaviour can be described as:
  • A better heuristic is believed to be some kind of Lagrangian relaxation heuristic, where in each iteration the Lagrange multipliers are used to calculate the Lagrangian cost for the columns.
  • A Lagrangian relaxation heuristic is described in "A Heuristic Method for the Set Covering Problem", Caprara et al., Technical Report OR-95-8, Operations Research Group, University of Bologna, 1995, the content of which is incorporated herein by reference.
  • a near optimal vector of these costs is then calculated by a subgradient algorithm, before being used as input to a greedy algorithm. This is repeated until no improvements in the solution can be made.
  • In Lagrangian subgradient methods, the Lagrangian of the original problem is considered instead of the original problem. In this case, the Lagrangian will be
  • $u_i$ is the Lagrangian multiplier for row $i$.
  • $c_j(u)$ is the Lagrangian cost associated with column $j$, and is defined by Equation 5.
  • An optimal solution to Equation 4 is given by
  • L(u) can also be seen as an estimate of the lower bound for the solution, i.e. the sum of the costs for the columns in the optimal solution to the SCP will be ≥ L(u).
  • the solution to the SCP can be found by finding an optimal multiplier vector u instead, but this requires much computation, especially for a large SCP. Near-optimal multiplier vectors can, however, be found within a short time by using the subgradient vector s(u), defined by
  • u can be refined iteratively by using for example
  • (Equation 8), where λ > 0 is a step-size parameter and UB is an upper bound on the value of the solution.
  • the initial $u^0$ can be defined arbitrarily.
  • To solve the SCP, first a near-optimal multiplier vector u is found. This and Equation 6 are then used as a basis to form a feasible solution. The upper bound UB can then be updated to the value of this feasible solution (if it is better than the previous best solution), and a new near-optimal multiplier vector found, and so on until convergence is reached.
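The equations referred to in this passage are not preserved in the text. For orientation, the standard Lagrangian relaxation of the single-cover SCP, as it commonly appears in the set covering literature, can be written as below. This is a sketch under assumptions rather than a reproduction of the original equations: the correspondence to Equations 4 to 8 is inferred from the surrounding wording, $I_j$ (the rows covered by column $j$) and $J_i$ (the columns covering row $i$) are notation introduced here, and the constant 1 in the subgradient would presumably be replaced by the required coverage where each row must be covered several times.

```latex
\begin{align}
L(u)      &= \min_{x \in \{0,1\}^n} \; \sum_{j} c_j(u)\, x_j + \sum_{i} u_i
             && \text{(Lagrangian, cf. Equation 4)} \\
c_j(u)    &= c_j - \sum_{i \in I_j} u_i
             && \text{(Lagrangian cost, cf. Equation 5)} \\
x_j(u)    &= \begin{cases} 1 & \text{if } c_j(u) < 0 \\ 0 & \text{if } c_j(u) > 0 \\ 0 \text{ or } 1 & \text{if } c_j(u) = 0 \end{cases}
             && \text{(optimal solution, cf. Equation 6)} \\
s_i(u)    &= 1 - \sum_{j \in J_i} x_j(u)
             && \text{(subgradient, cf. Equation 7)} \\
u_i^{k+1} &= \max\!\left\{ u_i^{k} + \lambda\, \frac{UB - L(u^{k})}{\lVert s(u^{k}) \rVert^{2}}\, s_i(u^{k}),\; 0 \right\}
             && \text{(multiplier update, cf. Equation 8)}
\end{align}
```

With multipliers $u \ge 0$, $L(u)$ never exceeds the optimal SCP value, which is why it serves as a lower bound while UB comes from the best feasible solution found so far.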
  • Another alternative computational method that may be employed to solve such a SCP is 'surrogate relaxation' in which in each iteration a corresponding continuous problem is solved and made feasible before a sub-gradient algorithm is applied.
  • genetic algorithms may be employed in which the 'genome' consists of n bits, one bit for each of the columns.
  • a primer in the selected reduced set may generate a negative, '-', signal rather than a positive signal, A, C, G, T.
  • the least number of positive signals as well as the least number of differences in the signal pattern is preferably larger than one.
  • all possible primers are selected (10) using the standard APEX procedure to produce a first set of primers.
  • a substring of the sequence to be analysed is used to construct one primer, then the substring is displaced by one base and another primer is constructed. This process is carried out from the start of the sequence until the entire sequence has been covered. Both strands of DNA are used and this is repeated for all sequences.
  • the primers should be long enough to be capable of discriminating between exact matches and mismatches involving one or two nucleotide pairs. Conveniently, the primers are 13bp long as this has been found to be sufficient to ensure the reaction, or longer to increase hybrid stability. However, to avoid steric hindrance on the chip each primer may be 5'-tailed. In this example, twelve T's are added to the 5'-end of the primer so that the final length of the primers is 25bp.
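The sliding-window construction of candidate primers, including the 5' tail, can be sketched as follows. This is a minimal sketch under assumptions: the function names and the example sequence are hypothetical, primers are written 5' to 3', and duplicates arising from different windows are collapsed into a single candidate.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_primers(sequence, length=13, tail="T" * 12):
    """Enumerate every window of `length` bases on both strands.

    The window is displaced by one base at a time, and each resulting
    primer is given the 5' tail (twelve T's by default).
    """
    primers = set()
    for strand in (sequence, reverse_complement(sequence)):
        for start in range(len(strand) - length + 1):
            primers.add(tail + strand[start:start + length])
    return sorted(primers)


# Hypothetical 15-base sequence: three windows per strand.
primers = candidate_primers("ACGTTGCAAGGCTTA")
print(len(primers), len(primers[0]))
```

Running the same function over every allele and pooling the results would give the first set of primers referred to in the text.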
  • candidates that are not suitable as primers are rejected (12) and the rest are included in a primary primer set.
  • Unsuitable primers are those where the three bases at the 3'-end are complementary to any substring of the primer. In some instances this can result in the primer being extended using a neighbouring primer, rather than the sample DNA, as a template, and for that reason such primers are considered unsuitable.
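The self-extension test described above can be sketched as a simple substring check. This is an illustration under assumptions: the function name `may_self_extend` is hypothetical, "complementary" is read as antiparallel (reverse-complement) pairing, and the check is applied to whatever string is passed in, so whether the poly-T tail is included is left to the caller.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def may_self_extend(primer, anchor=3):
    """True if the 3'-terminal bases can pair with a site inside the primer.

    The last `anchor` bases can anneal to a neighbouring copy of the primer
    if the primer contains their reverse complement.
    """
    site = reverse_complement(primer[-anchor:])
    return site in primer


print(may_self_extend("AGCTTGACCAGCT"))  # 3' end GCT pairs with AGC at the 5' end
print(may_self_extend("ACGTTGCAAGGCT"))  # no AGC present in this primer
```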
  • any primers that would produce ambiguous signals are identified and rejected (14).
  • a primer produces an ambiguous signal where it is not known which of the four bases is in the relevant position.
  • Each of the remaining primers in the primary primer set is then compared to each sequence in turn to determine whether the primer is extendible by each sequence and, if the primer is extendible, the base with which it would be extended is determined.
  • a signal matrix of the primers with respect to each of the sequences is thus generated (16).
  • the three bases in the 3'-end of the primer must hybridise to the DNA. Otherwise the enzyme responsible for the extension will not be able to add a nucleotide to the primer. Of the rest of the primer (the poly-T tail excluded), at most two mismatches are allowed, otherwise the primer-DNA duplex is considered to be too unstable to be extended. In ordinary PCR, all the bases must match in order for the primer to be extended. But then the temperature is raised to the melting point, $T_m$, of the primer in the extension step. In APEX, this reaction is carried out at 45°C, which is around 10-20°C below the $T_m$ of most primers. This means that the primers will hybridise to the DNA despite a few mismatches, which is why two mismatches are allowed here.
  • a primer could hybridise to a sequence in more than one position, and sometimes a primer could hybridise to both strands of one allele and give different signals. In those cases all the different signals are combined to form one resulting signal (e.g. 'A' and 'C' together form 'M', which is the NC-IUB (NC-IUB, 1985) code for this combination).
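The extension rule and the combination of multiple signals can be sketched as below. This is a simplified model under assumptions, not the original program: the function name `signal` is hypothetical, the primer is given without its poly-T tail, the primer is compared with same-strand windows so that the signal is simply the base following the window, and the table of combined codes is the standard IUPAC/NC-IUB ambiguity alphabet.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard ambiguity codes, keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def signal(primer, sequence, max_mismatches=2, anchor=3):
    """Return the combined extension signal of `primer` on `sequence`.

    A site counts if the last `anchor` bases match exactly and the rest of
    the primer has at most `max_mismatches` mismatches; '-' means no signal.
    """
    n = len(primer)
    bases = set()
    for strand in (sequence, reverse_complement(sequence)):
        # Stop one window early so that a following base always exists.
        for start in range(len(strand) - n):
            site = strand[start:start + n]
            if site[-anchor:] != primer[-anchor:]:
                continue
            mismatches = sum(
                a != b for a, b in zip(primer[:-anchor], site[:-anchor])
            )
            if mismatches <= max_mismatches:
                bases.add(strand[start + n])
    return IUPAC["".join(sorted(bases))] if bases else "-"


primer = "ACGTTGCAAGGCT"
print(signal(primer, "ACGTTGCAAGGCTTA"))               # exact site
print(signal(primer, "ACCTTGCTAGGCTGA"))               # two mismatches tolerated
print(signal(primer, "ACGTTGCAAGGCTAACGTTGCAAGGCTC"))  # two sites combined
```

Applying `signal` to every primer and every sequence would fill the signal matrix one element at a time.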
  • the entries for each row are compared against one another, in other words for each primer the signals produced by the primer for each sequence are compared against each other.
  • a binary matrix is thus generated (18) of the primers with respect to the identity or difference of signals for pairs of sequences.
  • the binary matrix contains non-zero entries where the primer is able to distinguish between a pair of sequences.
  • the number of pairs of sequences that each primer can distinguish between is counted and a score is allocated to each primer (20) in dependence on the total number of pairs of sequences counted. Thus, the number of non-zero elements for each primer is counted.
  • Primers that are unable to distinguish between any pairs of sequences are rejected (22) and the remaining primers are sorted (24) in order of their score with the primers with the higher scores at the beginning.
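Scoring, rejection and sorting of the primers can be sketched as follows. This is a minimal sketch: the function name `rank_primers` and the example matrix are hypothetical, and ties are left in their original order rather than being broken by the number of positive signals.

```python
def rank_primers(binary):
    """Score each primer (column) and return informative primers, best first.

    binary[i][j] is 1 if primer j distinguishes sequence pair i.
    The score of a primer is the number of pairs it distinguishes.
    """
    n_cols = len(binary[0])
    scores = [sum(row[j] for row in binary) for j in range(n_cols)]
    # Primers that distinguish no pair at all are rejected.
    informative = [j for j in range(n_cols) if scores[j] > 0]
    # sorted() is stable, so equal scores keep their original order.
    return sorted(informative, key=lambda j: -scores[j])


# Hypothetical binary matrix with three pairs (rows) and four primers (columns).
binary = [
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]
print(rank_primers(binary))
```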
  • a core of primers is created next (26). The primer with the highest score is selected. Where two primers with equal scores exist, the number of positive signals is determined for each and the primer with the greater number of positive signals is chosen. If both primers remain equal, one is then selected arbitrarily over the other. After the main primer has been selected, the first twenty (five times the desired redundancy which is four here) primers giving positive signals for each sequence in turn are selected for the core. All remaining primers are rejected.
  • a greedy algorithm is then run (28) using the core set of primers to identify the minimum number of primers necessary to distinguish each sequence.
  • primers are added one at a time with each primer being selected in turn in relation to the number of uncovered rows it is capable of covering.
  • the reduced set of primers is checked for any sequences that have fewer than four positive signals and extra primers are added as necessary to meet this minimum requirement.
  • a redundancy check is then performed (30) to identify whether any more primers can be removed. During the redundancy check each primer is "tentatively" removed in turn to see whether the remaining primers meet the minimum requirements.
  • If the minimum requirements are not met, the next primer is tried. Otherwise the primer is temporarily removed from the set, and the process continues with the next primer in line. This process continues until no more primers can be removed, in which case the last primer to be removed is added back to the set, and the next primer in line tentatively removed and so on.
  • This can be viewed as a depth-first search of a tree where the nodes are combinations of primers, and the number of primers in each node is one less than in a node one level above. The root node thus contains all primers from the greedy algorithm. It has p (the number of primers after the greedy algorithm) primers in it.
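A much simplified version of the redundancy check can be sketched as below. Note the assumptions: this sketch makes a single pass in which each primer is tentatively removed and kept out if the constraint still holds, so it does not perform the backtracking depth-first search described above; the function name `prune_redundant` is hypothetical; and the only constraint checked is that every row remains covered at least `min_cover` times.

```python
def prune_redundant(matrix, selected, min_cover=1):
    """Drop selected columns whose removal keeps every row sufficiently covered.

    matrix[i][j] is 1 if column j covers row i.
    selected is the list of column indices chosen by an earlier step.
    """
    kept = list(selected)
    # Current number of selected columns covering each row.
    cover = [sum(row[j] for j in kept) for row in matrix]
    for j in list(kept):
        still_ok = all(
            cover[i] - matrix[i][j] >= min_cover for i in range(len(matrix))
        )
        if still_ok:
            kept.remove(j)
            for i in range(len(matrix)):
                cover[i] -= matrix[i][j]
    return kept


# Hypothetical binary matrix; all three columns were selected beforehand.
example = [[0, 0, 1], [1, 1, 1], [1, 1, 0]]
print(prune_redundant(example, [0, 1, 2]))
```

Because the result depends on the order in which primers are tried, a search that backtracks, as in the text, can find removals that this single pass misses.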
  • CFT a modified algorithm
  • This algorithm consists of three main phases: a subgradient phase, where a near-optimal multiplier vector is found; a heuristic phase, where a solution to the SCP is found; and column fixing, designed to improve the results of the heuristic phase.
  • a near-optimal multiplier vector u is found using Equation 8. At the beginning, the starting vector $u^0$ used is defined as
  • (Equation 9). Later calls use the last vector u before column fixing, and apply a small perturbation before using it as the starting vector.
  • the perturbation is randomly (and uniformly) distributed in the range ±10% for each element.
  • the sequence of multiplier vectors is considered to have converged when the improvement in L(u) in the last 50 iterations is smaller than 0.1%, or when the number of iterations reaches 10 × m.
  • the factor λ in Equation 8 was set to 0.1 at the beginning, and was updated as follows: every 20 iterations, the best and worst lower bounds L(u) during those 20 iterations are compared to each other. If the difference is larger than 1%, the value of λ is halved.
  • λ is multiplied by 1.5.
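The step-size update can be sketched as follows. Several details are assumptions and are labelled as such: the function name `update_step` is hypothetical; the condition under which the factor is multiplied by 1.5 is not preserved in the text, and the 0.1% threshold used here is taken from the published Caprara, Fischetti and Toth heuristic rather than from this document; the difference is measured relative to the best lower bound.

```python
def update_step(step, lower_bounds, shrink_above=0.01, grow_below=0.001):
    """Adapt the step-size factor from the lower bounds of the last iterations.

    lower_bounds holds the values L(u) seen during the last 20 iterations.
    shrink_above = 1% comes from the text; grow_below = 0.1% is an assumption.
    """
    best, worst = max(lower_bounds), min(lower_bounds)
    if best == 0:
        return step  # relative difference undefined, leave the step unchanged
    gap = (best - worst) / abs(best)
    if gap > shrink_above:
        return step / 2      # bounds oscillate too much: halve the step
    if gap < grow_below:
        return step * 1.5    # bounds barely move: enlarge the step
    return step


print(update_step(0.1, [100.0, 90.0]))   # large difference, step is halved
print(update_step(0.1, [100.0, 99.5]))   # intermediate difference, unchanged
print(update_step(0.1, [100.0, 99.99]))  # small difference, step is enlarged
```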
  • Initially, the upper bound UB used is the sum of the costs of the first primers that together cover all rows four times. Otherwise it is the value of the best solution found so far.
  • the last vector from the subgradient phase is used to generate a sequence of multiplier vectors (again using Equation 8), and a feasible solution constructed for each of the multiplier vectors.
  • the procedure used to generate a feasible solution is a variation of the greedy algorithm, where each column is scored according to
  • R is the set of uncovered rows in each step.
  • the column with the lowest q i.e. the columns with the best "gain/cost"-ratio, is added in each step to the solution. This continues until no improvements to the best solution (i.e. minimum number of primers) have been made for 50 iterations.
  • After the heuristic phase, column fixing is applied to the solution. Columns that are absolutely necessary in order for a row to be covered (i.e. if there are only e columns covering a row and each row is to be covered e times) are fixed. These fixed columns are then used as a starting point for the greedy algorithm, and the first max{⌊200/m⌋, 1} columns chosen therein are fixed as well.
  • $\delta_j = \max\{c_j(u^*), 0\} + \sum_{i} u_i^* (K_i - 1)/K_i$, the sum running over the rows $i$ covered by column $j$
  • $u_i(K_i - 1)$ is the contribution of row $i$ to the gap between the estimated lower and upper bound of the problem. This is then split uniformly between all columns in the solution covering that row. Columns with small $\delta_j$ (contributing the least to the gap) are then likely to be part of the optimal solution. The p columns with the smallest $\delta_j$ are then fixed before the entire algorithm is applied again to the resulting sub-problem. (Column fixing here has nothing to do with column fixing after the heuristic phase, so columns fixed there need no longer be fixed here.) p is the smallest value satisfying
  • the number of columns fixed in this step was also set to be at least one more than in the previous iteration (if no improvements were made). Otherwise the same number of columns would be fixed in a number of iterations before the value of ⁇ is large enough to allow more columns to be fixed.
  • the algorithm is iterated until either the value of the best solution is less than the estimated lower bound, all columns in the best solution found so far are already fixed in the refining step or a time limit is exceeded.
  • the time limit in this case was arbitrarily set to as many seconds as there were rows in the problem. However, the time limit is only checked before the refining step. If it is not exceeded, a whole iteration of the algorithm will be executed before another check is done. Here too a check was done afterwards to see if primers could be removed without breaking any constraints.
  • Although the primers were initially sorted in order of score, this need not be performed.
  • the algorithms for stripping out redundant primers are capable of operating with any order of primers including a wholly random order. However, slightly better results were obtained when ordering by score was performed.
  • the HLA sequences were available internally from Amersham Pharmacia Biotech (release December 1997), and included 91 alleles from HLA-A, 202 HLA-B, 47 HLA-C, 11 HLA-DPA1 (coding for the α-chain), 74 HLA-DPB1 (β-chain), 18 HLA-DQA1, 34 HLA-DQB1, 192 HLA-DR1 and 35 sequences in all of HLA-DR3, -DR4 and -DR5. The lengths of these sequences range from ~250 bp to ~1100 bp.
  • the 16S rRNA-sequences were collected from GenBank
  • Table 1 Details about data sets.
  • The program was written using the Microsoft® Visual C++®, version 5.0 compiler. It was executed on a PC with a Pentium® MMX 233 MHz processor, 64 MB RAM and Windows® 95, unless otherwise indicated. All execution times are for the entire program, including I/O.
  • the binary SCP matrices were quite dense. The density (i.e. the fraction of non-zero elements in the matrix) usually lies around a few percent, of course depending on the application. A higher density means that fewer columns are needed in order to cover all rows. This is offset in this case by the fact that all rows were required to be covered multiple times. Another consequence of this high density is that the number of primers needed according to the greedy algorithm could be much higher than in the optimal solution. (Recall that the worst case behaviour of the greedy algorithm is a function of the largest column-sum of elements.)
  • Table 2 Some details about the binary SCP matrix. Data are calculated for all primers in the primary set.
  • the program could be considered as consisting of two phases.
  • the first phase involves constructing all primers and finding out what kind of signal they will get for each sequence.
  • the second phase is the optimisation phase, where the SCP is solved.
  • Table 3 Number of primers in different stages of the algorithm and time to get signals for all primers.
  • the number of primers in the core is for homozygotes.
  • One explanation for this high density is that the sequences in the data sets are quite similar to each other, so that most primers will hybridise to and give a signal for more than one sequence (either the same or different signals).
  • This is also indicated in Table 3, where for some data sets there is a noticeable drop from the number of primers in the first set to the number of primers in the primary set. Most of this reduction is due to a primer having the same signal for all sequences, which in turn means that all sequences have a substring that is similar enough for the primer to hybridise to and that the nucleotide after the primer is the same for all sequences.
  • the 16S rRNA data set has a much lower density, and no reduction in the primers going from the first set of primers to the primary set.
  • sequences in this data set come from organisms which might be only distantly related to each other, there need not be as much similarity between the sequences as there is in the HLA data sets.
  • Table 4 No. of primers after the greedy algorithm and time spent by it. Also final nr. of primers after check for redundancy and the total time spent solving the SCP. *Value from a 300MHz Pentium II with 512MB RAM running Windows NT 4.0. The computation was halted before completion due to time constraints.
  • results from combining HLA sequences in order to differentiate between heterozygous individuals can be found in Table 7.
  • CFT was only used for the two smallest data sets due to the time requirements. It performed slightly better than the greedy algorithm on those, but only by one primer on each data set.
  • Table 7 Results from heterozygous pairs. Number of primers needed, the time spent, how many heterozygotes that did not differ by at least four signals from any other heterozygote and the percentage of total number of heterozygotes. * Value from a 300MHz Pentium II with 512MB
  • Table 8 Heterozygous pairs that do not differ enough in their signal patterns, and how many signals they differ with.
  • Table 9 Number of primers needed to discriminate between heterozygote HLA samples.
  • Primers can be arranged on the surface of a support in such a way that different studied types, genes, alleles, species etc. form easily recognised characters such as figures or letters. These character forming primers can be additional primers of common origin from the gene of interest and be used for validation of the process.
  • DNA: four cell lines homozygous for DQB, with alleles 0402, 0301, 06011 and 0201.
  • Amplification reagents: PCR mix from the Amersham Pharmacia Biotech HLA DQB typing kit, a prototype kit.
  • SAP will degrade (dephosphorylate) all free dNTPs and UDG will remove all dU from the DNA; after heating, the strands will be broken at these points. This step is applicable to any DNA fragment.
  • A, C, G, T amino TTT AGC CTT AAC GCC T X TGAC GTCA, where X is A, C, G or T.
  • Cy2 - ddCTP (equal to fluorescein) 50 µM Cy3 - ddATP 50 µM
  • the DQB amplification was done according to the method described by Williams et al. -96 using a 33% dUTP mix. After 40 cycles (95°C, 30 sec; 55°C, 30 sec; 72°C, 30 sec), one microliter of the PCR products was tested on a 1.5% agarose gel, before the fragmentation step.
  • the samples were frozen and stored until they were used.
  • the detection system is a total internal reflection fluorescence (TIRF) system, where microscope slides are placed on top of a prism, with oil in between to couple a laser beam into the glass slide.
  • the system has five different lasers, giving light of five different wavelengths to choose between. In this experiment only four were used.
  • TIRF total internal reflection fluorescence
  • the DNA from the four DQB homozygote cell lines was amplified according to the protocol in Williams et al. -96 with two different concentrations of dUTP. In addition to this, DNA from six different heterozygotes was amplified. All amplifications worked well and the expected 300 bp fragment was seen from all samples.
  • Primer chips were washed and fragmented PCR products were incubated on the chip according to the protocol. The image was compared to the expected pattern. The recorded pattern was similar to, but somewhat different from, the expected pattern. The reason for this is that the set-up was planned for a 500 bp fragment, but the actual fragment used was a 300 bp PCR fragment.
  • Figure 4 shows the results from a cell line homozygous for the DQB 0204 allele.
  • the pattern shown in the image is very close or similar to the expected results from exon 2.
  • Another improvement that can be made is the following: As is, the program works only with discrete signals, e.g. either there is a signal 'A' or there is not, either there is a signal 'G' or there is not and so on. A more precise approach would be to predict how strong the signals will be for each primer on each sequence. A rough estimate of the signal strength should be possible given some thermodynamic data about the primers, most notably their melting points. With this information, and knowing the concentration of DNA in the sample among other things, the proportion of primers on the chip that will actually react with the sample DNA should be possible to estimate. It would thus allow a rough estimation of what strength the different signals will have. It will not be very precise, and the estimate might possibly be off by a factor 2 or more, but it will still give some information about what signals to expect from the chip.
  • the temperature at which the reaction on the chip is carried out could be optimised as well. Since the sequences are known, it is possible to estimate the melting point of any primer to any sequence when there are a few mismatches. This could be done for all primers on all sequences, and a range of temperatures calculated. The actual temperature to use could then be chosen so as to be as optimal for as many primers on as many sequences as possible, instead of as now at a standard temperature.
  • the algorithm itself could be improved.
  • the complexity of the redundancy-check phase can be slightly reduced by having a vector consisting of the sums of the rows in each node. For each child-node, the column to be removed is then subtracted from this vector of sums. This operation can be carried out in O(m), and the final complexity will then be O(m × N(p, p)) instead.
  • For the greedy algorithm, another possible improvement is to check the primer set for redundancy each time a primer is added.
  • the complexity for the greedy algorithm will be the same, as the check will take O(m × p) (i.e. the same as each iteration in the greedy algorithm) each time (with the improvement just mentioned). The check could take longer, but that is unlikely as that would imply that one primer could make several other primers redundant.
  • the main advantage is, of course, that no redundancy check with its rather high complexity is needed afterwards.
  • This method is only capable of identifying known gene variants. If applied to a sample with a previously unknown variant, it is very probable that this new variant will be falsely identified as one of the known variants. It would be very advantageous if this method could be augmented in some way to recognise this fact, and give a warning if there could be an unknown variant in the sample. This could be done by giving a warning when the signal pattern obtained differs from the signal patterns of all known variants, but this might not be enough. There is no guarantee that the new variant does not differ only in some place not affecting any of the existing primers, which would make the new variant indistinguishable from one of the known variants. Some other way is probably needed as well.
  • TTTGCAAGTCCTCCTC ⁇ T ⁇ TCTCCTCCCGGT ⁇ TCCACAACCCGGTA ⁇ TTGGCCAGGTGGACA ⁇ TTGCGGTTCCTGGAG ⁇ TTCAGCCAGAAGGAC ⁇ TTGACTCGCCTCTGC ⁇ TTTCCAGGACTCGGC
  • TTTTTGTACAGACGC TTCGGTCTCCTTCTT TTGCAATGGGGAGCC TTTGGATCTGGATAA ⁇ ⁇ GATGAAGATGAG
  • GGTCACACCCCG GGGAGTTCCGGGC AGGAGGAGACAAC GGGTGGACACAAC TCTGCTCGGTGAC TGGGGCGGCTTGA GCGCACGTCCTCC TAGGATTTCGTGTA

Abstract

A method is described for identifying a rather small set of extendible primers for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms. A matrix of primers and pairs of primer extensions is prepared and subjected to analysis by a set covering problem algorithm, e.g. a greedy algorithm or one which involves a Lagrangian relaxation heuristic. Sets of primers are described for use in the identification, classification or typing of an organism, allele or gene selected from class 1 HLA, class 2 HLA and 16S rRNA.

Description

PRIMERS FOR IDENTIFYING TYPING OR CLASSIFYING NUCLEIC ACIDS
DNA-sequence analysis is rapidly becoming a standard tool in modern, molecular biology research. Examples of applications include: Sequencing of unknown DNA-sequences, Identifying novel genes in stretches of sequenced DNA, Predicting protein-sequence and -structure from DNA-sequence alone and Identification of known gene-variations (sometimes called "typing a gene").
Typing of a gene could be crucial in some applications. For instance, organ donation requires that the "immunological signature" of the donor matches that of the receiver. This "signature" is mediated by the Human Leucocyte Antigen (HLA) complexes (also known as Major Histocompatibility Complex, MHC) on the cell surface, and the corresponding genes are among the most varied in the human genome. Considering the importance of organ donation, the shortage of organ donors and the fact that an organ cannot be stored for any longer time periods, a rapid and accurate typing of the HLA genes is required in order to make most use of the organs available for transplantations.
Another application where a rapid and accurate identification of a gene is desired is when trying to identify unknown bacteria. A rapid identification of the bacteria causing the illness of a patient makes it possible to administer the correct medication early in the treatment of the disease, thus reducing the discomfort for the patient. Since every self-replicating organism so far studied uses ribosomes when translating mRNA to proteins, analysis of one of the genes coding for the ribosome, for instance the 16S rRNA in the case of prokaryotes, could be used to identify the organism in question.
There are several ways in which a gene can be identified, with the conceptually easiest being to sequence the entire gene and then look at the result. The main drawback is that this approach is time-consuming, and not easily scaled up using conventional methodology. A new method, Arrayed Primer Extension (APEX), lacks this drawback. APEX works by immobilising a large number of primers to a solid surface, thus creating a DNA-chip. These primers are constructed to be consecutively overlapping over the entire gene of interest, so that every base in the gene will have a primer to its 5'-end. By adding fluorescently labelled dideoxynucleotides, the primers will then be extended by one nucleotide using the sample DNA as template. It will thus be easy to check which nucleotide was incorporated, which in turn tells you the entire sequence of the sample DNA.
Since some genes, like the HLA and 16S rRNA genes, have a large number of known variations, a prohibitively large number of primers would have to be created in order to probe for all possible combinations of variant positions in the gene. For example, resequencing by APEX would need more than 16,000 primers if all DQB alleles were to be sequenced from a 500 bp long PCR fragment. If all pairwise combinations of DQB alleles had to be covered, as is required for heterozygotes (the situation in most individuals), the number of primers might be even higher. But this might not be necessary, if some variations always or never occur together. This needs to be studied, though, and a way found to determine the least number of primers (and what their sequences are) required for unambiguously identifying those genes.
An object of this invention is to find and implement an efficient algorithm capable of doing just that. The algorithm should preferably also take into account the melting points of the primers, so that the extension reaction can take place under optimal conditions for all of the primers on the chip. It should also minimise the number of "self-extended" primers, i.e. primers that can extend themselves without any sample DNA. This algorithm is then to be tested and evaluated on the HLA and 16S rRNA genes. HLA is chosen partly because of the importance of rapid typing of these genes, leading to the fact that there are many other methods to which APEX can be compared. It is also because the HLA genes are "easy" to work with, since they rarely contain any insertions or deletions. These kinds of variations in the gene could potentially create problems when designing primers for APEX. The 16S rRNA, on the other hand, contains insertions and deletions and can thus be used to see if the algorithm can handle such variations.
The invention provides a method of identifying a set of extendible primers for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms wherein: i) all possible nucleotide sequences of a chosen length of the nucleic acid are identified and their corresponding extendible primers, ii) at least one extendible primer is removed from the set wherein the at least one primer removed identifies a segment of the nucleic acid identified by at least one other primer. Preferably the method includes between step i) and ii): ia) potential extensions for each primer are identified with respect to each nucleotide sequence, ib) for each extendible primer the identified potential extensions are compared to determine which pairs of sequences can be discriminated by the primer.
Preferably a matrix of primers and pairs of primer extensions is prepared in binary form and is subjected to analysis by a set covering problem (SCP) algorithm as described in more detail below.
The invention also includes a set of extendible primers, for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms, identified by the method as defined. Preferably the primers are attached by 5'-ends to a surface of a support on which they are presented in the form of an array.
In another aspect, the invention provides a set of extendible primers, for use in the identification, typing or classification of a human leucocyte antigen (HLA) gene as indicated, the set comprising about the number of primers indicated and being capable of distinguishing about the number of alleles indicated:
Class      HLA gene   Number of alleles   Number of primers
Class I    HLA-A      91                  172
           HLA-B      200                 <1000
           HLA-C      47                  94
Class II   DPA-1      11                  26
           DPB-1      74                  130
           DQA-1      17                  130
           DQB-1      34                  84
           DRB-1      192                 <1000
           DRB345     35                  94
In another aspect, the invention provides a set of extendible primers, for use in the identification, typing or classification of 16S rRNA, wherein the set comprises about 210 primers and is capable of distinguishing at least about 1207 different sequences.
In these aspects of the invention, the approximate number of primers is indicated. As indicated below, it may be possible by the use of the algorithms exemplified or other algorithms to generate slightly smaller sets of primers capable of distinguishing the number of alleles or sequences indicated, and these sets are envisaged according to the invention. Of course, other primers may be present in addition to those indicated as essential, and may be useful for checking purposes. The number of alleles or sequences indicated represents the approximate known number of polymorphisms or different sequences, and these will surely increase with time.
In another aspect the invention provides a method of identification, typing or classification of a nucleic acid of known sequence having known polymorphisms, by the use of the set of extendible primers as defined, which method comprises applying the nucleic acid or fragments thereof to the set of extendible primers under hybridisation conditions and effecting template-directed chain extension of extendible primers that have formed hybrids. Preferably template-directed chain extension is effected using four different fluorescently labelled chain-terminating nucleotide analogues, and results are analysed by an imaging system such as total internal reflection fluorescence (TIRF) or scanning confocal microscopy. The various steps of the method may be performed as described in the literature for the known APEX technique.
In another aspect the invention provides a kit for use in the identification, typing or characterisation of a nucleic acid of known sequence having known polymorphisms, comprising the set of extendible primers as defined.
In another aspect the invention provides an array of sets of extendible primers as defined, for the simultaneous identification, typing or classification of two or more different HLA genes.
With the present invention it has been realised that where a number of different alleles are to be identified, the total number of primers required to distinguish each of the alleles could be reduced as some primers would be common to all of the alleles, for example. Thus, with the present invention complete sets of primers for identification of each allele are identified and then the total number of primers in the combined sets is reduced using predetermined rules.
Furthermore the present invention is based on the premise that as the primers are used to identify the presence or absence of a particular nucleotide sequence in any allele, the specific nucleotide that extends any particular primer is of less relevance than simply whether the primer has been extended. Thus, the problem of reducing the overall number of primers is greatly simplified rendering the problem one suitable for treatment as a Set Covering Problem (SCP).
Embodiments of the present invention will now be described by way of example with reference to the accompanying drawings and examples, in which:
Figure 1 is a diagram of a signal matrix in accordance with the present invention;
Figure 2 is a diagram of the corresponding binary matrix for the signal matrix of Figure 1;
Figure 3 is a flow diagram of the steps for reducing the primer set in accordance with the present invention.
The following is an explanation to assist in an understanding of the principles underlying the manner in which the number of primers used in the identification of a plurality of sequences may be reduced.
Theoretically the number of primers required to identify k sequences grows as O(k·l), where l is the length of the sequences, as each sequence requires l primers. However, the less the sequences differ from one another, the fewer primers are required, as many of the primers required for identification of a first sequence may also be of use in identification of another sequence. This effect becomes more pronounced the greater the number of sequences to be identified and the greater the similarities.
Considering an initial set of n primers required in the identification of k sequences, a signal matrix of k × n can be constructed. Each element in the matrix represents the signal, if any, that is generated by a particular primer with respect to a particular sequence. The signal will either be one of the four nucleotides 'A', 'C', 'G' or 'T', or no signal '-'. Figure 1 is an example of such a signal matrix where, for example, the signal generated by primer 2 with respect to sequence 3 is 'T'.
The signal matrix is then converted into a binary matrix that represents whether the signals for any particular primer differ with respect to different sequences. Thus, again with respect to primer 2, the same signal 'G' is generated for both sequences 1 and 2 but a different signal 'T' is generated with respect to sequence 3. The binary matrix is constructed by considering each column (each primer) of the signal matrix and comparing each signal in that column in turn. Thus, as shown in Figure 2, the first row of the matrix represents a comparison of the signals for the first and second sequences, the second row represents a comparison of the signals for the first and third sequences and the third row represents a comparison of the signals for the second and third sequences. Binary '0' indicates that the comparison reveals the same signal and binary '1' that it reveals different signals. In the case of primer 2, as mentioned earlier, the signals for the first and second sequences are the same ('0') whereas the signals for the first and third sequences are different ('1'). This conversion produces an m × n matrix where m = k(k-1)/2. Hence, for large numbers of sequences, m grows approximately as the square of the number of sequences (2m = k(k-1) ≈ k²). Figure 2 shows the binary matrix for the signal matrix of Figure 1. As the primers are required to enable the differentiation of sequences from one another, the reduction of the signal matrix to a binary matrix, representing differences in the signals obtained for different sequences, distils that element of information necessary to enable a selection of the minimum number of primers necessary to identify the individual sequences. From the binary matrix the least number of columns are selected such that each row contains at least one non-zero element. Thus, if one of the columns contained only '1's, only that one column would be required.
However, in the case of Figure 2, there is no single column containing only '1's and so two columns must be selected, for example primers 1 and 2. Primers 1 and 2 together enable each of sequences 1, 2 and 3 to be differentiated and so the remaining primers are redundant.
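The conversion from signal matrix to binary matrix can be illustrated by a minimal sketch. The function name `binary_matrix` and the small signal matrix used here are hypothetical and chosen only for illustration; they are not the data of Figure 1, although primer 2 follows the pattern described above ('G', 'G', 'T').

```python
from itertools import combinations

def binary_matrix(signals):
    """signals[s][p] is the signal of primer p on sequence s
    ('A', 'C', 'G', 'T' or '-' for no signal).
    Returns (pairs, rows): one row per unordered pair of sequences, holding
    a 1 wherever the primer gives different signals for the two sequences."""
    k = len(signals)
    n = len(signals[0])
    pairs = list(combinations(range(k), 2))  # m = k(k-1)/2 pairs
    rows = [[int(signals[a][p] != signals[b][p]) for p in range(n)]
            for a, b in pairs]
    return pairs, rows

# Hypothetical signal matrix: 3 sequences (rows) x 4 primers (columns).
signals = [
    ["A", "G", "C", "-"],
    ["C", "G", "C", "-"],
    ["C", "T", "C", "A"],
]
pairs, rows = binary_matrix(signals)
for pair, row in zip(pairs, rows):
    print(pair, row)
```

In this illustrative matrix no single column consists only of ones, while the first two columns together leave no row uncovered, which mirrors the situation described for Figure 2.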
Where large numbers of sequences and primers are involved, the binary matrix renders the data contained within that matrix suitable for mathematical analysis. Once the selection of the reduced number of primers has been made, though, it is the signal matrix that is required during the use of the primers in the identification of the different sequences. Thus, the signal matrix is used to 'decode' the results of any analysis using the reduced number of primers.
In practice, large numbers of sequences and primers are involved and the selection of a reduced set of primers cannot be performed by simple inspection of the binary matrix. For large numbers of primers, selection of a suitable reduced set of primers can be performed by treating the selection as a Set Covering Problem (SCP). An SCP is an integer optimisation problem and is well known in fields such as airline crew scheduling, selecting manufacturing equipment and ingot mould selection in steel production. Because the SCP is NP-hard, large instances cannot in practice be solved exactly. Global algorithms, and algorithms that identify local optima, are not very suitable on their own for a large scale SCP: they require far too much computation, as they try to find a solution that can be proven to be at least locally optimal. For this reason heuristic methods are required instead. They do not claim to give even locally optimal solutions, but are much faster.
Two known computational methods that have been found to be effective in identifying reduced sets of primers are the 'greedy' algorithm and the Lagrangian relaxation algorithm.
Greedy Algorithm
The simplest heuristic is the greedy algorithm, where columns are added one at a time. The column to be added in each step is chosen so as to cover as many uncovered rows as possible (a row is covered if it has at least one non-zero element). In other words, if $S_r$ is the set of columns already included in the solution at iteration r, and $R_r$ is the set of rows with no non-zero elements at iteration r, column $j_r^*$ is selected according to:
$$ j_r^* = \arg\min_{j \notin S_r} \frac{c_j}{P_j} $$
Equation 1
where $c_j$ is the cost of column j and $P_j$ is the number of rows in $R_r$ covered by column j.
This continues until all rows are covered, or until no more columns exist which can cover any of the rows still uncovered. Instead of minimising the term $c_j / P_j$, other terms can be used. Example terms are $c_j$, $c_j / \log_2 P_j$ or $c_j / (P_j)^2$. The difference is in how much emphasis to place on the cost of the column versus how many rows the column covers. Greedy algorithms of this type are described in "An Efficient Heuristic for Large Set Covering Problems", Vasko, Wilson, Naval Research Logistics Quarterly 1984, 31:163-171, the contents of which are incorporated herein by reference. It is shown, however, that this entire class of heuristics shares the same worst case behaviour. If we denote the set of columns in the solution as S and the solution value as Z, then the worst case behaviour can be described as:
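$$ \frac{Z_{heu}}{Z_{opt}} \le H(d) $$
Equation 2
where
$$ H(d) = \sum_{i=1}^{d} \frac{1}{i}, \qquad d = \max_{j} \sum_{i=1}^{m} a_{ij} $$
Equation 3
with $a_{ij}$ the element in row i and column j of the binary matrix.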
In other words, how much worse the heuristic solution is compared to the optimal solution depends on the maximum number of non-zero elements in the columns. The advantage is that this algorithm is fast, even though its time complexity is O(m²n): there can be a maximum of m columns in the solution, i.e. the maximum number of iterations is m, and for each iteration the matrix is traversed once to find the next column to be added. Altogether, the time required to solve the problem in the worst case scenario will grow as the number of sequences to the power of five (four due to the number of rows, and one due to the number of columns). In the case of 16S rRNA (see later), where we have ~1000 sequences, the matrix will have ~500,000 rows. The number of primers (columns) is in this case ~250,000.
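The greedy selection rule can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the results reported here: the function name `greedy_cover` is hypothetical, all columns are given unit cost unless costs are supplied, and ties are broken in favour of the lowest column index.

```python
def greedy_cover(rows, costs=None):
    """Greedy heuristic for the SCP on a binary matrix given as a list of rows.
    Returns the chosen column indices in the order in which they were added."""
    n = len(rows[0])
    costs = costs or [1.0] * n          # assumption: unit costs by default
    uncovered = set(range(len(rows)))
    chosen = []
    while uncovered:
        # P[j]: number of still-uncovered rows that column j would cover.
        P = [sum(rows[i][j] for i in uncovered) for j in range(n)]
        candidates = [j for j in range(n) if j not in chosen and P[j] > 0]
        if not candidates:
            break  # no remaining column covers any uncovered row
        best = min(candidates, key=lambda j: costs[j] / P[j])
        chosen.append(best)
        uncovered -= {i for i in uncovered if rows[i][best]}
    return chosen

# Hypothetical 3 x 4 binary matrix (rows are pairs of sequences).
example = [[1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 1]]
print(greedy_cover(example))
```

Each pass over the matrix costs O(m·n) in this sketch, and at most m passes are needed, which corresponds to the O(m²n) bound discussed above.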
Lagrangian relaxation
More sophisticated methods exist, which use other kinds of heuristics. The heuristics believed to generate the best solutions are based on some kind of Lagrangian relaxation, where in each iteration the Lagrange multipliers are used to calculate the Lagrangian cost for each column. Such a Lagrangian relaxation heuristic is described in "A Heuristic Method for the Set Covering Problem", Caprara et al., Technical Report OR-95-8, Operations Research Group, University of Bologna 1995, the content of which is incorporated herein by reference. A near optimal vector of these costs is then calculated by a subgradient algorithm, before being used as input to a greedy algorithm. This is repeated until no improvements in the solution can be made.
In Lagrangian subgradient methods the Lagrangian of the original problem is considered instead of the original problem. In this case, the Lagrangian will be
$$ L(u) = \min_{x \in \{0,1\}^n} \sum_{j=1}^{n} c_j(u)\, x_j + \sum_{i=1}^{m} u_i $$
Equation 4
where $u_i$ is the Lagrangian multiplier for row i. $c_j(u)$ is the Lagrangian cost associated with column j, and is defined by
$$ c_j(u) = c_j - \sum_{i=1}^{m} a_{ij}\, u_i $$
Equation 5
An optimal solution to Equation 4 is given by
$$ x_j(u) = \begin{cases} 0 & \text{if } c_j(u) > 0 \\ 1 & \text{if } c_j(u) < 0 \\ 0 \text{ or } 1 & \text{if } c_j(u) = 0 \end{cases} $$
Equation 6
L(u) can also be seen as an estimate of the lower bound for the solution, i.e. the sum of the costs for the columns in the optimal solution to the SCP will be ≥ L(u). The solution to the SCP can be found by finding an optimal multiplier vector u instead, but this will require much computation, especially for a large SCP. Near-optimal multiplier vectors can, however, be found within a short time by using the subgradient vector s(u), defined by
$$ s_i(u) = 1 - \sum_{j=1}^{n} a_{ij}\, x_j(u), \qquad i = 1, \dots, m $$
Equation 7
u can be refined iteratively by using for example
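$$ u_i^{k+1} = \max\left\{ u_i^{k} + \lambda\, \frac{UB - L(u^{k})}{\lVert s(u^{k}) \rVert^{2}}\, s_i(u^{k}),\ 0 \right\}, \qquad i = 1, \dots, m $$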
Equation 8
where λ > 0 is a step-size parameter and UB is an upper bound on the value of the solution. The initial u^0 can be defined arbitrarily. To solve the SCP, first a near-optimal multiplier vector u is found. This and Equation 6 are then used as a basis to form a feasible solution. The upper bound UB can then be updated to the value of this feasible solution (if it is better than the previous best solution), and a new near-optimal multiplier vector found and so on until convergence is reached.
Another alternative computational method that may be employed to solve such a SCP is 'surrogate relaxation', in which in each iteration a corresponding continuous problem is solved and made feasible before a subgradient algorithm is applied. Alternatively, genetic algorithms may be employed in which the 'genome' consists of n bits, one bit for each of the columns.
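The subgradient iteration can be sketched as follows. This is a simplified illustration, not the full heuristic of the cited technical report: it only computes a lower bound and a multiplier vector, and does not construct a feasible solution. The function name `lagrangian_bound`, the fixed step-size parameter and the fixed iteration count are assumptions made for the example, and the subgradient and multiplier update use the standard textbook forms (1 minus the number of selected columns covering a row, and a step proportional to UB - L(u) divided by the squared norm of the subgradient).

```python
def lagrangian_bound(rows, costs, ub, lam=0.1, iterations=100):
    """Subgradient search for near-optimal multipliers of the SCP relaxation.
    rows: binary matrix as a list of rows; costs: column costs;
    ub: value of a known feasible solution (upper bound).
    Returns (best lower bound found, final multiplier vector)."""
    m, n = len(rows), len(costs)
    u = [0.0] * m  # the initial multipliers may be chosen arbitrarily
    best = float("-inf")
    for _ in range(iterations):
        # Equation 5: Lagrangian cost of each column.
        c_u = [costs[j] - sum(rows[i][j] * u[i] for i in range(m))
               for j in range(n)]
        # Equation 6: select every column with negative Lagrangian cost.
        x = [1 if c < 0 else 0 for c in c_u]
        # Equation 4: value of the Lagrangian, a lower bound on the optimum.
        L = sum(c * xj for c, xj in zip(c_u, x)) + sum(u)
        best = max(best, L)
        # Subgradient: 1 minus the number of selected columns covering row i.
        s = [1 - sum(rows[i][j] * x[j] for j in range(n)) for i in range(m)]
        norm2 = sum(si * si for si in s)
        if norm2 == 0:
            break  # every row is covered exactly once; nothing to adjust
        step = lam * (ub - L) / norm2
        u = [max(0.0, ui + step * si) for ui, si in zip(u, s)]
    return best, u

# Hypothetical 3 x 4 binary matrix with unit costs; UB = 2 from a greedy solution.
example = [[1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 1]]
bound, multipliers = lagrangian_bound(example, [1.0] * 4, ub=2.0)
print(bound, multipliers)
```

Because L(u) never exceeds the optimal solution value for non-negative multipliers, the returned bound can be compared with UB to judge how far a heuristic solution may be from the optimum.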
It should also be borne in mind that as the SCP operates on the binary matrix which only represents differences in signals between sequences for the same primer, a primer in the selected reduced set may generate a negative, '-', signal rather than a positive signal, A, C, G, T. To be sure that the sample does in fact contain a particular sequence it is essential to ensure that for each sequence at least one primer generates a positive signal. Furthermore, in practice redundancy is desirable as all reactions may not occur as intended. Therefore, the least number of positive signals as well as the least number of differences in the signal pattern is preferably larger than one.
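The requirement on positive signals can be checked directly on the signal matrix, as in the following sketch. The function name `positive_signal_counts` and the example data are hypothetical; the threshold to compare against (one, or a larger value when redundancy is wanted) is left to the caller.

```python
def positive_signal_counts(signals, chosen):
    """For each sequence, count the chosen primers that give a positive
    signal (anything other than '-')."""
    return [sum(row[p] != "-" for p in chosen) for row in signals]

# Hypothetical signal matrix: 3 sequences (rows) x 4 primers (columns).
signals = [
    ["A", "G", "C", "-"],
    ["C", "G", "C", "-"],
    ["C", "T", "C", "A"],
]
counts = positive_signal_counts(signals, chosen=[0, 1])
# A sequence whose count is below the required minimum needs extra primers.
needs_more = [s for s, c in enumerate(counts) if c < 1]
print(counts, needs_more)
```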
With reference to Figure 3, the following is a description of one method of selecting a reduced set of primers.
Firstly, all possible primers are selected (10) using the standard APEX procedure to produce a first set of primers. During this selection a substring of the sequence to be analysed is used to construct one primer, then the substring is displaced by one base and another primer is constructed. This process is carried out from the start of the sequence until the entire sequence has been covered. Both strands of DNA are used and this is repeated for all sequences. The primers should be long enough to be capable of discriminating between exact matches and mismatches involving one or two nucleotide pairs. Conveniently, the primers are 13bp long as this has been found to be sufficient to ensure the reaction, or longer to increase hybrid stability. However, to avoid steric hindrance on the chip each primer may be 5'-tailed. In this example, twelve T's are added to the 5'-end of the primer so that the final length of the primers is 25bp.
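The construction of the first primer set can be sketched as a sliding window over both strands. The function names are hypothetical; the default window length of 13 and the tail of twelve T's follow the example given above. Since both strands are used, taking substrings of each strand and of its reverse complement yields the same set as building the primers complementary to each template strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def candidate_primers(sequence, length=13, tail="T" * 12):
    """All primers of the given length from both strands, each 5'-tailed."""
    primers = set()
    for strand in (sequence, reverse_complement(sequence)):
        # Displace the window by one base at a time along the strand.
        for start in range(len(strand) - length + 1):
            primers.add(tail + strand[start:start + length])
    return primers

# Hypothetical 15-base sequence: three windows per strand.
primers = candidate_primers("ACGTTGCAAGCTTAG")
print(sorted(primers))
```

For a set of sequences, the union of the sets returned for each sequence would form the first set of primers; a set is used so that primers shared between similar sequences are stored only once.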
Next all primers that are not suitable as primers are rejected (12) and the rest are included in a primary primer set. Unsuitable primers are those where the three bases at the 3'-end are complementary to any substring of the primer. In some instances this can result in the primer being extended using a neighbouring primer, and not the sample DNA, as a template, and for that reason such primers are considered unsuitable.
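A possible form of this self-extension test is sketched below. It rests on two assumptions that are not stated in the text: that "complementary" means antiparallel pairing, so that the reverse complement of the three 3'-terminal bases is searched for, and that the test is applied to whatever string is passed in (with or without the poly-T tail). The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def may_self_extend(primer):
    """True if the three bases at the 3'-end could pair with some stretch
    of an identical neighbouring primer (assumed antiparallel pairing)."""
    site = reverse_complement(primer[-3:])
    return site in primer

# Hypothetical primers: the first ends in CCG and contains CGG; the second
# ends in CCC and contains no GGG.
print(may_self_extend("ACGGATTACAGTCCG"))
print(may_self_extend("AAAAAAAAAACCC"))
```

If the poly-T tail is included in the string tested, any primer ending in 'AAA' would be flagged, so whether the tail is passed in is a design choice left open here.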
Also, any primers that would produce ambiguous signals are identified and rejected (14). A primer produces an ambiguous signal where it is not known which of the four bases is in the relevant position.
Each of the remaining primers in the primary primer set is then compared to each sequence in turn to determine whether the primer is extendible by each sequence and, if the primer is extendible, the base with which it would be extended is determined. A signal matrix of the primers with respect to each of the sequences is thus generated (16).
In order for a primer to be extended using the sample DNA as template, the three bases in the 3'-end of the primer must hybridise to the DNA. Otherwise the enzyme responsible for the extension will not be able to add a nucleotide to the primer. Of the rest of the primer (the poly-T tail excluded), at most two mismatches are allowed, otherwise the primer-DNA duplex is considered to be too unstable to be extended. In ordinary PCR, all the bases must match in order for the primer to be extended. But then the temperature is raised to the melting point, Tm, of the primer in the extension step. In APEX, this reaction is carried out at 45°C, which is around 10°-20° below Tm of most primers. This means that the primers will hybridise to the DNA despite a few mismatches, which is why two mismatches are allowed here.
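The extension rule can be sketched as follows. The function name `extension_signals` is hypothetical, and the sketch assumes that the primer (without its poly-T tail) is compared with the strand whose sequence it copies, so that the signal is the base following the hybridisation site on that strand. Only one strand is scanned; in practice the same scan would be repeated for the opposite strand.

```python
def extension_signals(primer, strand, max_mismatches=2):
    """Set of bases by which `primer` could be extended on `strand`.
    The three bases at the 3'-end must match exactly; at most
    `max_mismatches` are allowed in the rest of the primer."""
    n = len(primer)
    signals = set()
    for pos in range(len(strand) - n):  # leave room for the next base
        site = strand[pos:pos + n]
        if site[-3:] != primer[-3:]:
            continue  # 3'-end does not hybridise: no extension
        mismatches = sum(a != b for a, b in zip(primer[:-3], site[:-3]))
        if mismatches <= max_mismatches:
            signals.add(strand[pos + n])
    return signals

# Hypothetical 13-base primer and three hypothetical templates.
primer = "GATTACAGGCTCA"
exact = "TT" + primer + "GCC"                  # perfect match
two_off = "TT" + "GCTTACTGGCTCA" + "TCC"       # two internal mismatches
bad_end = "TT" + "GATTACAGGCTGA" + "GCC"       # mismatch in the 3' triplet
print(extension_signals(primer, exact))
print(extension_signals(primer, two_off))
print(extension_signals(primer, bad_end))
```

An empty set corresponds to the '-' entry of the signal matrix.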
In some cases a primer could hybridise to a sequence in more than one position, and sometimes a primer could hybridise to both strands of one allele and give different signals. In those cases all the different signals are combined to form one resulting signal (e.g. 'A' and 'C' together form 'M', which is the NC-IUB (NC-IUB, 1985) code for this combination).
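Combining several signals into one code can be sketched with a lookup table of the standard ambiguity codes. The function name `combine_signals` is hypothetical, and returning '-' for an empty set of signals is an assumption made to match the notation of the signal matrix.

```python
# Standard nucleotide ambiguity codes, keyed by the set of bases they denote.
AMBIGUITY = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}

def combine_signals(bases):
    """Single code for a collection of signals; '-' if there is none."""
    return AMBIGUITY[frozenset(bases)] if bases else "-"

print(combine_signals({"A", "C"}))
```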
For each column of the signal matrix the entries for each row are compared against one another, in other words for each primer the signals produced by the primer for each sequence are compared against each other. A binary matrix is thus generated (18) of the primers with respect to the identity or difference of signals for pairs of sequences. The binary matrix contains non-zero entries where the primer is able to distinguish between a pair of sequences.
The number of pairs of sequences that each primer can distinguish between is counted and a score is allocated to each primer (20) in dependence on the total number of pairs counted. Thus, the number of non-zero elements for each primer is counted.
Primers that are unable to distinguish between any pairs of sequences are rejected (22) and the remaining primers are sorted (24) in order of their score with the primers with the higher scores at the beginning.
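The construction of the binary matrix, the scoring, the rejection of uninformative primers and the sorting can be sketched as below. The function name `score_primers` and the data layout (a dictionary from primer to a list of signals, with `None` meaning no signal) are assumptions made for this illustration, as is the choice to treat "signal versus no signal" as a difference.

```python
from itertools import combinations

def score_primers(signals):
    """signals: dict primer -> list of signals, one per sequence (None = no signal).

    Returns (binary, ranked): `binary` maps each kept primer to a 0/1 list
    over all pairs of sequences; `ranked` lists the kept primers with the
    highest score (number of distinguishable pairs) first.
    """
    binary = {}
    for primer, row in signals.items():
        # 1 where the primer gives different signals for the two sequences.
        column = [int(a != b) for a, b in combinations(row, 2)]
        if any(column):  # reject primers that cannot separate any pair
            binary[primer] = column
    ranked = sorted(binary, key=lambda p: sum(binary[p]), reverse=True)
    return binary, ranked
```

With k sequences, each binary column has k(k-1)/2 entries, one per pair of sequences.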
A core of primers is created next (26). The primer with the highest score is selected. Where two primers with equal scores exist, the number of positive signals is determined for each and the primer with the greater number of positive signals is chosen. If both primers remain equal, one is then selected arbitrarily over the other. After the main primer has been selected, the first twenty (five times the desired redundancy which is four here) primers giving positive signals for each sequence in turn are selected for the core. All remaining primers are rejected.
A greedy algorithm is then run (28) using the core set of primers to identify the minimum number of primers necessary to distinguish each sequence. As the greedy algorithm is run, primers are added one at a time, with each primer being selected in turn in relation to the number of uncovered rows it is capable of covering. When all rows are covered at least four times, the reduced set of primers is checked for any sequence that has fewer than four positive signals, and extra primers are added as necessary to meet this minimum requirement. A redundancy check is then performed (30) to identify whether any more primers can be removed. During the redundancy check each primer is "tentatively" removed in turn to see whether the remaining primers meet the minimum requirements.
If not, the next primer is tried. Otherwise the primer is temporarily removed from the set, and the process continues with the next primer in line. This process continues until no more primers can be removed, in which case the last primer to be removed is added back to the set, and the next primer in line tentatively removed and so on. This can be viewed as a depth-first search of a tree where the nodes are combinations of primers, and the number of primers in each node is one less than in a node one level above. The root node thus contains all primers from the greedy algorithm. It has p (the number of primers after the greedy algorithm) primers in it. It also has p child-nodes (because there are p ways in which you can remove one primer from a set of p primers), each with p-1 primers. Each of them has p-1 children with p-2 primers and so on. In this way, all possible combinations of primers in the set fulfilling the requirements are found, and those combinations with the same, least number of primers are saved as the final primer sets.
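The greedy selection and the subsequent depth-first redundancy check can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, columns are represented as sets of row indices, the per-row target is capped by the number of primers that cover the row at all, and the additional check for four positive signals per sequence is omitted. The search is exponential in the worst case, so it is only practical for small primer sets.

```python
def required_coverage(columns, n_rows, redundancy=4):
    """Per-row target: `redundancy`, capped by how many primers cover the row."""
    return [min(redundancy, sum(r in rows for rows in columns.values()))
            for r in range(n_rows)]

def greedy_multicover(columns, n_rows, redundancy=4):
    """columns: dict primer -> set of rows (sequence pairs) it distinguishes."""
    need = required_coverage(columns, n_rows, redundancy)
    remaining = dict(columns)
    chosen = []
    while any(need):
        # Pick the primer covering the most rows that still need coverage.
        best = max(remaining, key=lambda p: sum(need[r] > 0 for r in remaining[p]))
        for r in remaining.pop(best):
            if need[r] > 0:
                need[r] -= 1
        chosen.append(best)
    return chosen

def is_feasible(subset, columns, required):
    count = [0] * len(required)
    for p in subset:
        for r in columns[p]:
            count[r] += 1
    return all(c >= q for c, q in zip(count, required))

def minimal_subsets(chosen, columns, required):
    """Depth-first search over sets obtained by removing one primer at a
    time; returns all feasible subsets of the smallest size found."""
    best = [frozenset(chosen)]
    seen = set()

    def visit(subset):
        nonlocal best
        if subset in seen:
            return
        seen.add(subset)
        if len(subset) < len(best[0]):
            best = [subset]
        elif len(subset) == len(best[0]) and subset not in best:
            best.append(subset)
        for p in subset:
            child = subset - {p}  # tentatively remove one primer
            if is_feasible(child, columns, required):
                visit(child)

    visit(frozenset(chosen))
    return best
```

Memoising visited subsets (`seen`) avoids re-exploring the same node, which the tree description above would otherwise reach along several paths.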
Instead of applying the greedy algorithm to the core set, a modified algorithm called CFT may be applied.
Lagrangian subgradient
This algorithm consists of three main phases: A subgradient phase where a near-optimal multiplier vector is found, a heuristic phase where a solution to the SCP is found and column-fixing, designed to improve the results of the heuristic phase. In the subgradient phase, a near-optimal multiplier vector u is found using Equation 8. At the beginning, the starting vector u° used is defined as
$$u_i^0 = \min_{j \in J_i} \frac{c_j}{|I_j|}$$

Equation 9

Later calls use the last vector u before column fixing, and apply a small perturbation before using it as the starting vector. The perturbation is randomly (and uniformly) distributed in the range ±10% for each element. The sequence of multiplier vectors is considered to have converged when the improvement in L(u) over the last 50 iterations is smaller than 0.1%, or when the number of iterations reaches 10 × m. The factor λ in Equation 8 was set to 0.1 at the beginning, and was updated as follows: every 20 iterations, the best and worst lower bounds L(u) during those 20 iterations are compared to each other. If the difference is larger than 1%, the value of λ is halved. If the difference is less than 0.1%, λ is multiplied by 1.5. In the first call, the upper bound, UB, used is the sum of the costs of the first primers that together cover all rows four times. Otherwise it is the value of the best solution found so far.

In the heuristic phase, the last vector from the subgradient phase is used to generate a sequence of multiplier vectors (again using Equation 8), and a feasible solution is constructed for each of the multiplier vectors. The procedure used to generate a feasible solution is a variation of the greedy algorithm, where each column is scored according to
Figure imgf000018_0001
Equation 10
where R is the set of uncovered rows in each step. The column with the lowest q, i.e. the column with the best "gain/cost" ratio, is added to the solution in each step. This continues until no improvements to the best solution (i.e. minimum number of primers) have been made for 50 iterations.

After the heuristic phase, column fixing is applied to the solution. Columns that are absolutely necessary in order for a row to be covered (i.e. if there are only e columns covering a row and each row is to be covered e times) are fixed. These fixed columns are then used as a starting point for the greedy algorithm, and the first max{⌊200/m⌋, 1} columns chosen therein are fixed as well.
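The first part of column fixing, identifying columns that are absolutely necessary, can be sketched as follows. The function name `essential_columns` and the representation of columns as sets of row indices are assumptions made for this illustration.

```python
def essential_columns(columns, n_rows, e):
    """Return the columns that must be in every solution: those covering a
    row that is covered by exactly `e` columns when `e` covers are required.

    columns: dict name -> set of row indices covered by that column.
    """
    covering = {r: [] for r in range(n_rows)}
    for name, rows in columns.items():
        for r in rows:
            covering[r].append(name)
    fixed = set()
    for names in covering.values():
        if len(names) == e:  # no slack: every one of these columns is needed
            fixed.update(names)
    return fixed
```

A row covered by more than e columns leaves a choice open, so none of its columns is fixed on account of that row.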
These three phases are then applied again to the problem, with the condition that the fixed columns must be included in the solution this time. Columns already fixed in a previous round cannot be removed from the solution. This goes on until either all rows are covered by the fixed columns, or the cost of the fixed columns is larger than the estimated lower bound for the entire problem, or no new columns were fixed in the last iteration.
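The rule for updating the step-size factor λ and the ±10% perturbation of the starting vector, both described for the subgradient phase, can be sketched as follows. The function names are hypothetical, and measuring the difference between the best and worst bounds relative to the best bound is an assumption, since the text does not say what the percentages refer to.

```python
import random

def update_step(lam, recent_bounds):
    """Adjust the step-size factor from the lower bounds L(u) of the last
    20 iterations. Assumption: the difference is taken relative to the
    best bound, which must be non-zero."""
    best, worst = max(recent_bounds), min(recent_bounds)
    diff = (best - worst) / abs(best)
    if diff > 0.01:    # bounds oscillate too much: halve the step
        return lam / 2
    if diff < 0.001:   # bounds barely move: enlarge the step
        return lam * 1.5
    return lam

def perturb(u, rng=random):
    """Perturb each multiplier uniformly within +/-10 %."""
    return [ui * (1 + rng.uniform(-0.1, 0.1)) for ui in u]
```

Passing a seeded `random.Random` instance as `rng` makes the perturbation reproducible, which may help when comparing runs.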
When the three phases are done, the problem is refined, in order to improve the solution. Here, each column in the best solution found so far is scored according to
$$\delta_j = \max\{c_j(u^*),\, 0\} + \sum_{i \in I_j} u_i^* \, \frac{K_i - 1}{K_i}$$

Equation 11

where
Figure imgf000019_0001
Equation 12
and S is the set of columns in the solution. The term u_i(K_i - 1) is the contribution of row i to the gap between the estimated lower and upper bound of the problem. This is then split uniformly between all columns in the solution covering that row. Columns with small δ_j (contributing the least to the gap) are then likely to be part of the optimal solution. The p columns with the smallest δ_j are then fixed before the entire algorithm is applied again to the resulting sub-problem. (Column fixing here has nothing to do with column fixing after the heuristic phase, so columns fixed there need no longer be fixed here.) p is the smallest value satisfying
$$\left| \bigcup_{k=1}^{p} I_{j_k} \right| \geq \pi \cdot e \cdot m$$

Equation 13
where {j_k} is the set of columns in the solution ordered with ascending δ_j, and I_j is the set of rows covered by column j. π is in the range 0...1 and controls the percentage of rows removed after fixing: π = 1 means that no rows will be uncovered, while π = 0 means that no columns will be fixed before reapplying the algorithm. (Since each row has to be covered multiple times, in this case it is not actually the number of rows but the number of elements covering the rows that is regulated by π.) In the beginning, π is set to 0.3, and it is multiplied by 1.1 if the best solution so far was not improved in the last application of the three main phases. If a better solution was found, π is reset to 0.3. Because of the density of the matrices, the number of columns fixed in this step was also set to be at least one more than in the previous iteration (if no improvements were made). Otherwise the same number of columns would be fixed in a number of iterations before the value of π is large enough to allow more columns to be fixed.
The algorithm is iterated until either the value of the best solution is less than the estimated lower bound, all columns in the best solution found so far are already fixed in the refining step or a time limit is exceeded. The time limit in this case was arbitrarily set to as many seconds as there were rows in the problem. However, the time limit is only checked before the refining step. If it is not exceeded, a whole iteration of the algorithm will be executed before another check is done. Here too a check was done afterwards to see if primers could be removed without breaking any constraints.
With this algorithm no pricing is performed. Pricing is used to update the core problem, exchanging columns between the core problem and columns outside the core. It was not included here because, since the costs of the columns are all the same, the best columns would be those with the largest number of non-zero elements. These would be the first columns to be added to the core, and the columns not included in the core would most probably not be better than those included. Also, the pricing step would require some computation, which would extend the time required by this algorithm. As is, the computational requirement of this algorithm is several orders of magnitude higher than for the greedy algorithm. Finally, the main memory available in the computer puts a limit on how large the problems can be. If pricing were included, all data would not fit into the physical memory, forcing the computer to use a swap-file, which would increase the computation times considerably.
Using both alternative algorithms described above a minimum number of primers were identified for various sequences. The results are set out below. It will be apparent that the initial manual rejection of primers, steps (12, 14 and 22) need not be performed and instead the algorithms can be applied to the original complete set of primers. However, the initial rejection of obvious failed primer candidates can significantly reduce the computational time required in the later stages. Similarly, in many cases the final redundancy check (30) need not be performed as in many cases little or no reduction in the number of primers was achieved by this final check.
Furthermore, although in the method described above the primers were initially sorted in order of score, this need not be performed. The algorithms for stripping out redundant primers are capable of operating with any order of primers including a wholly random order. However, slightly better results were obtained when ordering by score was performed.
Collecting sequences
The HLA sequences were available internally from Amersham Pharmacia Biotech (release December 1997), and included 91 alleles from HLA-A, 202 HLA-B, 47 HLA-C, 11 HLA-DPA1 (coding for the α-chain), 74 HLA-DPB1 (β-chain), 18 HLA-DQA1, 34 HLA-DQB1, 192 HLA-DR1 and 35 sequences in all of HLA-DR3, -DR4 and -DR5. The lengths of these sequences range from ~250 bp to ~1100 bp. The 16S rRNA sequences were collected from GenBank (Benson et al., 1998), an annotated database of all publicly available DNA sequences. Only a subset of all the available 16S rRNA sequences was used. The sequences used were all from organisms that could be identified using either the MicroLog or the MicroStation system from Biolog Inc., or the API systems from Counterpart Diagnostics. These systems utilise differences in metabolism in order to identify the organisms, which is the most common way of identifying micro-organisms today. Altogether, 1207 sequences from 523 different organisms were collected from GenBank. 269 of those 523 organisms had only one 16S rRNA sequence among those 1207 sequences. The lengths of these sequences are between ~1000 bp and ~1500 bp.
Data set    No. of sequences    Mean sequence length (bp)
DPA1        11                  517
DPB1        74                  288
DQA1        17                  616
DQB1        34                  490
DRB1        192                 324
DRB345      35                  400
HLA-A       91                  944
HLA-B       200                 900
HLA-C       47                  1003
16S rRNA    1207                1452
Table 1: Details about data sets.

The program was written using the Microsoft® Visual C++® compiler, version 5.0. It was executed on a PC with a Pentium® MMX 233 MHz processor, 64 MB RAM and Windows® 95, unless otherwise indicated. All execution times are for the entire program, including I/O.

As can be seen in Table 2, the binary SCP matrices were quite dense. Usually, the density (i.e. the proportion of non-zero elements in the matrix) of an SCP matrix lies around a few percent, depending of course on the application. A higher density means that fewer columns are needed in order to cover all rows. This is offset in this case by the fact that all rows were required to be covered multiple times. Another consequence of this high density is that the number of primers needed according to the greedy algorithm could be much higher than in the optimal solution. (Recall that the worst case behaviour of the greedy algorithm is a function of the largest column-sum of elements.)
Dataset       DPA1    DPB1    DQA1    DQB1    DRB1    DRB345   HLA-A   HLA-B   HLA-C   16S rRNA
No. rows      55      2701    136     561     18336   595      4095    19900   1081    727821
Density (%)   47.89   20.73   36.31   42.18   24.98   37.70    36.31   32.33   30.41   2.04
Table 2: Some details about the binary SCP matrix. Data are calculated for all primers in the primary set.
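The row counts in Table 2 follow directly from the sequence counts in Table 1, since the binary matrix has one row per pair of sequences. The short check below reproduces them; the dictionary name `n_sequences` is only an illustrative choice.

```python
from math import comb

# Number of sequences per data set, copied from Table 1.
n_sequences = {
    "DPA1": 11, "DPB1": 74, "DQA1": 17, "DQB1": 34, "DRB1": 192,
    "DRB345": 35, "HLA-A": 91, "HLA-B": 200, "HLA-C": 47, "16S rRNA": 1207,
}

# One row per unordered pair of sequences: k(k-1)/2.
rows = {name: comb(k, 2) for name, k in n_sequences.items()}

for name, count in rows.items():
    print(f"{name:10s} {count}")
```

The quadratic growth in the number of rows is what makes the 16S rRNA data set so much larger than the HLA data sets.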
The program can be considered as consisting of two phases. The first phase involves constructing all primers and finding out what kind of signal they will give for each sequence. The second phase is the optimisation phase, where the SCP is solved. Some details about the first phase can be found in Table 3.
Dataset       DPA1    DPB1    DQA1    DQB1    DRB1    DRB345   HLA-A    HLA-B    HLA-C   16S rRNA
First set     1747    1885    2487    2891    3891    3031     4756     4994     4293    247877
Primary set   1333    1475    2166    2730    3651    3016     3886     4585     3354    247877
Core set      106     321     213     244     385     203      595      750      338     2377
Time (s)      4.67    6.81    11.26   18.51   42.29   14.56    124.74   286.82   61.29   150632
Table 3: Number of primers in different stages of the algorithm and time to get signals for all primers. The numbers of primers in the core are for homozygotes.

One explanation for the high density is that the sequences in the data sets are quite similar to each other, so that most primers will hybridise to and give a signal for more than one sequence (either the same or different signals). This is also indicated in Table 3, where for some data sets there is a noticeable drop from the number of primers in the first set to the number of primers in the primary set. Most of this reduction is due to a primer having the same signal for all sequences, which in turn means that all sequences have a substring that is similar enough for the primer to hybridise to, and that the nucleotide after the primer is the same for all sequences. In contrast, the 16S rRNA data set has a much lower density, and no reduction in the number of primers going from the first set to the primary set. As the sequences in this data set come from organisms which might be only distantly related to each other, there need not be as much similarity between the sequences as there is in the HLA data sets.

Another explanation is this: if all k sequences except one give the same signal for a primer, that column in the binary SCP matrix will have k-1 non-zero elements. The density (for that column) will then be (k-1) / (k(k-1)/2) = 2/k. In other words, the density will be higher for smaller values of k, and lower for larger values. This means that it would be "natural" for smaller matrices to have higher densities, and larger matrices to have lower densities.
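As a quick numeric illustration of the 2/k argument, the smallest and largest data sets from Table 1 can be plugged in. The figures below are the densities of a single column in the "all but one" case, not the overall matrix densities reported in Table 2.

```latex
\frac{k-1}{k(k-1)/2} = \frac{2}{k}, \qquad
k = 11 \ (\text{DPA1}):\ \frac{2}{11} \approx 18.2\,\%, \qquad
k = 1207 \ (\text{16S rRNA}):\ \frac{2}{1207} \approx 0.17\,\%
```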
In the second phase, solving the SCP, a few different approaches were tried. The results, the minimum number of primers needed and the time required to find this number, can be found in Table 4 and Table 5. Even though the worst case behaviour of the greedy algorithm is not so good in this application, the results are not much worse than when using a Lagrangian subgradient (CFT) method. The greedy algorithm typically needs two or three more primers, while the computation times are much lower for the greedy algorithm. The results show that it is worthwhile to check the results from the greedy algorithm for redundancy. In all cases except one primers could be removed and the resulting primer sets still fulfil all requirements. This is not true for the CFT algorithm, however, as there is only one instance in which the result could be improved. On the other hand, since there is some randomness in the CFT algorithm (an old multiplier vector is disturbed randomly before being used as a starting vector in the next iteration), the results can differ from one execution of the algorithm to another. Sometimes the results can be improved, and sometimes not (results not shown).
Dataset     DPA1   DPB1   DQA1   DQB1   DRB1   DRB345   HLA-A   HLA-B    HLA-C   16S rRNA
Greedy      11     42     32     31     48     24       73      103      51      210
Time (s)    0.27   1.37   0.61   0.71   11.5   0.66     4.61    31.36    1.15    9921.48*
Final       11     41     30     29     44     21       72      99       47      197^
Total (s)   0.27   1.81   0.72   0.88   30.3   0.71     6.48    85.14    1.76    >300000^
Table 4: No. of primers after the greedy algorithm and time spent by it. Also final no. of primers after check for redundancy and the total time spent solving the SCP. *Value from a 300MHz Pentium II with 512MB RAM running Windows NT 4.0. ^The computation was halted before completion due to time constraints.
Dataset     DPA1    DPB1      DQA1    DQB1     DRB345   HLA-A     HLA-C
CFT         10      38        26      27       20       69        47
Time (s)    10.22   2748.92   60.80   372.56   427.32   4547.33   1091.37
Final       10      38        26      27       20       69        45
Total (s)   10.22   2749.14   60.86   372.61   427.38   4548.49   1111.70
Table 5: Results using modified algorithm CFT.
One reason CFT is not much better than the greedy algorithm could be that it was designed for other instances of SCP. The SCP arising in this application differs from those in three respects: A) the density is much higher, B) all rows are to be covered multiple times and C) the costs of all columns are the same.
A comparison was made between the results from the greedy algorithm and from CFT in Table 6. Most of the primers (70% or more) were chosen by both algorithms, indicating that these primers are likely to be part of an optimal solution. However, this is only an indication as the only way to prove this is to find an optimal solution. This will require far too much time even for the smallest data set as the problem is NP-hard.
Dataset       DPA1    DPB1    DQA1    DQB1    DRB345   HLA-A   HLA-C
Greedy        11      41      30      29      21       72      47
CFT           10      38      26      27      20       69      48
Same          7       33      22      22      14       62      38
Percent (%)   70.00   86.84   84.62   81.48   70.00    89.86   80.85
Table 6: Comparison of primers from the two different algorithms.
Results from combining HLA sequences in order to differentiate between heterozygous individuals can be found in Table 7. CFT was only used for the two smallest data sets due to the time requirements. It performed slightly better than the greedy algorithm on those, but only by one primer on each data set. There are heterozygotes that cannot be distinguished from another heterozygote, which can be seen in Table 7. This happens because the combination of two sequences to form one heterozygote could result in exactly the same signal pattern as another combination of homozygotes. In other words, some rows in the signal matrix will be the same, leading to some rows in the binary SCP matrix not containing any non-zero elements at all. For some of the pairs listed this is not true, however. They are listed because there were not enough primers that have different signals for these pairs, and so they could not meet the requirement of at least four different signals in the signal patterns (Table 8). For the rest, it is simply a limitation of this technique for typing HLA genes. To be able to identify the alleles forming each heterozygote, primers that amplify alleles selectively should be used in the PCR step. This will remove the ambiguities, as some heterozygotes will simply be transformed to homozygotes, since only one of the alleles in the heterozygote will be amplified and not the other.
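How two allele signal patterns combine into a heterozygote pattern, and how indistinguishable heterozygotes arise, can be sketched as follows. The function names and data layout are assumptions made for this illustration; signals are merged per primer into NC-IUB codes, and two heterozygotes are reported when their patterns differ in fewer than `min_diff` primers.

```python
from itertools import combinations

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}
DECODE = {code: bases for bases, code in IUPAC.items()}

def combine(sig_a, sig_b):
    """Merge the signals of two alleles for one primer (None = no signal)."""
    bases = set()
    for sig in (sig_a, sig_b):
        if sig is not None:
            bases.update(DECODE[sig])
    return IUPAC["".join(sorted(bases))] if bases else None

def heterozygote_patterns(allele_signals):
    """allele_signals: dict allele -> list of signals, one per primer."""
    return {
        (a, b): tuple(combine(x, y)
                      for x, y in zip(allele_signals[a], allele_signals[b]))
        for a, b in combinations(sorted(allele_signals), 2)
    }

def ambiguous_pairs(patterns, min_diff=4):
    """Pairs of heterozygotes differing in fewer than `min_diff` primers."""
    result = []
    for (h1, p1), (h2, p2) in combinations(patterns.items(), 2):
        if sum(x != y for x, y in zip(p1, p2)) < min_diff:
            result.append((h1, h2))
    return result
```

For example, with two primers, alleles giving (A, C) and (C, A) combine to the pattern (M, M), which is exactly the pattern obtained from alleles giving (A, A) and (C, C), so those two heterozygotes cannot be told apart by these primers.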
Figure imgf000027_0001
Table 7: Results from heterozygous pairs. Number of primers needed, the time spent, how many heterozygotes did not differ by at least four signals from any other heterozygote and the percentage of total number of heterozygotes. *Value from a 300MHz Pentium II with 512MB RAM running Windows NT 4.0.
Unfortunately, it was not possible to obtain any results for heterozygotes for the data sets DRB1 and HLA-B, as these were too large to run on existing machines. A very approximate extrapolation of the primers needed for these data sets suggests that the total number of primers for all HLA sets together would be <1000, which can be placed on one chip without problem (one chip can contain up to ~5000 primers). Without the reduction obtained above, at most two genes could be tested on each chip. With the reduction, all nine HLA genes and the 16S rRNA gene can be tested on one chip, with plenty of room to spare for other genes as well. This makes APEX more versatile, as it allows a family of related genes to be tested using only one chip instead of several.
Figure imgf000028_0001
Table 8: Heterozygous pairs that do not differ enough in their signal patterns, and how many signals they differ with.
The results of this work are summarised in the following Table 9.

Class I    Number of alleles    Primers needed    Class II    Number of alleles    Primers needed
HLA-A      91                   172               DPA1        11                   26
HLA-B      200                  <1000             DPB1        74                   130
HLA-C      47                   94                DQA1        17                   51
                                                  DQB1        34                   84
                                                  DRB1        192                  <1000
                                                  DRB345      35                   94
Table 9. Number of primers needed to discriminate between heterozygote HLA samples.
Some sets of primers indicated in Table 9, and also the set indicated for 16S rRNA, are set out in appendix 2.
Primers can be arranged on the surface of a support in such a way that different studied types, genes, alleles, species etc. form easily recognised characters such as figures or letters. These character forming primers can be additional primers of common origin from the gene of interest and be used for validation of the process.
The following demonstration is based on the HLA Class II DQB gene.
Experimental
Materials
Amplification:
DNA: Four cell lines homozygous for DQB, with alleles 0402, 0301, 06011 and 0201.
Primers: Primer DQB 9246 from Williams et al. -96 and DQB 96012 from the Amersham Pharmacia Biotech HLA DQB typing kit, covering exon 2, regenerating a fragment of 300 base pairs.
Amplification reagents: PCR mix from the Amersham Pharmacia Biotech HLA DQB typing kit, a prototype kit.
All amplifications were spiked with dUTP, to get a final concentration of 100 or 200 mM dUTP.
Enzymes for fragmentation of PCR products: Shrimp alkaline phosphatase (SAP), 1 U/μl, APB. Uracil-DNA-glycosylase (UDG; if from PE, UDG = UNG), 1 U/μl, NE Biolabs.
SAP will degrade (dephosphorylate) all free dNTPs and UDG will remove all dU from the DNA and after heating the strands will be broken at these points. This step is applicable to any DNA fragment.
Primers for spotting:
All 84 primers for the 500 bp fragment were ordered from LTI/GIBCO BRL Custom primers service. All were 25-mers with an amino-activated 5'-end. For primer sequences see appendix 1. Self-extended primers N, A, C, G and T were used as controls, with the following sequences:
N: amino TTT AGC CTT AAC GCC T N TGAC GTCA
A, C, G, T: amino TTT AGC CTT AAC GCC T X TGAC GTCA, where X is A, C, G or T.
Extension reagents for the APEX reaction
Dyes: Specially synthesised for Baylor by Du Pont and/or APB
Cy2 - ddCTP (equal to fluorescein) 50 μM
Cy3 - ddATP 50 μM
Texas Red - ddGTP 50 μM
Cy5 - ddUTP (often written as T in many of the reactions and results) 50 μM
10x ThermoSequenase™ DNA polymerase buffer (TS): 260 mM Tris-HCl pH 9.5; 65 mM MgCl2.
ThermoSequenase DNA polymerase (Amersham Pharmacia Biotech) 4 U/μl; if needed, dilute with TS dilution buffer (10 mM Tris-HCl pH 8.0; 1 mM β-mercaptoethanol; 0.5% Tween-20 (v/v); 0.5% Nonidet P-40 (v/v)). TS was used from a 150 unit stock and diluted 1 μl + 37 μl dilution buffer.
Methods
Preparation of glass slides before spotting of primer:
Arrange 25-30 cover slips (24 x 60 mm) in a stainless staining tray.
Immerse the tray in a glass staining dish filled with acetone so that the slides are fully immersed.
Place the glass staining dish in a sonicator for 10 minutes.
Remove the tray from the acetone bath, shake off excess acetone and rinse several times (at least twice) in MilliQ water.
Immerse the tray in 100 mM NaOH and sonicate for 10 minutes (a few more minutes is no problem).
Remove the tray, shake off excess NaOH and rinse several times (at least twice) in MilliQ water.
Immerse the tray in silane solution and sonicate for 2 minutes.
Wash the slides by immersion in 100% EtOH once.
Dry the tray with the slides using a high-velocity stream of nitrogen (without breaking the slides).
Cure the slides in a vacuum oven at 100°C over night or until they are used for spotting (at least 20 minutes vacuum is needed).
Spotting of oligos:
All spotting was done with a spotter with 96 parallel capacity.
Each slide was spotted with three replicas of the primers. After spotting, the slides were allowed to air dry for 5 to 15 minutes; when dry they were marked. They were stored at room temperature, in a dry place, in the trays until used.
DQB amplification
The DQB amplification was done according to the method described by Williams et al. -96 using a 33% dUTP mix. After 40 cycles (95°C, 30 sec; 55°C, 30 sec; 72°C, 30 sec), one microliter of the PCR products was tested on a 1.5% agarose gel, before the fragmentation step.
Williams, Bassinger, Moehlenkamp, Wu, Montoya, Griffith, McAuley, Goldman, Maurer: Strategy for distinguishing a new DQB1 allele (DQB1*0611) from the closely related DQB1*0602 allele. Tissue Antigens, 1996, 48:143-147.
Fragmentation of PCR products:
Before APEX can be done all DNA fragments must be fragmented so all new fragments can get access to the primer on the chip.
Set up:
5 μl DNA from a PCR reaction (1/10 of the PCR reaction)
2 μl SAP (Shrimp alkaline phosphatase) 1 U/μl APB
1 μl UDG (Uracil-DNA-glycosylase) 1 U/μl NE Biolabs
15 μl water
Total: 23 μl
Incubate at 37°C for 2 hours.
The samples were frozen and stored until they were used.
Inactivation of the enzymes at 100°C for 10 minutes can be done, but is not needed since this is the first step in the APEX reaction.
Extension method for the APEX reaction
Slide treatment:
Start by washing the slides in hot water (90 - 98°C, not boiling) for 2 x 5 minutes in a 50 ml Falcon tube. When the slides are ready, remove them from the tube with forceps and place them on a dry heater block at 48°C. The slide (= DNA chip) is now ready for the reactions to be added.
APEX reactions set up:
23 μl DNA from the fragmentation step.
3 μl 10x TS reaction buffer (the rest of the buffer comes from PCR and UDG cleavage)
17 μl for cover slip method.
Heat denature at 100°C for 7 - 10 minutes, target 8 minutes, not longer.
Spin the tube quickly and add quickly
1 μl ThermoSequenase DNA polymerase (4U)
1 μl Dye-mix (50 μM of the four dideoxynucleotides A, C, G, and T, separately dye labelled).
The reaction mix was then physically spread out over the primer array with a pipette tip. Incubate at 48°C until no trace of solution is seen. This takes about 8 minutes.
Wash with hot water for 2 - 5 minutes, 2 times. Ready to read on detection instrument.
Detection
The detection system is a total internal reflection fluorescence (TIRF) system, where microscope slides are placed on top of a prism with oil to couple a laser beam into the glass slide. The system can switch between light of five different wavelengths from five different lasers. In this experiment only four were used. To detect Cy2 a 488 nm laser was used, for Cy3 a 532 nm laser, for Cy5 a 635 nm laser and for Texas Red a 670 nm laser. The image-related software was based on Image Pro Plus 3.0.
Results
Amplification of HLA DQB alleles
The DNA from the four DQB homozygote cell lines was amplified according to the protocol in Williams et al. -96 with two different concentrations of dUTP. In addition to this, DNA from six different heterozygotes was amplified. All amplifications worked well and the expected 300 bp fragment was seen from all samples.
APEX reaction with DQB chip
Primer chips were washed and fragmented PCR products were incubated on the chip according to the protocol. The image was compared to the expected pattern. The recorded pattern was similar to, but somewhat different from, the expected pattern. The reason for this is that the set-up was planned for a 500 bp fragment, whereas the fragment actually used was a 300 bp PCR fragment.
Homozygous cell line results
Figure 4 shows the results from a cell line homozygous for the DQB 0204 allele. The pattern shown in the image is very close or similar to the expected results from exon 2.
In all reactions the control primers worked well and the four dyes were used in the same frequencies. In the case with a 500 bp fragment for DQB typing, the primers for allele 0402 were placed in such a way that they formed figures. In Figure 4, panel D, most signals are seen forming a "2" from the 300 bp fragment, and the missing signal will be seen when the large PCR fragment is used. This clearly shows that primers can be placed in a clever way to form figures.
Heterozygous results
For the heterozygous test only one of the four dye reactions worked. Some of the expected spots from the heterozygous sample were not seen, but this is probably due to the fact that no control signals were seen in the lower right hand corner, where the signals were weaker than in other parts of the slide.
As this experiment shows, a limited number of primers can be used for HLA typing and if they are placed in a clever way the interpretation of the results is very simple. Both homozygous and heterozygous samples can be correctly analysed with this method.
Continuation
An algorithm was developed in order to select the minimum number of primers needed to identify different genes using APEX. It was applied to the following HLA genes: HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRB1 and HLA-DRB345. It was also applied to the 16S rRNA gene. In the case of HLA-DQB1, the primers have been shown to work as intended. As is, a few assumptions were made (such as how many mismatches to allow between the primers and the sample DNA) that need to be tested and possibly refined.
Another improvement that can be made is the following: As is, the program works only with discrete signals, e.g. either there is a signal 'A' or there is not, either there is a signal 'G' or there is not and so on. A more precise approach would be to predict how strong the signals will be for each primer on each sequence. A rough estimate of the signal strength should be possible given some thermodynamic data about the primers, most notably their melting points. With this information, and knowing the concentration of DNA in the sample among other things, the proportion of primers on the chip that will actually react with the sample DNA should be possible to estimate. It would thus allow a rough estimation of what strength the different signals will have. It will not be very precise, and the estimate might possibly be off by a factor 2 or more, but it will still give some information about what signals to expect from the chip.
Given the melting points of the primers, the temperature at which the reaction on the chip is carried out could be optimised as well. Since the sequences are known, it is possible to estimate the melting point of any primer to any sequence when there are a few mismatches. This could be done for all primers on all sequences, and a range of temperatures calculated. The actual temperature to use could then be chosen so as to be as optimal for as many primers on as many sequences as possible, instead of as now at a standard temperature.
Another possibility would be to try other heuristics to solve the resulting SCP. Even though CFT does give better results than the greedy algorithm, it is not by much. It could be that Lagrangian relaxation methods really are not suitable for unicost problems, but the only way to find out is to try heuristics based on other ideas. It might be possible to reduce the binary SCP-matrix as well, before applying any heuristic on it. Some rows in the matrix could end up the same, in which case one of them could be removed in order to reduce the number of rows and thus speed up computation. No figures of how many rows might be the same exist, but it could be worthwhile examining this possibility to reduce problem size.
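The suggested reduction of the binary SCP matrix by merging identical rows can be sketched as follows. The function name `drop_duplicate_rows` and the row-major list layout are assumptions for this illustration. Identical rows are covered by exactly the same primers, so keeping one representative does not change which primer sets are feasible.

```python
def drop_duplicate_rows(matrix):
    """matrix: list of rows, each a sequence of 0/1 entries (one per primer).

    Returns the distinct rows in first-seen order and, for each kept row,
    how many original rows it stands for.
    """
    counts = {}
    for row in matrix:
        key = tuple(row)
        counts[key] = counts.get(key, 0) + 1
    return [list(key) for key in counts], list(counts.values())
```

The returned multiplicities indicate how much the problem shrank, which would provide the missing figures on how many rows coincide.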
The algorithm itself could be improved. The complexity of the redundancy-check phase can be slightly reduced by keeping, in each node, a vector consisting of the sums of the rows. For each child-node, the column to be removed is then subtracted from this vector of sums. This operation can be carried out in O(m), and the final complexity will then be O(m × N(p, p)) instead. For the greedy algorithm, another possible improvement is to check the primer set for redundancy each time a primer is added. The complexity of the greedy algorithm will be the same, as the check will take O(m × p) each time (i.e. the same as each iteration of the greedy algorithm, given the improvement just mentioned). The check could take longer, but that is unlikely, as it would imply that one primer could make several other primers redundant. The main advantage is, of course, that no redundancy check with its rather high complexity is needed afterwards.
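The idea of keeping a vector of row sums can be illustrated with a simplified, single-pass redundancy check. This sketch does not reproduce the node-by-node search referred to above; it only shows how a per-row coverage count allows one primer to be tested and removed in time proportional to the number of rows. The function name and the column representation are assumptions made for the sketch.

```python
def remove_redundant(selected, columns, num_rows):
    """Drop primers whose rows are all covered by other selected primers.

    selected is a list of column ids, columns maps each id to the set of
    row indices (0 .. num_rows - 1) it covers.
    """
    # Row sums over the selected columns only.
    counts = [0] * num_rows
    for c in selected:
        for r in columns[c]:
            counts[r] += 1

    kept = list(selected)
    for c in selected:
        # A column is redundant if each of its rows is covered at least twice.
        if all(counts[r] >= 2 for r in columns[c]):
            for r in columns[c]:
                counts[r] -= 1  # subtract the removed column from the sums
            kept.remove(c)
    return kept
```

The result depends on the order in which primers are examined, which is one reason a fuller search over removal orders can find smaller sets at a higher cost.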
The most serious problem is the sheer size of the problems. For the 16S rRNA data set, around 300 MB is required just to store all the primers and their signals. Add to that the fact that all primers need to be traversed once for every iteration of the greedy algorithm, and the result is that it will take quite some time as well. This also means that it is not even feasible to use more elaborate algorithms such as the CFT algorithm on the 16S rRNA data set, unless a much more powerful computer is available. On the other hand, algorithm CFT would probably benefit quite a lot from a parallel computer, since much of the computation could be carried out as vector operations. It should then be possible to spread the computations over several processors, thus reducing the time required. It would also reduce the memory requirements on each processor (but then parallel computers tend to have enough memory to store all necessary data for this problem on each processor anyway). Even the greedy algorithm would benefit from a parallel computer, as each processor can be charged with the task of scoring only a subset of primers. It is not as critical in this case, though, since the computation times are not very high when using the greedy algorithm.
As is, this method is only capable of identifying known gene variants. If applied to a sample with a previously unknown variant, it is very probable that this new variant will be falsely identified as one of the known variants. It would be very advantageous if this method could be augmented in some way to recognise this fact, and give a warning if there could be an unknown variant in the sample. It could be done by giving a warning when the signal pattern gained differs from the signal pattern from any known variants, but this might not be enough. There is no guarantee that the new variant could not differ in some place not affecting any of the existing primers, which would lead to the new variant being indistinguishable from any of the known variants. Some other way is probably needed as well.
APPENDIX 1
Primer sequences for DQB1 heterozygote typing
Primers 'dqb1-1' to 'dqb1-8' placed in positions A3-A10. Primers 'dqb1-9' to 'dqb1-18' placed in positions B2-B11. Primers 'dqb1-19' to 'dqb1-30' placed in positions C1-C12. Primers 'dqb1-31' to 'dqb1-42' placed in positions D1-D12. Primers 'dqb1-43' to 'dqb1-54' placed in positions E1-E12. Primers 'dqb1-55' to 'dqb1-66' placed in positions F1-F12. Primers 'dqb1-67' to 'dqb1-76' placed in positions G2-G11. Primers 'dqb1-77' to 'dqb1-84' placed in positions H3-H10.
dqb1-1   NH2 - TCC ATC ACA GGA GTC AGA AAG GGC T
dqb1-2   NH2 - GTG TGC AGA CAC AAC TAC GAG GTG G
dqb1-3   NH2 - GCG GTG ACG CTG CTG GGG CTG CCT G
dqb1-4   NH2 - TAA TGA GGG GGG TGG ACA CAA CGC C
dqb1-5   NH2 - GCG GTG ACG CCG CTG GGG CCG CCT G
dqb1-6   NH2 - GGA CAT CCT GGA GGA GGA CCG GGC G
dqb1-7   NH2 - GTG GTG ACG CCG CTG GGG CCG CCT G
dqb1-8   NH2 - TCC GTC AAA GGA GTC AGA AAG GGC T
dqb1-9   NH2 - GAT GTA TCT GGT CAC ACC CCG CAC G
dqb1-10  NH2 - CCG AGT ACT GGA ATA GCC AGA AGG A
dqb1-11  NH2 - GAT GTG TCT GGT CAC ACC CCG CAC G
dqb1-12  NH2 - GGG TGG ACA CAA CGC CGG CTG TCT C
dqb1-13  NH2 - GGG TGG ACA CAA CGC CGG TTG TCT C
dqb1-14  NH2 - CTT CTG GCT ATT CCA GTA CTC GGC G
dqb1-15  NH2 - TTC CGG GCG GTG ACG CTG CTG GGG C
dqb1-16  NH2 - GCT TCG ACA GCG ACG TGG GGG TGT A
dqb1-17  NH2 - GCT GTT CCA GTA CTC GGC GCT AGG C
dqb1-18  NH2 - CTT CTG GCT GTT CCA GTA CTC GGC G
dqb1-19  NH2 - ACC GTG TCC AAC TCC GCC CGG GTC C
dqb1-20  NH2 - CAC AAC GCC GGT TGT CTC CTC CTG G
dqb1-21  NH2 - CTC CTC CTG GTC ATT CCG AAA CCA C
dqb1-22  NH2 - CCA GGA TCT GGA AAG TCC AGT CAC C
dqb1-23  NH2 - GAG CGC GTG CGT CTT GTA ACC AGA T
dqb1-24  NH2 - GAC ATC CTG GAG AGG AAA CGG GCG G
dqb1-25  NH2 - AGA GAC TCT CCC GAG GAT TTC GTG T
dqb1-26  NH2 - TAG TTG TGT CTG CAC ACC CTG TCC A
dqb1-27  NH2 - ACG TAC TCC TCT CGG TTA TAG ATG T
dqb1-28  NH2 - GCT TCG ACA GCG ACG TGG AGG TGT A
dqb1-29  NH2 - TCC GTC CCA TTG GTG AAG TAG CAC A
dqb1-30  NH2 - TGA TAA GGC CCA GCC CGA GGA AGA T
dqb1-31  NH2 - GGG TGG ACA CAA CGC CAG TTG TCT C
dqb1-32  NH2 - GGG TGG ACA CAA CGC CAG CTG TCT C
dqb1-33  NH2 - GAC AGC GAC GTG GAG GTG TAC CGG G
dqb1-34  NH2 - TCC GTC CCG TTG GTG AAG TAG CAC A
dqb1-35  NH2 - GCA CGA CCT TGC AGC GGC GAC CCC A
dqb1-36  NH2 - GAA CAG CCA GAA GGA AGT CCT GGA G
dqb1-37  NH2 - CTT CTG GCT GTT CCA GTA CTC GGC A
dqb1-38  NH2 - AAC GCC AGC TGT CTC TTC CTG GTC A
dqb1-39  NH2 - GAG AGG ACC CGG GCG GAG TTG GAC A
dqb1-40  NH2 - GCA GGC GGC CCC AGC GGC GTC ACC A
dqb1-41  NH2 - GTC GCT GTC GAA GCG CAC GTC CTC C
dqb1-42  NH2 - CTC TGT CCT GGA TGG GGT CGC CGC T
dqb1-43  NH2 - ACG GGA CGG AGC GCG TGC GTT ATG T
dqb1-44  NH2 - GAA GTA GCA CAT GCC CTT AAA CTG G
dqb1-45  NH2 - TCG GTG GAC ACC GTA TGC AGA CAC A
dqb1-46  NH2 - GGA CGT GTA CCA GTT TAA GGG C
dqb1-47  NH2 - ACG TAC TCT TCT CGG TTA TAG ATG T
dqb1-48  NH2 - GAG AGG ACC CGA GCG GAG TTG GAC A
dqb1-49  NH2 - ACC CCA GCC TCC AGA GCC CCA TCA C
dqb1-50  NH2 - CAA CGG GAC GGA GCG CGT GCG GGG T
dqb1-51  NH2 - ACA TCT ATA ACC GAG AGG AGT ACG C
dqb1-52  NH2 - GAA CAG CCA GAA GGA CAT CCT GGA G
dqb1-53  NH2 - CCT TCT GGC TAT TCC AGT ACT CGG C
dqb1-54  NH2 - TTA AGG CCA TGT GCT ACT TCA CCA A
dqb1-55  NH2 - TTC AGA TTG AGC CCG CCA CTC CAC G
dqb1-56  NH2 - ATC TGG TCA CAA GAC GCA CGC GCT C
dqb1-57  NH2 - AGT AGC ACA GGC CCT TAA ACT GGT A
dqb1-58  NH2 - ATG TAT CTG GTC ACA CCC CGC ACG A
dqb1-59  NH2 - ATC TGG TCA CAT AAC GCA CGC GCT C
dqb1-60  NH2 - ATC AAA GTC CAG TGG M CGG AAT G
dqb1-61  NH2 - ACG TGG GGG TGT ATC GGG TGG TGA C
dqb1-62  NH2 - ATC AAA GTC CGG TGG M CGG AAT G
dqb1-63  NH2 - GTA TCT GGT CAC ACC CCG CAC GAG C
dqb1-64  NH2 - CGC TGT CGA AGC GCA CGT CCT CCT C
dqb1-65  NH2 - GGA M CGT GTT CCA GTT TAA GGG C
dqb1-66  NH2 - TGT GGG CTC CAC TCT CCT CTG CAA G
dqb1-67  NH2 - ACG TCC TCC TCT CGG TTA TAG ATG T
dqb1-68  NH2 - TTG CAG CGG CGA CCC CAT CCA GGA C
dqb1-69  NH2 - GAA GTA GCA CAG GCC CTT AAA CTG G
dqb1-70  NH2 - GAA GTA GCA CAT GGC CTT AAA CTG G
dqb1-71  NH2 - TCG ACA GCG ACG TGG GGG TGT ACC G
dqb1-72  NH2 - TCG ACA GCG ACG TGG GGG AGT TCC G
dqb1-73  NH2 - TGT GGG CTC CAC TCG CCG CTG CAA G
dqb1-74  NH2 - CGG CGT CAG GCC GCC CCT GCG GGG T
dqb1-75  NH2 - TCG ACA GCG ACG TGG AGG TGT ACC G
dqb1-76  NH2 - GCG TTG GAG GCT TCG TGC TGG GGC T
dqb1-77  NH2 - CGG TGA CCC CGC AGG GGC GGC CTG A
dqb1-78  NH2 - ATG GGA CGG AGC GCG TGC GTT ATG T
dqb1-79  NH2 - CGG TGA CGC CGC TGG GGC GGC TTG A
dqb1-80  NH2 - ACG GGA CGG AGC GCG TGC GTC TTG T
dqb1-81  NH2 - TGA TAA GGC CAA GCC CAA GGA AGA T
dqb1-82  NH2 - GAG ACT CTC CCG AGG ATT TCG TGT A
dqb1-83  NH2 - CGT CGC TGT CGA AGC GCA CGT CCT C
dqb1-84  NH2 - GAC TCT CCC GAG GAT TTC GTG TAC C
APPENDIX 2
Homozygotes
(From CFT if available, otherwise greedy algorithm).
DPA1
TGCCCAGGGCACAG
TAAGGAAAAGGCTC
TTGGATCTGGACAA
TCTGGCCCAGCTCC
TTTGTACAGACCCA
TAGGGGACCCTGTG
TGGCGGACCATGTG
TCTGCTCATCTTCA
TGTCAACTTATGCC
TTCAGGCCGCCAAT
DPB1
TT TTCAACCGGGAGGAG
TTGGCCTGACGAGGA
TTCAACCTGGAGGAG
TTT TTTCCAGTACTCCTC TTT TTTGCCGTAACTGGT
TTTTGGGGCGGCCTGA TTGCGCGTACTCCTC
TTTT TTTGGACAGGAGGAA TTTT TTCACAGGAGGAGCA
TTTT TTTTGCTCCTCCTGT TTGGCAATGCCCGCT
TTTT TTGGCACTGCCCGCT TTTT TTAGAGAATTACGTG TTTCCAGAGAATTAC
TTT TTAACTACGAGCTGG TT TTGGTCATGGGCCCG
TTTT TTTGACCCTGCAGCG TTTT TTTACACGTAATTCT TTGTAACTGGTACAC TTCTGACGAGGAGTA
TTTT TTTTACCTTTTCCAG TTTT TTCCTGGAAAAGGTA
TTTT TTGAGAATTACCTTT TTTT TTGCCTGACGAGGAG
TTTT TTACTGGTGCACGTA
TT- TTTCCTCCAGGATGT
TT TTCGGGAGGAGCTCG
TTAGCCAGAAGGACA
TTTT TTCAGCCAGAAGGAC
TTAGTGCCGGACAGG
T TTATTGCCGGACAGG
TTCCTGCAGCGCCGA
TTTT TTAGAGAATTACCTT TTTT TTGGACTCGGCGCTG
TTTT TTACTACGAGCTGGG
TTGCTTCGTGCTGGG
ΠTT TTGTCCCTGGTACAC ΓΠT TTGCGCTGCAGGGTC
DQA1
TTACATCCTCATCTG
TTACACCCTCATCTG
TT TTCAAGTTTACACCA
TTCAGCCACAATGTC
TTTCCAAGTCTCCCG
TT TTCGGGAGACTTGGA
TT TTAATTCATGGCTGT
TTACAATCCCAGGGC
TTACAACCCCAGGGC
TT TTGTGGGCATTGTGG
TTCCAACACCCTCAT
TTGGCCCACAGACAA
TTCATGGGCATTGTG
TT TTGGCCTGGATGAGC
TT ΓTAGGCTCATCCAGG
TT ΓTCAACACCCTCATT
TT TTAGCACTGGGGACT
TTAAGGGCCATTGTG TTTTAAATTCATGGGTG ΓCACCATAAGAGGC ΓCACCACAAGAGGC ΓCACCGTAAGAGGC I I I I ICCTCCCTTCTG TTTTAACTCTCCTCAG TTTTAAATCTCATCAG TTTCTCCTCCCTTCTG
DQB1
TTTATCTTGCAGAGGA TTTCCTCTCCAGGATG TTTGGGTCACCGCCCG TTTGGGAGTTCCGGGC TTTCGCTCGGGTCCTC TTTCCAGTACTCGGCG TTTCTGGGGCCGCCTG TTTATGTCTACACCTG TTTAAAGGGCTTCTGC TTTAGCATCACCAGGA TTTGCCAGGAGGAGAC TTTACCAGGAGGAGAC
TTT Gl GTTTCGGAATGA
TTTGGGTGTATCGGGT
TTTGTCGGAAAGGGCT
TTTTGGTTTCGGAATG
TTTCCAGTACTCGGCA
TTTAGCGCACGATCTC
TTTGTCTCTTCCTGGT
TTTCGTCAAGCCGCCC
TTTGCGTCAAGCCGCC
TTTCAAGGTCGTGCGG
TTTCGGTTATAGATGT
TTTTGTAACCAGACAC
TTTGTATGCAGACACA
TTTCACACCCCGCACG
TACACCCCGCACGC
DRB1
TTTGCAAGTCCTCCTC ΓTΠTCTCCTCCCGGT ΠTCCACAACCCGGTA ΓTTGGCCAGGTGGACA ΓTTGCGGTTCCTGGAG ΓTTCAGCCAGAAGGAC ΓTTGACTCGCCTCTGC ΓTTTCCAGGACTCGGC
TT ΓGAAATAACACTCA
TT ΓTGGAGGACAGGCG
TTTACGTGGTCGGGTG
TTTTACTCCAAGAAAC
TTTACGGTGTCCACCT
TTTGGAGAGGTTTACA
TTTCCAGTACTCGGCA
TTTGGAGTACTCTACG
TTTGTGTAAACCTCTC
ΓTTCGGTGCAGCGGCG
TTTGGAGGAGTTCCTG
TΠTGGAAGACGAGCG
TTTCAGGAGGTTGTGG TTTGACAGGCGCGCCG
TTTCCGTTCAGGAACC
TTTGGAATCCTCTTGG
TTTGCCACAAGAAACG
TTTACGTTTCTTGGAG
TTTCGGACTCCTCTTG
TTTTACGGGTGAGTGT TTTCCAGGAGGAGTTC TTTGTAATTGTCCACC
TTTTCGTAGCGCGCGT TTTAAGATGCATCTAT
TTTTACGTCTGAGTGT
TTTCCAGTACTCAGCA
TTTCGTAGCGCGCGTA
TTTATCTCTCCACAAC
TTTGAGCTCCTCCTGG
TTTAACCAGGAGGAGT
TTTAGGGCCCGCCTGT
TTTGGAGAGCTTCACA
TTTGGAGAGATTCACA
TTTTCACCGCCCGGTA
TTTAACTACCGGGTTG
TTTCCAGTACTGGGCA
DRB345
TTTGTATCTGTCCAGG
TTTGACTGGGGTGGTG
TTTCTGTCGAAGCGCA
TTTGTGTAAACCTCTC
TTTCTGTGAAGCTCTC
TTTCACCAGGGCCCGC
TTTGGCCAGGTGGACA
TTTGCGGTTCCTGGAG
ΓTTTCGAAGCGCGCGT
TTTTAACCAGGAGGAG
TTTACGTGGTCGGGTG
TTTAGGGCCCGCCTGT
TTTGGGCCCGCCTGTC
TTTAACTACGGAGTTG
TTTGGGGCCGGGCTGT
TTTGACCATGTTTCTT
TTTCTGTGCAGGAACC
TTTGGCCGGGCTGTTC
TTTACATCCTGGAAGA
TTTCTCACGAGTCCTG
HLA-A
TTTTCAGTCTGTGAGT
TTTAGACGCATATGAC
TTTGGACGCATATGAC
TTTGGTCGCCAGGTCC
TTTCCGCAGGCTCTCT
TTTTCCTCCTCCACAT
TTTCCGAACCCTCGTC
TTTATTTCTCCACATC
TTTGGCGGACATGGCG
TTTCCAGAGCGAGGAC
TTTTTCACCACATCCG
TTTGGGAGCCTGCCCA
TTTTGATGTGGAGGAG
TTTTTTTTTTTT GGAGGAGGAACAG
TTTTTTTTTTTT AGTCATATGCGTC
TTTTTTTTTTTT GGTCTGCCCGAGC
TTTTTTTTTTTT AAACCTGCCATGT
TTTTTTTTTTTT CCGGGACACGGAA
TTTTTTTTTTTT CGTCCTGGGGGGG
TTTTTTTTTTTT CCGCTGCCAGGTC
TTTTTTTTTTTT ATGCGTCCTGGGG
TTTTTTTTTTTT ATGCGTCTTGGGG
TTTTTTTTTTTT GGAGAAGAGATAC
TTTTTTTTTTTT GGGAGCCCGCCCA
TTTTTTTTTTTT CCGCAGGTTCTCT
TTTTTTTTTTTT GCGCAGGTCCTCT
TTTTTTTTTTTT GGGCGGGCTCTCA
TTTTTTTTTTTT CCAGGACACGGAG
TTTTTTTTTTTT CCGGCAGTGGAGA
TTTTTTTTTTTT AGGAGACAGGGAA
TTTTTTTTTTTT GTCAATCTGTGAG
TTTTTTTTTTTT AGAAGTGGGTGGC
TTTTTTTTTTTT CAGGTAGGCTCTC
TTTTTTTTTTTT CGGACGCCCCCAA
TTTTTTTTTTTT TCAATCTGTGAGT
TTTTTTTTTTTT TGAAGGCCCAGTC
TTTTTTTTTTTT CGTCGTAAGCGTC
TTTTTTTTTTTT AACCAGAGCGAGG
TTTTTTTTTTTT TGACGGTCATGGC
TTTTTTTTTTTT TGGACCTGGCGAC
TTTTTTTTTTTT GAGAGCCCGCCCA
TTTTTTTTTTTT TCATATTCCGTGT
TTTTTTTTTTTT GGGAGACACGGAA
TTTTTTTTTTTT GTCCACTCGGTCA
TTTTTTTTTTTT CCGTGTCTCCCCG
TTTTTTTTTTTT GCTGCCACGTGGG
TTTTTTTTTTTT CGAACTGCGTGTC
TTTTTTTTTTTT GGTAGGCTCTCAA
TTTTTTTTTTTT AGGTCCACTCGGT
TTTTTTTTTTTT GTCCTGGGGGGGT
TTTTTTTTTTTT GCTGCTCCGCCGC
TTTTTTTTTTTT GGGGCGCCATGAC
TTTTTTTTTTTT GCGCGATCCGCAG
TTTTTTTTTTTT GCACATGGCAGGT
TTTTTTTTTTTT AGGAGAAGAGATA
TTTTTTTTTTTT AGGAGCAGAGATA
TTTTTTTTTTTT CCACTCCACGCAC
TTTTTTTTTTTT CCCGTCCACGCAC
TTTTTTTTTTTT CACGTGCCATCCA
TTTTTTTTTTTT CCCGGCCCGGCAG
TTTTTTTTTTTT CACGTCGCAGCCA
TTTTTTTTTTTT ACGTCGCAGCCAT
TTTTTTTTTTTT ACGTGGCAGCCAT
TTTTTTTTTTTT ATCCAGAGGATGT
TTTTTTTTTTTT CGAGCTCCGTGTC
TTTTTTTTTTTT ACCAGAGCGAGGA
TTTTTTTTTTTT ATGAACAGCACGC
TTTTTTTTTTTT TCACACCCTCCAG
TTTTTTTTTTTT CTACGTGGACAAC
HLA-B
TTTGGATGGCGCCCCG TTTCGGCTCAGATCTC TTTCGGGGCGCCGTG TTCTCCACTGCTCCG TTTGTGTTGGTCTTG TTGGGTATGACCAGT TTTCCAGGTGATGTA TTGTCCTGCTCCGCC TTTGTAGTAGCGGAG TTGCTCAGGTCCTCC TTACCAACACACAGA TTCCGTCGTAGGCGT TTGTGAGCCTGCGGA TTACATCATCCAGAG
TTGGTTCTCTCGGTA
TTTGATGTGTCTCTC
TTGCGCCATGACCAG
TTGGCGTCCTGGTCA
TTAGGAGGACCTGAG
TTGCGCCAGGCACAG
TTAGGAGGGGCCGGA
TTCCGCTGCTCCGCC
TTACACCATCCAGAG
TTCACACAGATCTAC
TTGGGCATGACCAGT
TTCACACAGATCTCC
TTGCGAGTGCGTGGA
TTTGGTACCCGCGGA
TTCCTGTGCGTGGAG
TTAGACACAGATCTT
TTCAGCGACGCCACG
TTCGGGCCGGGACAC
TTCCCGTCCCAATAC
TTGGGCATAACCAGT
TTGCCCCGCTTCATC TTCAGGAGCGCAGGT ΓTCGTCCACGCACAG TTGAGTCCGAGAGAG ΓTGACACAGATCTCC
T TAACCAGTTAGCC
TTTAGGCGTGCTGGT ΓTGACCCTGCTCCGC TTGGGGCTCCGCAGA
TCCGGTCCCAATAC TGCGGGTCACGGCG
TTAGGGCCAGGGCTC
TTATCCTCTGGAGGG
TTGGCAGACGATGTA
TTAGGCGGAGCAGGA
TTCAGCTGCTCCGCC
TTATCTGCGGAGCCA
TTCGGAGCTGTGGTC
ΓTCGACCACAGCTCC
TTGAAGAGTTCAGGT
TTCATGTCGCAGCCA
ΓTCTGGGCTGGCTCC
ΓTCAACACACAGACT
ΓTTGGCGGAGCAGGA TTTATGACCAGGACG
TTCCACTGCTCCGCC
TTATGACCAGGACGC
TTGGAGGGGCCGGAG
TTTGCGTGGACGGGC
TTAGATCTGTATCTC
TTGCGGGTCATGGCG
TTCCGGGACATGGCG
TTCCACAGCTGTCCA
TTCGGGACATGGCGG
TTCCCGTCCACGCAC
TTGAAGTGGGAGCCG
TTTTCCCAATCCACC
TTCCCACGATGGGGA
TTTTCCCAGTCCACC
TTGAGATCTGAGCCG TTTCCACGCACTCGC TTGACAGCGACGCCA TTCGCCGCGGACACC TTGTAGGAGGAAGAG TTCTTTTCCACCTGA TTCACGTCGCAGCCA TTCAGGTCGCAGCCA TTCGTAGCCCACTGC TTATCCAGGTGATGT TTTCCCAATCCACCG TTGGGCGCTTCCTCC TTCCCGCTTCATCGC TTCCCCGCTTCATCG TTCACACAGACTTAC
T AGGACGGTTCGGG
TTCCCCGAACCGTCC
TTGAGCTCTTCCTCC
TTGCTCCCGAGAGCA
TTACTCCATGAGGCA
TTGCTGTGGTGGTGC
TTTTGTCCAGAAGGC
TTTGCCCGCGGAGGA
TTGCCGCGGACAAGG
TTCCGCCTTGTCCGC
TTCGGGTACCACCAG
HLA-C
TTTGAGCTGGGAGCC
TTGGTGCAGGGCTCC
TTGGGTGCAGGGCTC
TTGAGGCGGAGCAGC
TTACGGCGGAGCAGC
TTGCGGCGGAGCAGC
TTAGCGCGCGGAACC
TTCGGCCCAGGTCTC
ΓΓTGGCTCCCAGCTC
TTGCGCGCGGAACCC
TTACGGCTTCCATCT
TTGGTTCGGGGCTCC
TTACTCCACGCACAG
TTTGGAGCAGGAGGG
TTGCGCGCAGAACCC
TTT TTTGAGTCTCTCATC TTT ΓTCCTGCAGCCCCTC 1H99V99W90VJLLL 001001009W00111 09W0900VW90111
0991090VWI I I I I I 11
9901110010001 I I I 11 ςς
109019W00V01 I I I 11
9JLL0099V0901VUJ
VH9199001V0V1H
0010V100V10V9ULI
0991090WV1V9111 OS
00V11V00999W111
99901100V99V9JL11 11
10V100V1000VI I I I u
91V9V9V091090111
W09V19019W9111 u 9P
99W900010V91 I I I LL
991001V9111V9111
9V9910VW0999ULL
V191LL9099900LLL 11
000W110V90 I I I I Ot?
0V1W10091919LU LL 9910VlV9111V011i LL 1911V000000VI I I I 09V91199V909011i LL 19110000V091 I I I I ζ£
V11W990LLV1LU_1
V910900910V9V111
V911900910W9111
91099W090099111 LL
19091009009V I I I I LL oe
VN*J S9L
90910VW99091 I I I
9W9991W9W9111 11
9V00101101091 I I I 11
V9W99V99V1V0111 90H9V00W1VI I I I 11
010999V00999V1LL
09V0V99V10010111
9910019109910111 03
V0V99V009V0V9111
0991V191V1W9111
V9V99V90W0V9LLL
H9V00V91V199JLLL 11
H9V00V91V099UL1 ζ\
1009900019190111 11
900009V9999V9111 11
9V09001919009111
099V0V090V001111 LL
V101V9919V001 I I I 01
009V090V99109111 99V9999909V191LL LL 9V0V099V00909111 LL 9V0V099V00900UL1 LL 9099V19019009111 11 0900009V99990ULL V191V1W9V001 I I I LL 0900191910900111 0900191900900111
rCTAATACCCGGAG
ACTTTCAGTGGGG CTGCGTGAAGTCG AATAGCCCACCAA AACGGAAACGGGG GGATTGCACTCTG TAGCCTTGGGGAG CGCCGCATGGCTG GCATAAGGGGCAT TACCACATCTCTG GTTACCGCGAGGA GGCTTTCAGAGAT ΓCGCTGCTTCGCTG TAGCGCTACCTTG GCACCACCTGTCA TGAGTTTTAACCT CTAATACGGGATA AGGAGAAAGCTTG TTAAGAGATTAGC GTAGCATTCTGAT AGGCTTTCCCCCA AGAAGTAGCTTGC TCGCGTATCATCG TTCAGAGATTAGC TCCGAAAGCGTGG TACAACCCGAAGC TGTCATGGCTCAG CGTAGGCTTGGTG GTGGAATTCCACG ACGGTTCCCGAAG AACTCGAGTGCGT TGATGTGCTATTA
AAGCAGGGAGGAA
CTGCTGCAGTGAA
TTGGGATTAGCTC
CCTTTGATACTGG
ΌGACGCTAGCGGC
ΌTTTACTACCCAC
ΌGCGATCTCTAGC
TAGGCCGTTCCCC
ACGCGTTGCATCG
GCCCGTCAAGCCA
AGTCCCCGCCATT
CTAGCCGTAAGGG
TGTCCTTCGGGGG
AACCAACTCCCAT
'ACTGTGGGTAATA
CTGAAAGATGGCG
CGAAAGCCAGGGG
GTCCGGAATTCTG
CAGAAGTGGGTAG
TCAGTCCTCATGG
GAAAGAAGCTTGC
GACCACCTGTCAC
TTTGGAACTGCAT
ACAGTTCCCGAAG
CTCATATCTCTAC
TTCAGTGAGGAAG
ACTGTGAGGAAGG
CCCAGCCCGTAAG ΌGTAGCCTTGGTG
ΆTGATGCGTAGCC
'AGGCAGTGGCTCA
CAGGACTTAACCC
GGCCAGGCCGTAA
TCCAACTTCGTGC
GAAGCGTGTGTGA
CTCCCCCGAAGGT
ATGGGAGTTTGTT
GTGTGCCGTTACC
AGCAGTGAGGAAT
GCCCCGGTTAACT
GCACCGGCAGTCA
GGACCTTCCTCTC
ACCTAGGTGGGAT
AATAGCTAATACC
GCCATATCTCTAC
GCCGGTGGGGTAA
TACCCCACCTTCG
CAAGGCCTGGGAA
CAACCCTGGTGGC
CTAGTCATCCAGT
GGCTGCTGCCTCC
CCCAGAGCTCAAC
GAAAGCTTGATCC
AACACGCTGGCAA
GAGCTTGCTCCCC
ATTTAGTTGAGCA
CGACTTAGGCTCA
TTGATGTGCTATT
CTTAGGTGCCAGC
GGCTACAGATCGT
AACTTGCGTGCAT
GCGATTACGTCAA
GGACGTTGGCGGC
TGGTGGAGCATGT
ATAAACCATGCGG
AAGAAGTGGGTAG
AACAAGCTAATCC
TCCATGGTTTGAC
AGTAACTGCCGGT
CAAAAGGGGGCGT
GGCGCTTGCGCTC
GCTACCTACGTGC
TGCGAGGTGGAGC
CGCGAGGTGGAGC
GCTACCTACTTCT
TTAACACATACAA
TGTTGTGAAATGT
CGTAAAACTCAAA
TCAAGGGGCAAGT
TCCAACCTTGCGG
GGAGGAACGTGGG
ATAAGCCTCTCAG
TATGCTAATCCCA
GATGCTAATCCCA
GCCAGTGTTCGTC
GTAAAGGTGGGGA
TTAACACACCGCC
CCAAGGCGGTGAT GCTACGGCTAACT
AGTCGAGCACTCT
AAGGGTAGCTAAT
GTCACAGTACGAG
TGAAAGCACTTTA
GGCGCAAGGCTTA
GCCTAGGTGGGAT
GTCCCCACGTTCC
GGCCACAAGGGGA
CTAGCTGTAGGGA
GTGGGCAGCAAGC
TCGAAAGATTAAA
GGAGTATGGTCGC
CGAGATGTGAAAG
GGGCAGGCTAGAG
ACCTCCTGAGCCA
TCCACCGCTACAC
TTTCAGTCTTGCG
CTTGACGGGCGGT
ACGGTAAAAGATG
TTCACCCTTGCGG
TAACCAGAAAGCC
CAACCAGAAAGCC
GTGTCAAAGGCAG
TAAGTCCGGATTG
GCGACATGCTGAT
ATCAGCCTGCCGC
GTCGGTAGGGTAA
GTCGGTGGGGTAA
CAACTCATAAGGG
TTCACTGCTTAAA
CGCCAGTCCCACC
CTAGTCATAAGGG
CACTGATTTGACG
GGCCACACAGGGA
TTTCCCCCATTGT
TGACCAGAAAGGG ACACTGGGGGATA TCAGCCGCCTTCG ΓGTCGCCAGCTCGT CTCATATGAATTG TGTAAAGGGAGCG CGTAAAGGGAGCG GGCGGCTCCCTCC CAGATGTTCCTCC GTCTCACGACACG TCAGCCGCCTACG TTGTGCTAATACC CTTGGAACTGCAT AGTACTCACCCGT ATTGCTCCATCAG TGATCCTGAGCCA AGCAAGTAGAACG TGCAAGTAGAACG GATAACCGCAAGG GCAAGCGTTTTCC GAATACCTCCTTT ACAGAGCTTTACA TGTCCTTCGGGAG AGGCGGCTTGCTG
Heterozygotes
From CFT if available, otherwise greedy algorithm.
DPA1
TTGCCCAGGGCACAG
TTCTGTTGTTCTATG
TTAAGGAAAAGGCTC
TTATGAAGATGAGCA
TTCACCCTCAGTGAC
TTGTCAACTTATGCC
TTGCAGGAAGAGGCT
TTTTTGTACAGACGC TTCGGTCTCCTTCTT TTGCAATGGGGAGCC TTTGGATCTGGATAA π ΓGATGAAGATGAG
ΓTTTGTTTGTACAGAC TTCGTTTGTACAGAC TTCTCAGGCCGCCAA TTCTCAGGCCACCAA TTATGTGGATCTGGA TTACACTCAGGCCGC TTCACACTCAGGCCG TTTCAGGCCACCAAC TTCGTCTGTACAAAC TTAGAACATCTCATC TTAGAACTGCTCATC
TTTTGAATTTGATGA
TTTTGAGTTTGATGA
DPB1
TCAACCGGGAGGAG
TCAACCTGGAGGAG
TCATCCTGGAGGAG
TTGCTGGGGGGTCA
TGGCCTGACGAGGA
AACTACGAGCTGG
TTCCAGAGAATTAC
TTGCCGTAACTGGT
TTCCAGTACTCCTC
TAGTGCCGGACAGG
TACCCCCCAGCAGG
TAGAGAATTACGTG
TTCCAGTACTCCGC
TGCATTCCTGCCGT
TCGGGAGGAGCTCG
TCAGCCAGAAGGAC
TATTGCCGGACAGG
TCTGCAGCGCCGAG
TGCGCGTACTCCTC
TACAGAATTACCTT
TTTAAGTGTACCAG
TATCCTGGAGGAGA
TGGTCATGGGCCCG
TGGGAGGAGTACGC
TTGGGGCGGCICTGA AAAAGGTAATTCT
CTGCCGTAACTGG
TTGTGTCTGCATA
GGCTGTTCCAGTA
GTCCCTGGTACAC
CCTGCAGCGCCGA
TCTTGGAGGGGGA
GAGGTCCTTCTGG
CAACCGGCAGGAG
TGTGTCTGCATAC
CGGGAGC.AGTTCG
TGACCCTGC-AGCG
CAGAGAATTACCT
TGGGTAGAAATCC
TTACGTGCACCAG
CGCTGCAGGGTCA
AGCCAGAAGGACA
GTTCCAGTAGTCC
GGCCTGCTGCGGA
TGCAGCGCCGAGG
ACTACGAGCTGGT
CTGGGGCGGCCTG
ACAGCGACGTGGG
TGCCGGACAGGAT
CTGCCGTCCCTGG
CATGGGCCCGACC
GTCCCATTAAACG
GTAACTGGTACAC
AAGGACCTCCTGG
CTCCTGGAGGAGA
GAGAATTACGTGT
CCTGATGAGGTGT
CACAGGAGGAGCA
TGCCGTCCCTGGT
"GGGAGGAGTTCGC
TGGACAGGAGGAA
ACCCTGCAGCGTC
CCGCCCGGAACTC
TTT "GCTGCAGGGTCAC CAGGACTATCCA
GCGTACTCCTGCC
CCGTAACTGGTGC
GCAGGAATGCTAC
"CCAGGCAGCATTC
AACCGGGAGGAG
TGGCCTC.AGGCGGA
ACTACGAGCTGGG
ATGAGGTGTACTG
ATACATCTACAAC
TAACTGGTACACT
CACGTAATTCTCT
AGCATTCCTGCCG
ACTGGTACACTTA
GGCAATGCCCGCT
GCTTCGTGCTGGG
CGCCCGGAACTCT
ACAGGACTGTCCA
TCCTCCAGGAGGT
CCTTCTGGCTGTT
GTTCCAGTACTCC 0101001910V9V1 09
IVOIOOOVOWOOI
0O1O11V00OO1V1LL.
991911V0999191
0999V0000W0V1
OOOOVOOOIWOVI ςς
OIOIWOVOOOVOI
OOLLVOOOOIOVO
O1O1V01001VOV1
0W009V0V0W01
0W099V9W1V01 05
OWOOOVOWOVOl iNoα
0100V01V91000LL ςp 00W0V99W0V0LL 0V0V10V00LL91LL 9100100V99W9LL 91001V0V99W9LL 191199000100111 ot? V099190V900V0LL 9W00V0010101LL
OV000000VO1W1
V0V19910W1 0LLL.
01011W190V0VLL sε
9109000010V00LL
01V1V99V0V000LL
01V1000V0V090LL
VLLOV00V0O1 0LL
0100W0V00W0LL oε
001001000101VLL
000010LLV0101LL
0W00V0010010LL
V0V1O01000190LL
VIOOWWOOIOOLL ςz
W10009100LLVLL
010019V19V001LL
0V19991000100LL
0W0W0010000LL
WOOIOOOIOOOOU oz
WW09100010011
100100109000011
OlOOlOWlOOOO i
00100100V00V0LL
0100119V99V99LL si
9V00 1 I I I OOVLLLL
V0V00V1910010LL 00V0V10V00110LL VLLW9V9V0010LL
10V00V00000WLL 01 101V09V001001LL
19V99V000V0WLL
00V0019V90000LL 00V11W0V9V00LL OOVLLWOVOVOOLL
OIOOOVOOIOOOVU 0W1O00 100LLLL
V0V00V99100WLL 01099V0010000LL
TCCAACATCCTCAT TGGCCCACAGACAA
ΓT TCATGGGCATTGTG TAACATCCTCATCT TCAACACCCTCATT TGACTGTGGTCTGC TAGCACTGGGGACT TCTTAGATTTGACC TTTTAGATTTGACC TCGATGTTCAAGTT TCAATCCCAGGGCG TCCTCGGATGATGA TTCCACATAGAACT TAAATTCATGGGTG TCAGCCACAATGCC TCACCATAAGAGGC TTTCCTCCCTTCTG TAACTCTCCTCAG TTAAATCTCATCAG TCTCCTCCCTTCTG TGTCAGCCACAATG TTCATTCCTTCTTC TCTTCCTCCCTTCT TTTTTT I I I I I l ATAACTCTCCTCA TGAGGCTCATCCAG TC.AGGCTTGTCCAG TATGTTGACCACAG TAGTGCCCACCACA TGAACATCCTGATT TGGACCTGGAGAAG TCCCTCTGGCCAGT TCCCTCTGGGR-AGT TTTACACCGTAAGA TAGAAGATTTGACC TGAACTGGCCAGAG TGCTACAACTCTAC TCAGTCTTACGGTC TCAGTCTTATGGTC
DQB1
TATCTTGCAGAGGA
TGGCTGGGGTGCTC
TGGGTCACCGCCCG
TCTGGGGCCGCCTG
TCTCGGCGCTAGGC
TGTATCTGGTCACA
TAACTACGAGGTGG
TCCAGTACTCGGCG
TCGGTTATAGATGT
TGCAAGTCCTGGAG
TTGGACACAACGCC
TCTGGGGCTGCCTG
TGGCCTTAAACTGG
TTGTGTCTGCATAC
TGTCGGAAAGGGCT
TGGGTGTATCGGGT
TCCAGTACTCGGCA
TGTAGACATCTCCA
TAGGAAACGGGCGG CACACCCCGCACG
TT CCGCTCGGGTCC AGCATCACCAGGA CCAGTTTAAGGGC ATAGCCACAAGGA GTATGCAGACACA TCCAGTACTCGGC AGCGCACGATCTC GGACATCCTGGAG
TT GGGGCTGCCTGA GTCAGAAAGGGCT CAGGAGCCCTTTC TGTCTCTTCCTGG ACACCCCGCACGC TGGTTTCGGAATG AACGGGACAGAGC GCTGGGGCCGCCT GAGGATTTCGTGT GAGAGGAGTACGC CACATCAAAGTCC GCCAGGAGGAGAC GTACTCGGCGGCA π TTTTTCGCCAGTTGTCTC AGGGGGGTGGACA AGATGTATCTGGT TGGGGGAGTTCCG TGTCTCCTCCTGG CACACTCTGTCCA GGAATGATCAGGA ATGGGGTCGCCGC CAGATCAAAGTCC AACGGGACCGAGC AGGAGTACGTGCG ATGTGACCAGATA AGGGGCGGCCTGT CGCCGGTTGTCTC TGTAACCAGACAC GTGAAGTAGCACA AGCGGCGACCCCA CACACCCTGTCCA GTGTGACCAGATA TGGACCTTCCAGA ATCGGGTGGTGAC GTTTAAGGGCCTG TGAAGTAGCACAG GCTCCAACTGGTA CCTTAAACTGGTA AGGAGGACGTGCG TCGTGCTGGGGCT CGCTGCTGGGGCT CCAAGGAAGATCA ACCGCGCGGTGAC GCCCTTAAACTGG
GGTCACACCCCG GGGAGTTCCGGGC AGGAGGAGACAAC GGGTGGACACAAC TCTGCTCGGTGAC TGGGGCGGCTTGA GCGCACGTCCTCC TAGGATTTCGTGTA
TTTTT TGCCTTAAACTGGA
DRB345
GTACCTGGACAGA
GTTCCTGGAGAGA
ACACTCATACTTA
ACACTCAGACTTA
TCCTGGAGCAGGC
TCGAAGCGCGCGT
AATCTGCACAGAG
AGGGCCCGCCTGT
AGGACACTCTGGA
GTGTAAACCTCTC
CTGTCGAAGCGCA
GGGGCCGGGCTGT
TCTTCCAGGATGT
AACTACGGAGTTG
CAAGAAACATGGT
TAACCAGGAGGAG
TGAAGCTCTCCAC
GGGGCGGCCTGTC
GCGGCGCGCGTGT
TTTCTTGGAGCTG
TTCTCTTCCTGGC ACTACGGGGTTG
"GTATCTGATCAGG
GGCCAGGTGGACA
"GCCCCAGCTCCGT
GGTTCCTGGAGAG
GTCGAAGCGCACG
"GTGTCTGCAGTAG
GCTCCACTTGGCA
TACGGGGTTGGTG
CGGTTCCTGCACA
TCCAGTACTCGGC
TGTCCACCTCGGC
TCTTCCTGGCCGT
GGTGTCCACCAGG
ACTCCGTAGTTGT
CACTCAGACTTAC
GATGCTAGAAACA
GTGGAATGGAGAG
TAACCAAGAGGAG
GTTCCGGAATGGC
GTATCTGCAGTAG
ACCTCCTGGTCTG
AGCCAACAGGACT
GCGGTTCCTGCAG
CGCGCCGCGGTGG
GTAAACCTCTCCA
CTGATCAGGCTCC
TCCAGGACTCGGC
AACCATTCACAGA
CGGGCCCTGGTGG
GTTCCGGAACGGC
GCGGCCCGCCTGT
TCCTGGAAGACAC
GCCGGGTGGACAA TCTGCTCCAGGATG
TCAACTACTGCAGA
TGTACCTGGAGAGA
TACCTCTCCACTCC
TGTGAAGCTCTCCA
TCCGCGGCGCGCGT
TCTGATCAGGTTCC
TAATGGGACGGAGC
TTATGGAAGTATCT
TTCTGCAGTAGGTG
TCGGGCCGCGGTGG
TCTGTGCAGGAACC
TCCAAGAGGAGGAC
TCAATTACTGCAGA
TCACCTACTGCAGA
TCTGCCTGGATAGA
TGTAATTGTCCACC
TCACCAGGGCCCGC
TTGCGGTACCTGGA
TCCTGCAGCACCAC
TGCGGCGCGCCTGT
TCCAGGACTCGGCA
TGACACAACTACGG
TGATACAACTACGG
TACTCAGACTTACA
TTGAGACTTACACA
TTACGGGGTTGTGG
TGTAGTTGTCCACC
TAACCAGGAGGAGT
TAACCAAGAGGAGT
TTCCACAGCCCCGT
TCAGCCAGAAGGAC
TGGAGGAGTTCCTG
TGAACTCCTCCTGG
TAACCACTCACAGA
TGGCCGGGCTGTTC TCTCACGAGTCCTG TGTCGAAGCGCAAG TCCTCCTGGTCTGT
HLA-A
TTCAGTCTGTGAGT
TCCGCAGGCTCTCT
TATGAGGTATTTCT
TGGACATGGAGGTG
TC-AGGTAGGCTCTC
TTACTCTTGGGGGC
TGGTCGCCAGGTCC
TGGGAGCCCGCCCA
TCCGCTGCTCCGCC
TTGAAGGCCCAGTC
TGCAGCCATACATC
TCCACTCCACGCAC
TCACGTCGCAGCCA
TGGTCTGCCCGAGC
TCAGGTAGACTCTC
TGGGAGACACGGAA
TCCCGTCCACGCAC
TGTCCACTCGGTCA TATCCAGAGGATGT
TCGCGATCCGCAGG
TCCGGGACACGGAA
TGGAGGAGGAACAG
TAAGTGAAGGCCCA
TGGGGCTTGGGGAG
TCAGACTAACCGAG
TGTCCTGGGGGGGT
TCGTCGTAAGCGTC
TAGGTCCACTCGGT
TGGTAGGCTCTCAA
TGCGCGATCCGCAG
TGTGTCCTGGGTCT
TATCC.AGATAATGT
TCCGTCGTAGGCGT
TTCATATTCCGTGT
TCGGACCCCCCCCA
TGCCGCATGGACCG
TGCTGCTCCGCCGC
TAGCGCAGGTCCTC
TCTACCTGGATGGC
TGGTATTTCTTCAC
TATATGAAGGCCCA
TCCGTGTCTCCCCG
TCCGGCAGTGGAGA
TCGGACGCCCCCAA
TCCGTGAGGCGGAG
TAGGAGACAGGGAA
TAGAGCGAGGACGG
TGCACATGGCAGGT
TCAGCTGCTCCGCC
TATGAACAGCACGC
TCCCGGCCCGGCAG
TGCAGCCTGAGAGT
TGACGGTCATGGC
TCCGTCGTAAGCGT
TGAGTATTGGGACC
TCTGGCCTGGTTCT
ΓTACCTCATGGAGTG
TAGCCGCCATGTCC
TCACGTGCCATCCA
TGGTCCCCAGGTTC
TAGGAGAAGACATA
TCTGCTGCTCCGCC
TTGACCCAGACCAG
TCGGGCGGAGCAGT
TAGGTTCGCTCGGT
TCATATGCGTCCTG
TCGTCCTGGGGGGG
TGCACGTGCGTGGA
TGGTATTTCTACAC
TAGGAGCAGAGATA
TCCCGAACCCTCGT
TGCCACATGGGCCG
TAGCAGGAGGAGCC
TATCCAGATGATGT
TGGATGGGGAGCAC
TGC.ACTGGCGCTTC
TAGCTTGTAAAGTG
TGATAATGTATGGC ΓTCACACCCTCCAG
ΓCTACGTGGACAAC
ΓCGAGCGAACCTGG
ΓCGAGACAGCCTGC
ΓGGGCTACGTGGAC
ΓACCACCAGTACGC
ΓGAGGATGTATGGC
TGATCTCAGCCGCC
TGATCTGAGCTGCC
TGATGATGTATGGC
TATACCTGGAGAAC
TGATGTATGGCTGC
TTCCGCAGGTTCTC
TGAGCAGAGATAAA
TGGGCTGGGAAGAC
TGATGGGCAGGACT
TTCACTTTCCCTGT
TCCCACGATGTGGA
TAGTCATATGCGTT
TGGCGGACATGGCG
TGCTCCGCCTCACG
TCGTCGTAAGCGTT
TGATC.ATGTTTGGC
TCACGGACGCCCCC
TGCTCCTCCTGCTC
TACTCACCGAGTGG
TAGTCATATGTGTC
TGGTCTGAGCTGCC
TTCCCACTTGCGCT
TGCCCACTCACAGA
TGGCTCAC.ATCACC
TGCTCTTGGACCGC
TGAGAGCCTGCGGA
TGGAACACACGGAA
TCGGAACACACGGA
TCGTAAGCGTCCTG
TGCCGGTGCGTGGA
TGCCGCATGGGCCG
TCCAGAGCGAGGAC
TCCCAACGGGCCGC
TCGAGTGCGTGGAG
TGCGAACCTGGGGA
TCGGGTACCAGCGG
TTGAAGCGGGGCTC
TGGCGGCCCGTTGG
TTCTGGGTCAGGGC
TGCCTCATGGGCCG
TCCATCCCGCTGCC
TAGCTCAGACCACC
TGTCGTAAGCGTCC
TCCCGGCCGCGGGA
TGGTCCCAATACTC
TCGTCCCAATACTC
TGTTCTCACACCAT
TTCCTCTGGATGGT
TTCCCACTTGTGCT
TCCTGACCCAGACC
TTGAGAGCCCGCCC
TGAGTGCGTGGAGT
TTACATCATCTGGA LL9V00VO1V1O01 09 9W91V90010V01 V000V000199V01 OOVOOOOLLlOVll OOIOOIOOOOWOI 0V0199010V0101 ςς 119V00V91V0991 V191V9919V00LL 09900001099001 19V00V91V00991 010009W000091 09
0010000001199 1 I I I I I LL I I I I
0101099V199V01
000V0V0000V991
0V0009000V10V1
00O0WOV0OO 01 ςp
0099V190100001 V0V090V0O10V1
0V001000V0V011
9V00V99919V991
010000101009V1 OP
V00OVOWOVO1V1
000V999100V011
O-VΠH ζ£
0V00V00100V101
000V000V0010V1
0191900100V001
019190010W001
OV010100WOV11 oε
0991V0V9909901
00999010000091
19V010101W011
001019190011V1
1000V991010011
109W9910LL0V1
001V0VW00OV01
00901000V00001
00V0W09V99V91 oz
0100001000V011
OOIOIOIOOOLLLI
9999LL010001V1
999910010001V1
00001V01900091 ς\
00009W99V0001
V001V9999V0911
19VOOV00000011
1V0V0010010011
V00101V01V0W1 01
V9V9191991V991
01V0V0V000V001
0V990V0V999001
OV OV0VOOV001
W000000000001
V0V919V991V011
V99900V0001001
9V9V99V00V9V11
0LL00V00001V01
TTACAGCCAGGCCAG
TTGAGGCGGAGCAGC
TTTGGTTGTAGTAGC
TTACCTGCGGAAACT
TTCGGCCCAGGTCTC
TTGCTGGACGCAGCC
TTCAGGTTCCGCAGG
TTCCGCCAGGCACAG
TTCCTCCTACACATC
TTACGGCGGAGCAGC
TTAGCGCGCGGAACC
TTTTCACTCGGTCAG
TTACGCCGCGAGTCC
TT I I I I I I TGGAGCAGGA.GGG
TTGGGTATGACCAGT
TTATACCTGGAGAAC
TTGGGTTCGGGGCTC
TTGACCGCTAGGACA
TTATCTGAGCCGCTG
TTCGCGGAGAGCCCC
TTCCTGGCGCTTGTA
TTCCTGCGGAAACTA
TTAGCGTCTCCTTCC
TTTGGCGCCCCGAAC
TTATGATGTGAGACC
TTCTCGGTGTCCTGG TTGTAGTAGCCGCGT TTAGGATGTGAGACC TTGGTAGGCTCTCTG TTAGCGTCTTCTTCC TTCATAGGAGGAAGA TTGACAACCAGGACA TTGCCGCGGGGAGCC TTGGTGAGGGGCTCT TTCGAGGGGCTGCCA TTGGGTATAACCAGT πΓCCAGAATATGTA
TTGGGTGCAGGGCTC
TTCGCGCGGAACCCC
TTTAGTAGCCGCGTA
TTAGCTGCTCTCAGG
TTACCGCACGAACTG
TTCCGCAGGCTCACT
TTGGTGTGAGACCCG
TTTGGAGCCCCGAAC
TTAGCCGCGGGAGCC
TTACTGCACGAACTG
TTCCGCACGAACTGT
TTGGTGCAGGGCTCC
TTGCAGCAGGAGC.AG
TTTGAGTCTCTCATC
TTCCGCCGTGTCCGC TTTCCACGCACAGGC TTACTCGGTCAGCCT TTCACACC.ATCCAGA TTCACACCCTCCAGA TTGCAGCAGGATGAG
TTCAGCCACCACAGC TTTCGTGGCTGGCCT TTTACGGCGGAGCAG TCTCACACCATCCA
TTGCGGCGGAGCAG
TTCTGAGCCGCCGT
TGGCGGAGCAGCAG
TCCGCTGCGGACAC
TTATAACCAGTTCG
TCACATCCTCCAGA
TCCGTGTCCGCGGC
TCGTGGACGACACA
TCCGCTGTGTCCGC
TGAAGAATGGGAAG

Claims

1. A method of identifying a set of extendible primers for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms wherein: i) all possible nucleotide sequences of a chosen length of the nucleic acid are identified and their corresponding extendible primers, ii) at least one extendible primer is removed from the set wherein the at least one primer removed identifies a segment of the nucleic acid identified by at least one other primer.
2. The method of claim 1, wherein between steps i) and ii): ia) potential extensions for each primer are identified with respect to each nucleotide sequence, ib) for each extendible primer the identified potential extensions are compared to determine which pairs of sequences can be discriminated by the primer.
3. The method of claim 1 or claim 2, wherein a matrix of primers and pairs of primer extensions is prepared in binary form and is subjected to analysis by a set covering problem (SCP) algorithm.
4. The method of claim 3, wherein a greedy algorithm is used.
5. The method of claim 3, wherein a CFT algorithm is used which involves a Lagrangian relaxation heuristic.
6. The method of any one of claims 3 to 5, wherein a set of core primers is selected as a base for analysis by the SCP algorithm.
7. The method of any one of claims 3 to 6, wherein the set of extendible primers identified by the SCP algorithm is subjected to a redundancy check.
8. A set of extendible primers, for use in the identification, typing or classification of a nucleic acid of known sequences having known polymorphisms, identified by the method of any one of claims 1 to 7.
9. The set of extendible primers of claim 8, in the form of an array.
10. The set of extendible primers of claim 8 or claim 9, for use in the identification, classification or typing of an organism, allele or gene selected from class 1 HLA, class 2 HLA and 16S rRNA.
11. The set of extendible primers of any one of claims 8 to 10, wherein the primers are arrayed on a surface of a support in such a way that recognisable patterns are formed with different types or alleles.
12. A set of extendible primers, for use in the identification, typing or classification of a human leucocyte antigen (HLA) gene as indicated, the set comprising about the number of primers indicated and being capable of distinguishing about the number of alleles indicated:
HLA gene             Number of Alleles    Number of Primers
Class I
  HLA-A              91                   172
  HLA-B              200                  <1000
  HLA-C              47                   94
Class II
  DPA-1              11                   26
  DPB-1              74                   130
  DQA-1              17                   130
  DQB-1              34                   84
  DRB-1              192                  <1000
  DRB345             35                   94
13. A set of extendible primers, for use in the identification, typing or classification of 16S rRNA, wherein the set comprises about 210 primers and is capable of distinguishing at least about 1207 different sequences.
14. The set of extendible primers of claim 12 or claim 13, wherein the primers have variable segments substantially as set out in appendix 1 or appendix 2.
15. A method of identification, typing or classification of a nucleic acid of known sequence having known polymorphisms, by the use of the set of extendible primers as claimed in any one of claims 8 to 14, which method comprises applying the nucleic acid or fragments thereof to the set of extendible primers under hybridisation conditions, and effecting template-directed chain extension of extendible primers that have formed hybrids.
16. The method of claim 15, wherein the set of extendible primers is provided in the form of an array, and template-directed chain extension is effected using labelled chain-terminating nucleotide analogues.
17. The method of claim 16, wherein template-directed chain extension is effected using four different fluorescently-labelled chain terminating nucleotide analogues, and the results are analysed by total internal reflection fluorescence or confocal microscopy.
18. The method of any one of claims 15 to 17, wherein the nucleic acid is a PCR amplimer.
19. The method of any one of claims 15 to 18, wherein the nucleic acid is HLA Class 1 or HLA Class 2 or 16S rRNA or a PCR amplimer thereof.
20. The method of any one of claims 15 to 19, wherein a dUTP/uracil-DNA-glycosylase system is used to break the nucleic acid into fragments.
21. A kit for use in the identification, typing or characterisation of a nucleic acid of known sequence having known polymorphisms, comprising the set of extendible primers as claimed in any one of claims 8 to 14.
22. The kit of claim 21, comprising also a pair of primers for effecting PCR amplification of the nucleic acid.
23. An array of sets of extendible primers as claimed in any one of claims 8 to 14, for the simultaneous identification, typing or classification of two or more different HLA genes.
24. A computer readable storage medium having a program recorded thereon, wherein the program consists of instructional steps for identifying a set of extendible primers for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms, the steps comprising: i) identifying all possible nucleotide sequences of a chosen length of the nucleic acid and their corresponding extendible primers, ii) removing at least one extendible primer from the set wherein the at least one primer removed identifies a segment of the nucleic acid identified by at least one other primer.
25. Computer readable program implement consisting of instructional steps for identifying a set of extendible primers for use in the identification, typing or classification of a nucleic acid of known sequence having known polymorphisms, the steps comprising: i) identifying all possible nucleotide sequences of a chosen length of the nucleic acid and their corresponding extendible primers, ii) removing at least one extendible primer from the set wherein the at least one primer removed identifies a segment of the nucleic acid identified by at least one other primer.
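The two steps recited in claims 24 and 25 can be illustrated with a minimal sketch. It is not the program described in the application: the allele sequences are invented toy data, the function names are hypothetical, and each "primer" is represented simply by the target segment it identifies (strand complementarity and the extended base are ignored). Step ii is implemented here as a plain backward elimination, which is only one of several possible set-covering heuristics.

```python
from itertools import combinations


def candidate_primers(alleles, length):
    """Step i: collect every distinct segment of `length` found in any allele."""
    segments = set()
    for seq in alleles.values():
        for i in range(len(seq) - length + 1):
            segments.add(seq[i:i + length])
    return sorted(segments)


def signature(primer, alleles):
    """Names of the alleles that contain the segment this primer identifies."""
    return frozenset(name for name, seq in alleles.items() if primer in seq)


def reduce_primers(alleles, length):
    """Step ii: drop primers whose information is supplied by other primers.

    A primer is removed when every allele pair that the full candidate set
    could tell apart is still told apart by the remaining primers.
    """
    kept = candidate_primers(alleles, length)
    sigs = {p: signature(p, alleles) for p in kept}

    def separated(a, b, primers):
        # A primer separates two alleles if it hybridises to exactly one of them.
        return any((a in sigs[p]) != (b in sigs[p]) for p in primers)

    # Only pairs that the full set can resolve have to stay resolved.
    pairs = [(a, b) for a, b in combinations(sorted(alleles), 2)
             if separated(a, b, kept)]

    for p in list(kept):
        trial = [q for q in kept if q != p]
        if all(separated(a, b, trial) for a, b in pairs):
            kept = trial
    return kept


# Toy alleles (hypothetical sequences, for illustration only).
toy_alleles = {"A1": "ACGTAC", "A2": "ACGTTC", "A3": "ACCTAC"}
full_set = candidate_primers(toy_alleles, 3)
reduced_set = reduce_primers(toy_alleles, 3)
print(len(full_set), "candidates reduced to", reduced_set)
```

Backward elimination keeps the sketch short, but its result depends on the order in which primers are tried and it is not guaranteed to find the smallest possible set.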
PCT/EP2000/003636 1999-04-26 2000-04-20 Primers for identifying typing or classifying nucleic acids WO2000065088A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU50625/00A AU5062500A (en) 1999-04-26 2000-04-20 Primers for identifying typing or classifying nucleic acids

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP99303215.0 1999-04-26
EP99303215 1999-04-26

Publications (2)

Publication Number Publication Date
WO2000065088A2 true WO2000065088A2 (en) 2000-11-02
WO2000065088A3 WO2000065088A3 (en) 2001-08-09

Family

ID=8241346

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2000/003636 WO2000065088A2 (en) 1999-04-26 2000-04-20 Primers for identifying typing or classifying nucleic acids

Country Status (2)

Country Link
AU (1) AU5062500A (en)
WO (1) WO2000065088A2 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995013396A2 (en) * 1993-11-11 1995-05-18 U-Gene Research B.V. A method for identifying microorganisms, and aids useful thereof
WO1995015400A1 (en) * 1993-12-03 1995-06-08 The Johns Hopkins University Genotyping by simultaneous analysis of multiple microsatellite loci
US5883238A (en) * 1993-03-18 1999-03-16 N.V. Innogenetics S.A. Process for typing HLA-B using specific primers and probes sets
WO1999019509A2 (en) * 1997-10-10 1999-04-22 Visible Genetics Inc. Method and kit for amplification, sequencing and typing of classical hla class i genes

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08308596A (en) * 1995-03-10 1996-11-26 Wakunaga Pharmaceut Co Ltd Detection of hla

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
BEIN G ET AL.: "Rapid HLA-DRB1 genotyping by nested PCR amplification" TISSUE ANTIGENS, vol. 39, 1992, pages 68-73, XP000949259 *
BUNCE M ET AL.: "The PCR-SSP manager computer program: A tool for maintaining sequence alignments and automatically updating the specificities of PCR-SSP primers and primer mixes" TISSUE ANTIGENS, vol. 52, 1998, pages 158-174, XP000972134 *
CEREB N ET AL.: "LOCUS-SPECIFIC AMPLIFICATION OF HLA CLASS I GENES FROM GENOMIC DNA: LOCUS-SPECIFIC SEQUENCES IN THE FIRST AND THIRD INTRONS OF HLA-A, -B, AND -C ALLELES" TISSUE ANTIGENS, DK, MUNKSGAARD, COPENHAGEN, vol. 45, 1995, pages 1-11, XP000197333 ISSN: 0001-2815 *
DATABASE WPI , 1997 Derwent Publications Ltd., London, GB; AN 059711 XP002133096 "Detection and typing of class I MHC HLA-DR antigens - can check multiple specimens easily and type all HLA-DR (D-related) antigens known to be present in the Japanese population" -& JP 08 308596 A (WAKUNAGA SEIYAKU KK), 26 November 1996 (1996-11-26) *
DOI K AND IMAI H: "Greedy algorithms for finding a small set of primers satisfying cover and length resolution conditions in PCR experiments" GENOME INF. SER., vol. 8, 1997, pages 43-52, XP000900076 *
LEVINE J E ET AL.: "SSOP typing of the tenth international histocompatibility workshop reference cell lines for HLA genes" TISSUE ANTIGENS, vol. 44, 1994, pages 174-183, XP000972173 *
LO V M: "HEURISTIC ALGORITHMS FOR TASK ASSIGNMENT IN DISTRIBUTED SYSTEMS" IEEE TRANSACTIONS ON COMPUTERS, US, IEEE INC. NEW YORK, vol. 37, no. 11, 1 November 1988 (1988-11-01), pages 1384-1397, XP000005083 ISSN: 0018-9340 *
METSPALU A ET AL.: "Arrayed primer extension (APEX) for mutation detection using gene specific DNA chips" AMERICAN JOURNAL OF HUMAN GENETICS, vol. 61, no. 4Sup, 1997, page A224 XP000900002 *
OLERUP O ET AL.: "HLA-DQB1 and -DQA1 typing by PCR amplification with sequence-specific primers (PCR-SSP) in 2 hours" TISSUE ANTIGENS, vol. 41, 1993, pages 119-134, XP000972084 *
PIRRUNG M C ET AL.: "Design and use of a solid phase DNA-based computational device to solve satisfiability (SAT) problems" FASEB JOURNAL, vol. 11, no. 9, 1997, page A1214 XP002133095 *
VASKO F J: "An efficient heuristic for large set covering problems" NAVAL RESEARCH LOGISTICS QUARTERLY, vol. 31, 1984, pages 163-171, XP000890158 cited in the application *

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001036679A2 (en) * 1999-11-15 2001-05-25 Hartwell John G METHODS FOR GENERATING SINGLE STRANDED cDNA FRAGMENTS
WO2001036679A3 (en) * 1999-11-15 2001-11-22 John G Hartwell METHODS FOR GENERATING SINGLE STRANDED cDNA FRAGMENTS
WO2001092572A1 (en) * 2000-06-01 2001-12-06 Nisshinbo Industries, Inc. Kit and method for determining hla type
US6713257B2 (en) 2000-08-25 2004-03-30 Rosetta Inpharmatics Llc Gene discovery using microarrays
US7807447B1 (en) 2000-08-25 2010-10-05 Merck Sharp & Dohme Corp. Compositions and methods for exon profiling
WO2002029659A1 (en) * 2000-10-04 2002-04-11 International Reagents Corporation Medical system with the use of dna chips
AU2012200697B2 (en) * 2001-05-07 2015-02-19 Agriculture Victoria Services Pty Ltd Modification of plant and seed development and plant responses to stresses and stimuli (4)
EP2722395A1 (en) 2001-10-15 2014-04-23 Bioarray Solutions Ltd Multiplexed analysis of polymorphic loci by concurrent interrogation and enzyme-mediated detection
WO2003066893A1 (en) * 2002-02-04 2003-08-14 Vermicon Ag Methods for specific rapid detection of pathogenic food-relevant bacteria
US7507568B2 (en) 2002-09-25 2009-03-24 The Proctor & Gamble Company Three dimensional coordinates of HPTPbeta
US7632862B2 (en) 2002-09-25 2009-12-15 Procter & Gamble Company Pharmaceutical compositions that modulate HPTPbeta activity
US7769575B2 (en) 2002-09-25 2010-08-03 Warner Chilcott, LLC Three dimensional coordinates of HPTPbeta
WO2004029289A2 (en) * 2002-09-26 2004-04-08 Roche Diagnostics Gmbh Analysis of the hla class i genes and susceptibility to type i diabetes
WO2004029289A3 (en) * 2002-09-26 2004-07-22 Roche Diagnostics Gmbh Analysis of the hla class i genes and susceptibility to type i diabetes
EP1536021A1 (en) * 2003-11-27 2005-06-01 Consortium National de Recherche en Genomique (CNRG) Method for HLA typing
US8435740B2 (en) 2003-11-27 2013-05-07 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for HLA typing
WO2005052189A3 (en) * 2003-11-27 2005-10-20 Consortium Nat De Rech En Geno Method for hla typing
US20120157347A1 (en) * 2003-11-27 2012-06-21 Commissariat A L'energie Atomique Method for hla typing
WO2005052189A2 (en) * 2003-11-27 2005-06-09 Consortium National De Recherche En Genomique (Cnrg) Method for hla typing
US7820377B2 (en) 2003-11-27 2010-10-26 Commissariat A L'energie Atomique Method for HLA typing
EP2267159A3 (en) * 2003-11-27 2011-08-10 Commissariat à l'Énergie Atomique et aux Énergies Alternatives Method of typing of HLA-B
JP2007512014A (en) * 2003-11-27 2007-05-17 コンソルシャム ナショナル ドゥ ルシェルシュ アン ジェノミック(セーエヌエルジェー) HLA typing method
JP2011200244A (en) * 2003-11-27 2011-10-13 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for hla typing
EP2272986A3 (en) * 2003-11-27 2011-11-02 Commissariat à l'Énergie Atomique et aux Énergies Alternatives Method for HLA typing
US9926367B2 (en) 2006-04-07 2018-03-27 Aerpio Therapeutics, Inc. Antibodies that bind human protein tyrosine phosphatase beta (HPTPbeta) and uses thereof
EP2371865A2 (en) 2006-04-07 2011-10-05 Warner Chilcott Company, LLC Antibodies that bind human protein tyrosine phosphatase beta (HPTP-ß) and uses thereof
EP3252079A1 (en) 2006-04-07 2017-12-06 Aerpio Therapeutics, Inc. Antibodies that bind human protein tyrosine phosphatase beta (hptp-ss) and uses thereof
US11814425B2 (en) 2006-04-07 2023-11-14 Eye Point Pharmaceuticals, Inc. Antibodies that bind human protein tyrosine phosphatase beta (HPTPbeta) and uses thereof
US8524235B2 (en) 2006-04-07 2013-09-03 Aeripo Therapeutics Inc. Method for treating coronary artery disease using antibody binding human protein tyrosine phosphatase beta(HPTPbeta)
US8106078B2 (en) 2006-06-27 2012-01-31 Warner Chilcott Company, Llc Human protein tyrosine phosphatase inhibitors and methods of use
US7622593B2 (en) 2006-06-27 2009-11-24 The Procter & Gamble Company Human protein tyrosine phosphatase inhibitors and methods of use
US8329916B2 (en) 2006-06-27 2012-12-11 Aerpio Therapeutics Inc. Human protein tyrosine phosphatase inhibitors and method of use
US7589212B2 (en) 2006-06-27 2009-09-15 Procter & Gamble Company Human protein tyrosine phosphatase inhibitors and methods of use
US8258311B2 (en) 2006-06-27 2012-09-04 Aerpio Therapeutics Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US8846685B2 (en) 2006-06-27 2014-09-30 Aerpio Therapeutics Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US10463650B2 (en) 2006-06-27 2019-11-05 Aerpio Pharmaceuticals, Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US8895563B2 (en) 2006-06-27 2014-11-25 Aerpio Therapeutics, Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US8946232B2 (en) 2006-06-27 2015-02-03 Aerpio Therapeutics, Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US8188125B2 (en) 2006-06-27 2012-05-29 Aerpio Therapeutics Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US8338615B2 (en) 2006-06-27 2012-12-25 Aerpio Therapeutics Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US9126958B2 (en) 2006-06-27 2015-09-08 Aerpio Therapeutics, Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US7795444B2 (en) 2006-06-27 2010-09-14 Warner Chilcott Company Human protein tyrosine phosphatase inhibitors and methods of use
US9795594B2 (en) 2006-06-27 2017-10-24 Aerpio Therapeutics, Inc. Human protein tyrosine phosphatase inhibitors and methods of use
USRE46592E1 (en) 2006-06-27 2017-10-31 Aerpio Therapeutics, Inc. Human protein tyrosine phosphatase inhibitors and methods of use
US9096555B2 (en) 2009-01-12 2015-08-04 Aerpio Therapeutics, Inc. Methods for treating vascular leak syndrome
US9174950B2 (en) 2009-07-06 2015-11-03 Aerpio Therapeutics, Inc. Compounds, compositions, and methods for preventing metastasis of cancer cells
US9949956B2 (en) 2009-07-06 2018-04-24 Aerpio Therapeutics, Inc. Compounds, compositions, and methods for preventing metastasis of cancer cells
US8883832B2 (en) 2009-07-06 2014-11-11 Aerpio Therapeutics Inc. Compounds, compositions, and methods for preventing metastasis of cancer cells
US8569348B2 (en) 2009-07-06 2013-10-29 Aerpio Therapeutics Inc. Compounds, compositions, and methods for preventing metastasis of cancer cells
US12043664B2 (en) 2011-10-13 2024-07-23 EyePoint Pharmaceuticals, Inc. Methods for treating vascular leak syndrome and cancer
CN118506875A (en) * 2024-07-12 2024-08-16 中国科学院心理研究所 Method, apparatus, medium and program product for the preferred design of RNA viral primers

Also Published As

Publication number Publication date
WO2000065088A3 (en) 2001-08-09
AU5062500A (en) 2000-11-10

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP