WO2000044937A1

WO2000044937A1 - Method and arrangement for determining one or more restriction enzymes for analyzing a nucleic acid or nucleic acid sequence

Info

Publication number: WO2000044937A1
Application number: PCT/NL2000/000056
Authority: WO
Inventors: Augustinus Franciscus Maria Simons; Ignace Joseph Isabella Lasters; Mark Jan Jozef Van Haaren
Original assignee: Keygene N.V.
Priority date: 1999-01-29
Filing date: 2000-01-28
Publication date: 2000-08-03
Also published as: EP1147220A1; AU2467800A

Abstract

A method and an arrangement for determining one or more restriction enzymes for analyzing a nucleic acid sequence of a sample belonging to a species, in accordance with the steps of: a) reading first data relating to a plurality of restriction enzymes specifying at least recognition sequence and cutting pattern per restriction enzyme from a first database; b) reading second data relating to at least a representative number of nucleic acid sequences of the species from a second database; c) determining one or more of the restriction enzymes that, if applied to the nucleic acid sequences, would result in restriction fragments having a size within a user defined window; d) presenting the one or more restriction enzymes to a user.

Description

Method and arrangement for determining one or more restriction enzymes for analyzing a nucleic acid or nucleic acid sequence

Field of the invention

The invention relates to a method and arrangement for determining one or more restriction enzymes for analyzing a nucleic acid or nucleic acid sequence.

The invention further relates to a method and an arrangement for analyzing a nucleic acid or nucleic acid sequence using these one or more restriction enzymes thus determined.

Prior art

Genotyping and/or DNA-fingerprint techniques, such as AFLP and RFLP, usually involve restricting a starting nucleic acid sequence (such as genomic DNA or cDNA) with a specific combination of one or more selected restriction enzymes (referred to as the "restriction enzyme combination" or "REC"), in order to generate a set of specific restriction fragments, which are then (optionally selectively) amplified and analyzed, for instance for the presence of polymorphisms.

Given the total number of available restriction enzymes, there are currently more than three thousand possible restriction enzyme combinations, not all of which will provide equally useful results for a given nucleic acid sequence. Finding a useful restriction enzyme combination is often still a matter of "trial and error", in which several restriction enzyme combinations are tried to determine which combination provides the most informative bands in the final fingerprint. Also, a restriction enzyme combination should preferably provide an "optimal coverage" of the nucleic acid sequence or of the nucleic acid sequences on which the given restriction enzyme combination is applied in a given experiment. The precise meaning of the concept "optimal coverage" is context dependent. We recognize two types of context. In the first context, referred to as "genomic DNA context", one is interested in the DNA fingerprinting of long DNA. Typically, but not exclusively, this DNA is of genomic origin. In the second context one is interested in the characterization, by a suitable fingerprinting technique, of a large ensemble of different nucleic acid sequences. Since these sequences typically originate from the analysis of gene expression profiles, where, for experimental reasons, the cellular pool of messenger RNA is converted into cDNA using the enzyme reverse transcriptase, we refer to this second context as "cDNA context".

In experiments of the type "genomic DNA context", such as in genomic AFLP fingerprinting, where long stretches of DNA are treated with a restriction enzyme combination, the term "optimal coverage" refers to a choice of a restriction enzyme combination which is such that as much as possible of the input DNA is retrieved in DNA fragments that can be detected by the analysis system used. For example, a given analysis system may impose a size range on the fragments it can detect or resolve. A possible size range could be, for example, 50 to 500 nucleotides. In this example, if digestion yields DNA fragments ranging from 10 to 1000 nucleotides, those fragments whose size, expressed as the number of nucleotides in a DNA fragment, is smaller than 50 or larger than 500 may not be detectable. As a consequence, we seek in this example a restriction enzyme combination that maximizes the fraction of input DNA that yields detectable DNA fragments in the size range 50-500 nucleotides.

In experiments of the type "cDNA context", such as in cDNA AFLP fingerprinting, where usually a large collection of different DNA sequences is analyzed, the term "optimal coverage" refers to a choice of a restriction enzyme combination which maximizes the number of different input DNA sequences that can be detected given the constraints of the analysis system used. For example, a given analysis system may impose a size range on the fragments which it can detect or resolve. A possible size range could be, for example, the size range 50 to 500 nucleotides. In this example, DNA fragments that have a size, expressed as the number of nucleotides in a DNA fragment, smaller than 50 or larger than 500 may not be detectable. As a consequence, we seek in this example a restriction enzyme combination that maximizes the number of different DNA sequences that yield one or more DNA fragments in the size range 50-500 nucleotides. In addition, the given analysis system may also impose other constraints. For example, the analysis system may be an electrophoresis system which, due to its resolving power, imposes an upper limit on the number of different fragments that can be well resolved within a given size range. Since an input DNA sequence may yield, upon digestion with a restriction enzyme combination, more than one DNA fragment, it may be desirable to choose the optimal restriction enzyme combination such that the number of different sequences that can be detected within the given size range is maximized while, at the same time, the number of fragments produced per input sequence is minimized.

Summary of the invention

It is an object of the invention to provide a method and an arrangement for determining a priori a suitable restriction enzyme combination for analyzing a given nucleic acid sequence, such as cDNA or genomic DNA, on the basis of available information on the nucleic acid sequence, such as the nucleotide sequence of the entire nucleic acid, or only part thereof.

The invention therefore relates to a method for determining one or more restriction enzymes for analyzing a nucleic acid sequence of a sample belonging to a species, comprising the steps of: a) reading first data relating to a plurality of restriction enzymes specifying at least recognition sequence and cutting pattern per restriction enzyme from a first database; b) reading second data relating to at least a representative number of nucleic acid sequences of the species from a second database; c) determining one or more of the restriction enzymes that, if applied to the nucleic acid sequences, would result in restriction fragments having a size within a user defined window; d) presenting the one or more restriction enzymes to a user.

Through this method one or more optimal restriction enzymes are identified to maximally cover different DNA molecules in, e.g., a fingerprint technology. To our knowledge no such approach has presently been described. There exists a computer program, named PRIMER (S.E. Lincoln, M.J. Daly and E.S. Lander from the MIT Center for Genome Research and the Whitehead Institute for Biomedical Research), that offers a computational procedure to design primers to amplify specific regions of sequenced DNA. However, this program does not address the question addressed by the method of the invention, where one of the questions may, e.g., be how to design fingerprint experiments that maximally cover the different DNA molecules in a given DNA sample.

In one embodiment, the invention relates to a method that comprises the further step of analyzing the nucleic acid sequence of the sample using one or more of the restriction enzymes determined in step c). Generally, such a method will comprise at least one step of restricting the starting nucleic acid sequence with the restriction enzyme(s); optionally one or more steps of (often selectively) amplifying the restriction fragments thus obtained; followed by one or more steps of analyzing the mixture of amplified restriction fragments thus obtained, such as by hybridization or any kind of electrophoresis/autoradiography to generate a DNA fingerprint.

The invention also relates to an arrangement for determining one or more restriction enzymes for analyzing a nucleic acid sequence of a sample belonging to a species, comprising processor means, memory means connected to the processor means for storing data, input means connected to the processor means for inputting data and instructions to the processor means, the arrangement being programmed to carry out the steps of: a) reading first data relating to a plurality of restriction enzymes specifying at least recognition sequence and cutting pattern per restriction enzyme from a first database; b) reading second data relating to at least a representative number of nucleic acid sequences of the species from a second database; c) determining one or more of the restriction enzymes that, if applied to the nucleic acid sequences, would result in restriction fragments having a size within a user defined window; d) presenting the one or more restriction enzymes to a user.

The invention also relates to a computer program product to be loaded by a computer arrangement and comprising data and instructions for determining one or more restriction enzymes for analyzing a nucleic acid sequence of a sample belonging to a species, in accordance with the steps of: a) reading first data relating to a plurality of restriction enzymes specifying at least recognition sequence and cutting pattern per restriction enzyme from a first database; b) reading second data relating to at least a representative number of nucleic acid sequences of the species from a second database; c) determining one or more restriction enzymes that, if applied to the nucleic acid sequences, would result in restriction fragments having a size within a user defined window; d) presenting the one or more restriction enzymes to a user.

Moreover, the invention relates to a computer readable data carrier provided with a computer program product as defined above.

Further aspects of the invention reside in the restriction enzyme(s) and possible combinations determined by the method of the invention.

The restriction enzyme(s) determined by the method of the invention can be used in any known genotyping, DNA-fingerprinting or similar nucleic acid analysis technique, in which a restriction enzyme or combinations of enzymes are used. Examples are RFLP, AFLP, and the DNA subtraction technique described in EP-A-0419 571. In these techniques, the one or more restriction enzymes determined by the method of the invention are used in a manner known per se for the technique used.

With the method and the arrangement of the invention, one or more suitable restriction enzymes can be determined for any nucleic acid sequence to be analyzed, including but not limited to genomic DNA, cDNA, or RNA (which is first "converted" - i.e. by the algorithm- into a corresponding DNA sequence). The sequence can be single stranded or double stranded; if the starting sequence is single stranded, it is usually "converted" by the algorithm into the corresponding double stranded sequence.
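By way of non-limiting illustration, the "conversion" of an RNA or single stranded input into a double stranded DNA representation could be sketched as follows. The function names are hypothetical and are not taken from the described algorithm; the sketch assumes that the input contains only the letters A, C, G, T and U.

```python
# Illustrative sketch only: helper names are assumptions, not part of the source.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rna_to_dna(seq):
    """Replace uracil by thymine so that an RNA sequence can be handled as DNA."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Return the complementary (bottom) strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def to_double_stranded(seq):
    """Return (top strand, bottom strand) for a single stranded DNA or RNA input."""
    top = rna_to_dna(seq)
    return top, reverse_complement(top)
```

For a palindromic site such as the EcoRI recognition sequence, both strands read identically, which is why a search on the top strand alone suffices for such enzymes.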

The starting sequence is preferably genomic DNA or cDNA, such as from the genome of a plant, animal, micro-organism, virus, yeast, fungus or algae, and/or any other nucleic acid sequence. These sequences may contain so-called ORF regions, which may encode proteins.

Advantageously, by the method and with the arrangement of the invention, the suitable restriction enzyme(s) can be determined even if only part of the total sequence or genome to be analyzed is known. For instance, the method and arrangement of the invention make it possible to predict/determine one or more suitable restriction enzymes for a given nucleic acid even when less than 20% of the total sequence is known (but at least 2%, preferably more than 5%, more preferably more than 10%).

Both in the "genomic DNA context" and in the "cDNA context", the method of the invention makes it possible to determine one or more restriction enzymes that give the highest coverage (in percentages) even when less than 20% of the total sequence is known (but preferably more than 5%, more preferably more than 10%). Therefore only a small part of the total "genomic DNA context" or of the total "cDNA context" is sufficient to "simulate" the whole of the DNA context.

The method of the invention is based on the following features of the nucleic acid sequence to be analyzed, which can be used as input/parameters for the algorithm:
- nucleotide sequence;
- percentage GC and/or AT;
- number of open reading frames (ORF);
- introns/exons;
- known genes (e.g. structural genes);
- known regulatory sequences;
- any other string of (alternative) nucleosides.

The information on the starting sequence is preferably provided in a form suitable for processing by computer, such as in the form of a database, as a computer readable code or signal (analog or digital), and/or on a suitable data carrier, such as a computer disc or CD-ROM. Such information can e.g. be obtained by using an automated nucleic acid sequencer that has analyzed the nucleic acid sequence of the sample (or a part thereof) or from a similar species from a public database in which sequences have been deposited.

The method of the invention can be adapted to determine any desired restriction enzyme combination. For instance, the method and arrangement of the invention can be used to determine - for a given nucleic acid sequence such as a given genome, genomic DNA or cDNA - a suitable combination of two or more "frequent cutters", two or more "rare cutters", or one or more "frequent" and one or more "rare cutters", as well as combinations of one or more frequent or rare cutters with further restriction enzymes. A "frequent cutter" is a restriction enzyme, such as MseI, which in view of its short recognition pattern has a high probability of cutting a DNA molecule at any position. A "rare cutter" is a restriction enzyme, such as EcoRI, which in view of its longer recognition pattern has a lower probability of cutting the DNA at any position. More specifically, the invention can be used to determine a useful combination of:
- one or more restriction enzymes recognising 4 nucleotides (frequent cutters);
- one or more restriction enzymes recognising 5 nucleotides (i.e. including a "wobble" nucleotide);
- one or more restriction enzymes recognising 6 or more, and up to 8 or 10 nucleotides (rare cutters);
- one or more so-called "4.5" cutters;
- one or more methylation sensitive restriction enzymes;
- one or more restriction enzymes having recognition sites in the nucleic sequence different from/outside of the site actually restricted.

Also, the method and arrangement of the invention can be used to determine useful restriction enzymes that provide no overhang, a 3'-overhang or a 5'-overhang, or a combination of such enzymes, again by selecting the proper option in the procedure which implements the method. The desired type and total number of restriction enzymes in the restriction enzyme combination can be predetermined and used as input/parameters for the algorithm. Also, the method and algorithm can predict/determine (a) suitable restriction enzyme(s) to be used in combination with a given predetermined enzyme.

Preferably, the method and/or arrangement of the invention are used to determine a combination of a frequent and a rare cutter for AFLP, RFLP or a similar technique, or a combination of two frequent cutters for AFLP or RFLP.
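The distinction between frequent and rare cutters can be made quantitative with a simple back-of-the-envelope calculation. The sketch below is an illustration only and is not part of the described algorithm: it assumes a random sequence in which the four nucleotides are equally likely and independent, an assumption that real genomes satisfy only approximately. The function names are hypothetical; the degeneracy table follows the standard IUPAC nucleotide codes.

```python
# Number of nucleotides permitted by each IUPAC code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def site_probability(pattern):
    """Probability that a given position starts a match of `pattern`,
    assuming uniform, independent nucleotides (an idealisation)."""
    probability = 1.0
    for code in pattern.upper():
        probability *= IUPAC_DEGENERACY[code] / 4.0
    return probability


def expected_spacing(pattern):
    """Expected distance, in nucleotides, between consecutive sites."""
    return 1.0 / site_probability(pattern)


for name, pattern in [("MseI", "TTAA"), ("BstYI", "RGATCY"), ("EcoRI", "GAATTC")]:
    print(name, pattern, expected_spacing(pattern))
```

Under this idealisation the four-nucleotide pattern TTAA is expected about once every 256 nucleotides, the degenerate pattern RGATCY about once every 1024, and the six-nucleotide pattern GAATTC about once every 4096.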

By the method and arrangement of the invention, a suitable restriction enzyme combination or a number of alternative combinations for the analysis of a given sequence or genome can be determined, or at least information is provided which directly points to a suitable combination. Also, the method and arrangement of the invention can provide or at least predict one or more of the following information/parameters:
- size of the restriction fragments generated (average and/or minimum/maximum);
- number of times a sequence is restricted (both in total and for each restriction enzyme of the combination separately);
- a redundancy calculation, assessing, for any user-defined size range, the number of fragments that are expected per input sequence.

Such information is practically useful in several respects. As a first example, a restriction enzyme combination can be chosen that minimizes the redundancy. As a second example, in a DNA sequencing project the completeness of the sequencing work can be judged by comparing the theoretically expected redundancy (using the method of the invention) with the experimentally determined redundancy (i.e. the number of different fragments that have been detected matching the same full DNA sequence).

The output/parameters determined by the method and arrangement of the invention can, e.g., be provided by means of a suitable display or printer, as a database, a digital or analog signal or code, and/or on a suitable data carrier.

The method of the invention is particularly suitable for determining one or more restriction enzymes useful in methods for:
- genotyping and/or DNA fingerprinting of plant genomes such as tomato or maize, human, animal and microbial genomes;
- transcript imaging using AFLP, such as by cDNA-AFLP, i.e. high-throughput transcript profiling technology for microorganisms, such as yeast but also plant genomes, human, animal;
- expression and expression pattern analysis, i.e. fingerprinting of messenger RNA;
- following development or growth phases/stages;
- "enhanced AFLP" (to be explained hereinafter);
- predicting a restriction fragment pattern (i.e. fingerprint such as AFLP fingerprint) for a specific primer-enzyme combination;
- management of a sequencing project, to predict the number of DNA fragments, derived from larger DNA sequences, that have to be sequenced in order to ensure that a certain percentage of these sequences is encountered in the sequencing process.

Brief description of the drawings

The following experimental section will further illustrate the method and arrangement of the invention. The invention will be explained with reference to some drawings which are only intended to illustrate the invention and not to limit its scope as defined by the accompanying claims.

Figure 1 shows a typical example of output data in case of a "cDNA context" using a publicly available yeast Orf database;

Figure 2 shows a typical example of output data in case of a "genomic DNA context" using a publicly available yeast genomic database;

Figure 3 shows part of an example export file in a computed AFLP fingerprint;

Figure 4 shows a graph showing a relation between percentage of detected sequences (P_s) versus a given percentage of fragments being sequenced (P_N);

Figure 5 shows output data for checking whether or not reliable predictions can be made if only small fractions of the sequences to be analyzed are available;

Figure 6 shows a schematic computer arrangement for carrying out the method of the invention;

Figure 7 schematically shows analyzing a sample with two restriction enzymes in accordance with a special method referred to as "enhanced cDNA-AFLP";

Figure 8 schematically shows a device for analyzing a nucleic acid sequence sample.

Description of preferred embodiments

In figure 6, an overview is given of a computer arrangement that can be used to carry out the method according to the invention. The arrangement comprises a processor 1 for carrying out arithmetic operations.

The processor 1 is connected to a plurality of memory components, including a hard disk 5, Read Only Memory (ROM) 7, Electrically Erasable Programmable Read Only Memory (EEPROM) 9, and Random Access Memory (RAM) 11. Not all of these memory types need necessarily be provided. Moreover, these memory components need not be located physically close to the processor 1 but may be located remote from the processor 1.

The processor 1 is also connected to means for inputting instructions, data etc. by a user, like a keyboard 13, and a mouse 15. Other input means known to persons skilled in the art may be provided too. A reading unit 17 connected to the processor 1 is provided. The reading unit 17 is arranged to read data from and possibly write data on a data carrier like a floppy disk 19 or a CDROM 21. Other data carriers may be tapes, DVD, etc., as is known to persons skilled in the art.

The processor 1 is also connected to a printer 23 for printing output data on paper, as well as to a display 3, for instance, a monitor or LCD (Liquid Crystal Display) screen, or any other type of display known to persons skilled in the art.

The processor 1 may be connected to a communication network 27, for instance, the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), etc. The processor 1 may be arranged to communicate with other communication arrangements through the network 27. For example, through the network 27, input may be provided by Intranet or Internet.

The processor 1 may be implemented as a stand-alone system, or as a plurality of parallel operating processors each arranged to carry out subtasks of a larger computer program, or as one or more main processors with several subprocessors. Parts of the functionality of the invention may even be carried out by remote processors communicating with processor 1 through the network 27.

Detailed description of the used method.

The main flow of the method, as implemented by an algorithm running on the computer arrangement shown in figure 6 (e.g. controlled by software stored in hard disk 5), comprises the following steps.

Step 1. A database of restriction enzymes, e.g., provided on a data carrier 19, 21 or downloaded through the Internet or Intranet, is read into memory 5-11. Each restriction enzyme is characterized by its name, its recognition sequence pattern and the cutting positions, relative to the recognition sequence, in the top and bottom DNA strands. Clearly, the recognition sequence pattern may contain degenerate positions, i.e. positions where more than one nucleotide is compatible with the recognition pattern. In order to render the computational process efficient as well as logically transparent, the recognition pattern is stored as a string wherein all possible nucleotides at each position of the pattern are encoded by a single letter. For example, the recognition pattern for BstYI is encoded as RGATCY, where one recognizes that the first and last positions are degenerate with R = {A, G} and Y = {C, T}.
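By way of illustration, an enzyme record of the kind described in Step 1 could be represented as follows. This is a minimal sketch: the class and field names are hypothetical, the cut offsets are expressed here in top-strand coordinates counted from the start of the pattern (a convention chosen for this sketch, not stated in the source), and the BstYI cut offsets are given on the assumption that it cuts as R^GATCY. The single-letter table follows the standard IUPAC nucleotide codes.

```python
from dataclasses import dataclass

# Standard IUPAC nucleotide codes: one letter per set of permitted nucleotides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


@dataclass(frozen=True)
class RestrictionEnzyme:
    """Hypothetical record for one entry of the enzyme database."""
    name: str
    pattern: str      # recognition pattern, one IUPAC letter per position
    top_cut: int      # cut offset in the top strand, from the pattern start
    bottom_cut: int   # cut offset in the bottom strand, in top-strand coordinates


def matches(pattern, site):
    """True if the concrete sequence `site` is compatible with `pattern`."""
    return len(pattern) == len(site) and all(
        base in IUPAC[code] for code, base in zip(pattern, site)
    )


BSTYI = RestrictionEnzyme("BstYI", "RGATCY", top_cut=1, bottom_cut=5)
```

With this encoding, both AGATCT and GGATCC are compatible with the BstYI pattern, whereas CGATCT is not, because C is not a member of R.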

Step 2. Subsequently, for each restriction enzyme the longest contiguous part of the recognition sequence that does not contain any degenerate position is determined, referred to as "core" of the recognition sequence, and stored into memory 5-11. For example, in BstYI the core recognition sequence is GATC. This core is introduced to implement a fast (see step 5) searching strategy of the recognition pattern of a given restriction enzyme.
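The determination of the core can be sketched as a single pass over the pattern. The function name `core_of` is hypothetical, and the choice to keep the first run when two runs are equally long is an assumption of this sketch; the source does not state how ties are resolved.

```python
def core_of(pattern):
    """Return (offset, core), where core is the longest contiguous run of
    non-degenerate letters (A, C, G, T) in `pattern` and offset is its
    position within the pattern. On ties, the first run is kept."""
    pattern = pattern.upper()
    best_start, best_len = 0, 0
    run_start = None
    # A trailing degenerate sentinel closes a run that reaches the pattern end.
    for i, code in enumerate(pattern + "N"):
        if code in "ACGT":
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, pattern[best_start:best_start + best_len]


print(core_of("RGATCY"))
```

For RGATCY this yields offset 1 and core GATC, in agreement with the BstYI example above; for a fully non-degenerate pattern such as GAATTC, the core is the whole pattern.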

Step 3. Often, the restriction enzyme combination used is composed of two different restriction enzymes, where each restriction enzyme produces a specific type of DNA end to which a, usually small, nucleic acid molecule, often referred to as an adapter, can be specifically ligated. In cases such as AFLP fingerprinting, where generally two different adapters are used, it may be advantageous to identify a restriction enzyme combination such that each adapter can ligate to only one type of DNA end, thereby preventing so-called adapter scrambling in the ligation process wherein site-specific adapters are ligated to the restriction fragment. Therefore, a list is made containing pairs of restriction enzymes that may lead to scrambling of adapters. By default, these pairs will be discarded from the analysis. For example, the pair ApoI (recognition sequence RAATTY), EcoRI (recognition sequence GAATTC) is put in this list because both generate, after restriction enzyme digestion, the overhang AATT. Hence, upon ligating with, for example, an adapter which is intended to ligate specifically to the EcoRI site, scrambling might occur since this adapter will also recognize the ApoI overhang. By eliminating these pairs from the analysis process two important goals are achieved. First, the user will not be tempted to perform experiments that will lead, due to scrambling, to useless and uninterpretable results. Second, the efficiency of the computational process increases because irrelevant pairs are no longer taken into consideration. However, in the algorithm, the user has the option to impose that all restriction enzyme pairs, including those with compatible overhangs, are used in the coverage analysis. In other words, Step 3 may be omitted.
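One possible way of compiling the list of scrambling pairs is sketched below. This is an illustration under stated assumptions, not the procedure of the source: enzymes are given as hypothetical tuples (name, pattern, top cut, bottom cut) with both cuts inside the recognition pattern and expressed in top-strand coordinates, and two enzymes are flagged when their overhangs have the same type and length and could be identical at every position. The cut offsets listed for ApoI, EcoRI and MseI assume cleavage as R^AATTY, G^AATTC and T^TAA respectively.

```python
from itertools import combinations

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def overhang(pattern, top_cut, bottom_cut):
    """Return (type, overhang pattern) for cuts lying inside the pattern."""
    if top_cut < bottom_cut:
        return "5'", pattern[top_cut:bottom_cut]
    if top_cut > bottom_cut:
        return "3'", pattern[bottom_cut:top_cut]
    return "blunt", ""


def may_scramble(enzyme_a, enzyme_b):
    """True if the ends produced by the two enzymes could be ligated to the
    same adapter. Two blunt cutters are treated as compatible as well."""
    kind_a, over_a = overhang(*enzyme_a[1:])
    kind_b, over_b = overhang(*enzyme_b[1:])
    if kind_a != kind_b or len(over_a) != len(over_b):
        return False
    return all(set(IUPAC[x]) & set(IUPAC[y]) for x, y in zip(over_a, over_b))


def scrambling_pairs(enzymes):
    """List the names of all enzyme pairs to be discarded by default."""
    return [(a[0], b[0]) for a, b in combinations(enzymes, 2) if may_scramble(a, b)]


ENZYMES = [
    ("ApoI", "RAATTY", 1, 5),
    ("EcoRI", "GAATTC", 1, 5),
    ("MseI", "TTAA", 1, 3),
]
print(scrambling_pairs(ENZYMES))
```

With these three enzymes only the pair ApoI/EcoRI is flagged, since both leave the 5'-overhang AATT, whereas MseI leaves the shorter overhang TA.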

Step 4. Next, the user can specify a database of sequences of interest. This database also carries a semantic definition of the preferred context of usage, either "cDNA context" or "genomic DNA context". In case no known sequences are available, a population of random DNA fragments can be generated. This population can be assigned either a "cDNA context" or a "genomic DNA context". The user can specify (a) the number of fragments, where the total number of fragments is limited only by the amount of memory that is available to store these fragments, (b) the parameters of a Gaussian length distribution of these random fragments (mean and standard deviation of the length) and (c) the %AT, %CG contents of the random sequence.

Special care is taken to handle large databases of known sequences efficiently. Indeed, with the advent of high throughput sequence facilities, there is a rapid increase in the amount of sequence data that becomes available. Thus, special steps need to be taken by the algorithm (implementing the method) in order to keep the computational requirements feasible even on databases of large size containing possibly 10⁹ or more base pairs of DNA sequence. To increase the efficiency of the sequence processing, a computational procedure, referred to as PrepSeq, has been developed to parse any raw sequence file such that the annotation and sequence information fields are segregated in different structures. PrepSeq operates prior to applying the algorithm implementing the method. For each different sequence database, PrepSeq needs to be executed only once. The annotation information and the indices to the corresponding sequences are read by the algorithm implementing the method and kept in memory. This information is used to provide additional qualifiers in some of the output files of the algorithm implementing the method. Thus, while reading the DNA sequences, the algorithm implementing the method no longer has to parse annotation from sequence information.
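The generation of a random population as described at the start of Step 4 could be sketched as follows. The function and parameter names are hypothetical. The sketch further assumes that A and T are equally likely within the AT fraction (likewise C and G within the CG fraction), that nucleotides are drawn independently, and that lengths drawn below `min_len` are clamped to `min_len`; none of these details is specified in the source.

```python
import random


def random_population(n_fragments, mean_len, sd_len, at_fraction,
                      seed=None, min_len=1):
    """Generate `n_fragments` random DNA sequences whose lengths follow a
    Gaussian distribution and whose composition follows `at_fraction`."""
    rng = random.Random(seed)
    p_at = at_fraction / 2.0          # probability of A, and of T
    p_cg = (1.0 - at_fraction) / 2.0  # probability of C, and of G
    weights = [p_at, p_cg, p_cg, p_at]  # order matches "ACGT"
    population = []
    for _ in range(n_fragments):
        length = max(min_len, round(rng.gauss(mean_len, sd_len)))
        population.append("".join(rng.choices("ACGT", weights=weights, k=length)))
    return population


# Example: 1000 fragments, mean length 1500, standard deviation 400, 60 %AT.
sample = random_population(1000, 1500, 400, 0.60, seed=42)
```

A fixed seed makes such a population reproducible, which is convenient when the same random set is to be analyzed with several restriction enzyme combinations.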
To allow the algorithm implementing the method to be applied to arbitrarily large databases, an important feature may be added. Preferably, the algorithm implementing the method will not carry out its analysis only after first reading, and thus storing into memory, the full sequence of a given DNA molecule. While such a procedure is well suited to processing short sequences, this approach will break down for long sequences. First, the memory needs may become excessive. Second, the time needed to search for restriction enzyme fragments in a long sequence may become excessive too. Therefore, the algorithm implementing the method can carry out its analysis concurrently with reading a given sequence into memory. More precisely, PrepSeq will chop a given sequence into segments of a predefined size. The segment size is taken large enough such that any restriction fragment having a size within a certain predefined size range (typically corresponding to the size range that is well observable in a given experimental set-up) can span at most two consecutive sequence segments. For example, if PrepSeq chops any sequence into segments of 2000 nucleotides, then any restriction fragment that is detectable in, say, a standard gel electrophoretic system will be contained within a single segment or will cross at most one segment boundary. Thus, while reading a given segment, the algorithm implementing the method can immediately locate all possible restriction sites in this segment. While proceeding to the next segment, the algorithm implementing the method needs to memorize only the previous segment in order to treat border effects properly (e.g. a restriction site that is made up of nucleotides from two segments) and to search for restriction fragments that span the border of a given segment.

In doing so, (a) the computational needs grow only linearly with the length of a given sequence (even if this sequence has hundreds of millions of bases), (b) at no moment does a full sequence need to be memorized, and (c) the bookkeeping problem remains manageable since only the current and the preceding segments have to be memorized. Clearly, in view of this strategy the algorithm implementing the method also becomes useful for genomic DNA applications where huge nucleic acid sequences are to be dealt with.
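The treatment of border effects can be illustrated with the following simplified sketch, which is not PrepSeq itself. It only locates the occurrences of a non-degenerate site in a stream of segments, and it retains merely the last len(site) - 1 nucleotides of the data read so far, which suffices for detecting sites that straddle a segment boundary; the source instead keeps the whole preceding segment so that boundary-spanning fragments can be handled as well. The function names are hypothetical.

```python
def chop(sequence, segment_size):
    """Yield consecutive segments of at most `segment_size` nucleotides."""
    for start in range(0, len(sequence), segment_size):
        yield sequence[start:start + segment_size]


def find_sites_streaming(segments, site):
    """Return the start positions of `site` in the concatenated segments,
    without ever holding the full sequence in memory."""
    positions = []
    carry = ""   # tail of the data read so far, shorter than one site
    offset = 0   # position, in the full sequence, of the first letter of carry
    keep = len(site) - 1
    for segment in segments:
        window = carry + segment
        hit = window.find(site)
        while hit != -1:
            positions.append(offset + hit)
            hit = window.find(site, hit + 1)
        # The carry is shorter than the site, so no hit is reported twice.
        carry = window[-keep:] if keep else ""
        offset += len(window) - len(carry)
    return positions


sequence = "AAGAATTCAAAAGAATTCGG"
print(find_sites_streaming(chop(sequence, 5), "GAATTC"))
```

In this example both occurrences of GAATTC cross a boundary between segments of five nucleotides, yet they are found at positions 2 and 12, exactly as in a search over the unsegmented sequence.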

Step 5. We now compute all restriction fragments that have a size within a user defined window, for all pairs of restriction enzymes stored in memory 5-11, except for the pairs that may have been discarded in Step 3. To compute these fragments, first the single restriction enzyme cutting sites are searched. To optimize the efficiency of this process, the recognition core pattern (which is devoid of degenerate nucleotides) is located by a direct substring search. Next, for each hit of the core pattern, the surrounding degenerate positions, if any, are compared with the sequence under consideration. A list of cut positions for each restriction enzyme is memorized. Then, for each possible pair of restriction enzymes, the lists with cutting positions (taking into account the present and the preceding segments) are merged, and new restriction fragments are searched that match both restriction enzymes at their ends and satisfy the size criterion. If at least one such fragment is found in a given sequence, the sequence is considered to be covered by the given pair of restriction enzymes. If the algorithm implementing the method is asked to compute the expected fingerprint for a specific pair of restriction enzymes, the sequence of each of the restriction fragments will be stored in memory for subsequent output to screen, disc file or other means of output or display.
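The two-stage search of Step 5 can be sketched for a single, unsegmented sequence as follows. The function names are hypothetical, and several simplifications are assumptions of this sketch rather than statements of the source: only the top strand is considered, the fragment length is measured between top-strand cut positions, and a fragment is retained when it lies between two neighbouring cuts made by different enzymes.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_cut_positions(sequence, pattern, core_offset, core, top_cut):
    """Locate the core by plain substring search, then verify the
    surrounding degenerate positions; return the top-strand cut positions."""
    cuts = []
    hit = sequence.find(core)
    while hit != -1:
        start = hit - core_offset
        site = sequence[start:start + len(pattern)] if start >= 0 else ""
        if len(site) == len(pattern) and all(
            base in IUPAC[code] for code, base in zip(pattern, site)
        ):
            cuts.append(start + top_cut)
        hit = sequence.find(core, hit + 1)
    return cuts


def pair_fragments(sequence, cuts_a, cuts_b, min_len, max_len):
    """Merge both cut lists and return the fragments that are bounded by one
    cut of each enzyme and whose length lies within [min_len, max_len]."""
    merged = sorted([(pos, "A") for pos in cuts_a] + [(pos, "B") for pos in cuts_b])
    fragments = []
    for (left, enzyme_left), (right, enzyme_right) in zip(merged, merged[1:]):
        if enzyme_left != enzyme_right and min_len <= right - left <= max_len:
            fragments.append(sequence[left:right])
    return fragments


sequence = "CCCCGAATTC" + "ACGT" * 5 + "TTAACCCC"
eco_cuts = find_cut_positions(sequence, "GAATTC", 0, "GAATTC", 1)
mse_cuts = find_cut_positions(sequence, "TTAA", 0, "TTAA", 1)
print(pair_fragments(sequence, eco_cuts, mse_cuts, 10, 500))
```

In this toy sequence the EcoRI site is cut at position 5 and the MseI site at position 31, giving a single fragment of 26 nucleotides; with a size window of 50 to 500 nucleotides the sequence would not be covered by this pair.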

Step 6. The procedure described in Step 5 is repeated for all sequences in the specified database, yielding a list of restriction fragments. Subsequently, the coverage is computed. If the DNA context specified in Step 4 is of the type "cDNA context", the coverage is computed as Coverage = (N / T) × 100, where N = number of molecules that give rise to one or more restriction fragments of which at least one has a length that falls within a user-defined size range, and T = the total number of different molecules contained in the studied DNA sample. The size range usually corresponds to the size range that can be well analyzed in a given detection system. The total number of restriction fragments, the average fragment length and the redundancy are also computed. The redundancy is defined as the average number of fragments produced per input full-length DNA molecule, counting only fragments whose size lies within the same size range as used for the coverage computation. In the case of the "genomic DNA context", e.g. the yeast chromosomes, a suitable definition of the coverage percentage is automatically applied. The same formula is used, but the symbols N and T have another meaning: N = total number of nucleotides present in the restriction fragments; T = total number of nucleotides in the sequences in the database entered in Step 4.

Step 7. In this optional step the user may choose to compute the expected fingerprint for a specific pair of restriction enzymes. The restriction fragments found (satisfying the length criterion) are stored in memory. Specifically, for AFLP applications, the user can determine, via a simple interactive mechanism, which selective nucleotides are desired. The AFLP fingerprint is then computed by selecting from this list of fragments those that match the specified selective nucleotides. This process is extremely fast and avoids recomputing the restriction fragments each time another pattern of selective nucleotides is specified.

Even in situations where the database provides only partial sequence information, the computation of such a fingerprint may be useful. For example, these expected fingerprints will at least reveal which fragments are to be expected from the nucleotide sequences already present in the database. As a consequence, fragments whose length is markedly different from those fragments can preferably be selected for DNA sequencing, thereby enhancing the process of identifying novel nucleotide sequences.
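The selection of Step 7, in which stored fragments are filtered by selective nucleotides rather than recomputed, may for instance be sketched as follows. This is a sketch under assumptions: the `Fragment` record and the function `select` are hypothetical names, the selective nucleotides of the first primer are compared with the bases directly following the REI recognition sequence on the labeled strand, and those of the second primer with the reverse complement of the bases directly preceding the REII recognition sequence.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a nucleotide string."""
    return s.translate(COMPLEMENT)[::-1]

@dataclass(frozen=True)
class Fragment:
    length: int
    mini_seq_1: str  # bases directly after the REI recognition sequence (labeled strand)
    mini_seq_2: str  # bases directly before the REII recognition sequence (same strand)

def select(fragments, sel1, sel2):
    """Keep fragments whose ends match the selective nucleotides sel1 and sel2."""
    return [f for f in fragments
            if f.mini_seq_1.startswith(sel1)
            and revcomp(f.mini_seq_2).startswith(sel2)]

# The fragment list is computed once; each new selective pattern is only a filter.
cached = [Fragment(120, "ACG", "TTG"), Fragment(200, "AGG", "CCA"), Fragment(90, "CCA", "GTT")]
print(select(cached, "A", "C"))
print(select(cached, "", ""))
```

Because only the cached list is scanned, trying another pattern of selective nucleotides costs one pass over the fragments and no new restriction site search.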

Step 8. Finally, an output of the results is produced. In the case of the coverage analysis, an ordered list (ordered by coverage percentage) is generated containing, for each restriction enzyme pair, the percentage coverage and the items described above: total number of fragments, average fragment length and redundancy (the latter item is applicable only for the "cDNA context"). This output can readily be used to identify a restriction enzyme combination that has a high coverage and a desired level of redundancy. If in Step 7 it was requested to compute a fingerprint for a specific pair of restriction enzymes and a user-defined pattern of selective nucleotides, the output will also show a list of fragments ordered by length. For each fragment in this list, a number of fields are written to the output. A typical output may contain the following fields:

(1) ID number: identification number of the fragment;

(2) Length: length in base pairs of the fragment;

(3) %AT: percentage A+T in AFLP fragment;

(4) %CG: percentage C+G in AFLP fragment;

(5) NrSameL: number of fragments in the fingerprint that have exactly the same length;

(6) RecogSeqI: 5' end of the labeled fragment showing the sequence recognized by the first specified restriction enzyme, referred to as REI;

(7) MiniSeqI: small stretch of nucleotides succeeding the REI recognition sequence. This information is valuable for the user because it makes it easy to identify additional selective bases that are specific for the given fragment;

(8) MiniSeqII: small stretch of nucleotides preceding the recognition sequence of the second specified restriction enzyme, referred to as REII. This information is valuable for the user because it makes it easy to identify additional selective bases that are specific for the given fragment;

(9) RecogSeqII: sequence pattern recognized by REII;

(10) Idtag: identification code, if available, of the full nucleotide sequence (from which the fragment is derived) in the sequence database;

(11) FctDescr: annotation description, if available, of the full nucleotide sequence (from which the fragment is derived) in the sequence database;

(12) NrFragWithSameID: number of fragments in the fingerprint derived from the same sequence. This column provides redundancy information;

(13) NrFragWithSameFct: number of fragments derived from sequences having the same functional descriptor as noted for the current fragment. This column is useful for subsequent data-mining questions in which fragments are searched (in the output file) that share a common functional property;

(14) FullSequence: full nucleotide sequence of the fragment. This column is useful because it can be used in further bioinformatics handling processes, e.g. BLAST searches of the sequences shown in this column against public or private DNA sequence databases.
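As an illustration of how several of these fields could be derived from a list of fragments, a minimal sketch is given below. The function `describe` and the identifiers `seq1` and `seq2` are hypothetical; it is assumed that the counts NrSameL and NrFragWithSameID include the fragment itself, that fragments are non-empty and that they contain only the bases A, C, G and T.

```python
from collections import Counter

def describe(fragments):
    """fragments: list of (idtag, sequence); returns output rows ordered by length."""
    same_len = Counter(len(seq) for _, seq in fragments)
    same_id = Counter(idtag for idtag, _ in fragments)
    rows = []
    ordered = sorted(fragments, key=lambda f: len(f[1]))
    for number, (idtag, seq) in enumerate(ordered, start=1):
        at = seq.count("A") + seq.count("T")
        cg = seq.count("C") + seq.count("G")
        rows.append({
            "ID": number,
            "Length": len(seq),
            "%AT": 100.0 * at / len(seq),
            "%CG": 100.0 * cg / len(seq),
            "NrSameL": same_len[len(seq)],        # includes the fragment itself
            "Idtag": idtag,
            "NrFragWithSameID": same_id[idtag],   # includes the fragment itself
            "FullSequence": seq,
        })
    return rows

rows = describe([("seq1", "AATTGG"), ("seq1", "ACGT"), ("seq2", "GGCC")])
for row in rows:
    print(row)
```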

The method has been implemented in a computer program stored in memory 5-11 and named REcomb. REcomb is an acronym for Restriction Enzyme COMBinatorics. This program is written in Delphi 3, a modern object-oriented programming language. However, any other programming language could be used to implement the REcomb algorithm. The REcomb program provides the user with a number of options. These options do not modify the method described above but are intended to allow flexible use of it. Important options are:

(a) the elimination of sequences that yield restriction fragments for any of a user-defined list of restriction enzyme pairs. This option is particularly useful because it allows the search for the optimal set of restriction enzymes to be carried out on sequences that were not yet covered by another combination of restriction enzymes.

(b) the definition of mixtures of restriction enzymes, each of which becomes equivalent to a 'super restriction enzyme' in the REcomb algorithm. This may be of practical use in the case of, for example, the cDNA AFLP technique, where such mixtures could be used in order to enhance the coverage.

(c) the elimination of sequences on the basis of a length criterion. This criterion acts as a filter. The REcomb computations are then carried out on the remaining sequences. This option makes it possible to focus on a subset of the sequences.

(d) a mode in which, for each DNA molecule, only the 3'-most restriction fragment, if existing, is retained for the subsequent computational process. In such a case a DNA molecule is considered as covered by a given set of restriction enzymes if (i) a 3'-terminal restriction fragment exists and (ii) this fragment matches the user-defined size criterion. This option is of high practical use when searching for an optimal combination of restriction enzymes for applications that are based on the enhanced cDNA AFLP technique.

Analyzing a sample with restriction enzymes.

In accordance with the invention, the one or more restriction enzymes determined by the method and arrangement described above are, preferably, used to analyze a nucleic acid sequence of a sample. To that end, the following steps schematically indicated in figure 8 are, e.g., carried out:

(a) a DNA sample representing one or more nucleic acid sequences is provided;

(b) the DNA sample is restricted with the one or more restriction enzymes, resulting in an amended DNA sample containing restriction fragments;

(c) optionally, the amended DNA sample is treated with a compound such as a suitable fluorescent dye or radioactive label in order to facilitate detection in step (f);

(d) the restriction fragments of the amended DNA sample are physically separated using a physical property such as size, mass, shape, hydrophobicity, etc. Examples of systems used are a polyacrylamide gel electrophoresis system, a mass spectrometry system, an HPLC system, and a capillary gel electrophoresis system;

(e) optionally, the separated fragments from step (d) are treated with a compound such as a suitable fluorescent dye or radioactive label to facilitate detection in step (f);

(f) a detection system measures the separated restriction fragments on the basis of a detectable property of DNA, e.g. absorbance in UV light, or a property that has been added to the DNA in step (c) and/or (e). The detection device may be part of the separation device used in step (d), e.g. a capillary electrophoresis system, or may be independent of the device used in step (d): a digital scanner could, for instance, be used operating on an image recorded on an image plate obtained by exposing radioactively labeled fragments separated in step (d) in a gel electrophoresis system;

(g) data as to the separated restriction fragments detected in step (f) is output to a user, for instance, through a printer or a monitor or on a diskette or other data carrier.

Preferably, the data is stored in a computer memory for later further analysis. Output may be in the form of images, peak profiles, curves, a datatable, etc.

In figure 8, a computer X. and a robot Y. are schematically shown. The computer carries out all controlling and analyzing tasks as is known to persons skilled in the art. The robot moves the samples as known to persons skilled in the art.

"Enhanced cDNA AFLP"

In the case of cDNA samples, the method and arrangement of the present invention can advantageously be used in a mode referred to as "enhanced cDNA AFLP". In this mode, for each DNA molecule only the 3'-most restriction fragment, if existing, is retained for the subsequent computational process. In such a case, a DNA molecule is considered as covered by a given set of restriction enzymes if (i) a 3'-terminal restriction fragment exists and (ii) this fragment matches the user-defined length criterion.

However, the enhanced cDNA AFLP method can also be applied in an actual experiment, by: (a) digesting cDNA with a first restriction enzyme, (b) capturing or isolating the 3'-terminal cDNA restriction fragments and preferably removing all 5' restriction fragments, (c) digesting said 3'-terminal cDNA restriction fragments with a second restriction enzyme, (d) eliminating the 3'-terminal cDNA restriction fragments and (e) subjecting the remaining restriction fragments to the AFLP method. Preferably only those restriction fragments flanked by a restriction site of the first restriction enzyme and by a restriction site of the second restriction enzyme are subjected to the AFLP method. The enhanced cDNA method allows the generation of only one restriction fragment per cDNA molecule.

Thus, enhanced AFLP generally comprises the steps of:

(a) providing at least one cDNA;

(b) attaching said cDNA to a carrier, such as a bead, most preferably at the 3' end of said cDNA;

(c) restricting said at least one cDNA attached to said carrier with a first restriction enzyme, so as to provide a first restriction fragment that is (still) attached to the carrier at its 3'-end; and at least one second restriction fragment that is no longer attached to the carrier;

(d) removing said at least one second restriction fragment not attached to the carrier, e.g. by washing;

(e) restricting the first restriction fragment (still) attached to the carrier with a second restriction enzyme, which is most preferably different from said first restriction enzyme, so as to provide a third restriction fragment (still) attached to the carrier at its 3'-end; and at least one fourth restriction fragment that is no longer attached to the carrier;

(f) detecting the at least one second, the third and/or the at least one fourth restriction fragment; optionally after amplification of said fragment(s), e.g. by AFLP®-methodology.

Thus, the at least one second restriction fragment(s) obtained in step (c) essentially corresponds to a 5' part of the starting cDNA (e.g. one or more segments starting from the 5' end); whereas the first restriction fragment (still) attached to the carrier essentially corresponds to the 3' part of said cDNA. Accordingly, the at least one fourth restriction fragment(s) obtained in step (e) essentially corresponds to a 5' part of the first restriction fragment; whereas the third restriction fragment (still) attached to the carrier corresponds to the 3' part of said first fragment.

Preferably, in step (f), the at least one fourth restriction fragment is detected. Also, most preferably, in step (f) only one fourth restriction fragment is generated, so as to provide only one specific restriction fragment for each starting cDNA, which may then be amplified and detected.

Attachment of the cDNA to the carrier may be carried out in a manner known per se, and suitable carriers will be clear to the skilled person.
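The computational counterpart of steps (c) to (e) may be sketched as follows. This is one possible reading of the protocol and not a definitive implementation: the function names are hypothetical, cut positions are counted from the 5' end of the cDNA, and the single fragment retained per cDNA is assumed to run from the 3'-most cut of the first restriction enzyme to the nearest downstream cut of the second restriction enzyme, i.e. the released fragment flanked by one site of each enzyme.

```python
def enhanced_fragment(cuts1, cuts2):
    """Return (start, end) of the single fragment retained per cDNA, or None.

    cuts1, cuts2: cut positions of the first and second enzyme, from the 5' end.
    """
    if not cuts1:
        return None                      # first enzyme does not cut: nothing retained
    start = max(cuts1)                   # 3'-most cut of the first enzyme
    downstream = [q for q in cuts2 if q > start]
    if not downstream:
        return None                      # second enzyme does not cut the 3' fragment
    return (start, min(downstream))      # fragment flanked by both enzymes

def is_covered(cuts1, cuts2, min_len, max_len):
    """A cDNA is covered if the retained fragment exists and fits the size window."""
    fragment = enhanced_fragment(cuts1, cuts2)
    return fragment is not None and min_len <= fragment[1] - fragment[0] <= max_len

print(enhanced_fragment([100, 400], [150, 520, 700]))   # (400, 520)
print(is_covered([100, 400], [150, 520, 700], 50, 500)) # True
```

Note that the cut of the second enzyme at position 150 is ignored, because it lies in the 5' part that is washed away after the first digestion.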

Examples

Example 1. REcomb: a general procedure to identify a set of restriction enzymes that maximally covers a pool of DNA molecules.

Often one wants to study the composition of a given sample containing a heterogeneous mixture of different DNA molecules. For example, in studies where one addresses the expression of genes, one may first isolate polyA mRNA from a given tissue. Subsequently, using the reverse transcriptase enzyme, a population of cDNA molecules is obtained. The question then arises which different cDNA molecules are contained in the sample.

A now well-established experimental approach is to cut the given DNA sample with a set of restriction enzymes, yielding a mixture of restriction fragments derived from the starting sample. This set of restriction fragments can then be characterized by RFLP, AFLP or other fingerprinting techniques. For example, to produce a fingerprint one could digest the DNA sample with, say, EcoRI and MseI, then ligate adapters to the restriction fragments produced and subsequently amplify the EcoRI/MseI fragments with suitable primers, followed by separation of the fragments in an electrophoresis or other system of choice. But clearly, depending on the choice of restriction enzymes, a fraction of the initial DNA molecules will not be detected in the subsequent fingerprinting, and hence these molecules are 'uncovered' by the set of restriction enzymes used. In an optimally designed experiment we therefore seek to identify a combination of restriction enzymes that maximally covers the different DNA molecules. The coverage can be defined as the percentage of the different DNA molecules contained in the given DNA sample that can be detected in the fingerprint. More precisely,

Coverage = (N / T) × 100

where

N = number of molecules that give rise to one or more restriction fragments of which at least one was detected in the fingerprint;

T = total number of different molecules contained in the studied DNA sample.

If Coverage = 100 then each different DNA molecule in the DNA sample yields at least one detectable restriction fragment.

In the case REcomb is working on genomic DNA, e.g. the yeast chromosomes, a suitable definition of the coverage percentage is automatically applied. The same formula is used but the N and T symbols have another meaning. N = total number of nucleotides present in the restriction fragments. T = total number of nucleotides in the sequences read by REcomb.
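The two coverage definitions, together with the redundancy used in the "cDNA context", may be illustrated by the following minimal sketch. The function names and the input layout (one list of in-window fragment lengths per input sequence) are assumptions made for the example only.

```python
def coverage_cdna(fragments_per_sequence):
    """Coverage = N / T x 100, with N = sequences yielding at least one in-window fragment."""
    total = len(fragments_per_sequence)                         # T
    covered = sum(1 for frags in fragments_per_sequence if frags)  # N
    return 100.0 * covered / total

def redundancy(fragments_per_sequence):
    """Average number of in-window fragments per input sequence."""
    n_fragments = sum(len(frags) for frags in fragments_per_sequence)
    return n_fragments / len(fragments_per_sequence)

def coverage_genomic(fragment_lengths, sequence_lengths):
    """Coverage = N / T x 100, with N and T counted in nucleotides."""
    return 100.0 * sum(fragment_lengths) / sum(sequence_lengths)

# Four cDNA sequences; two of them yield fragments inside the size window.
per_sequence = [[120, 300], [], [75], []]
print(coverage_cdna(per_sequence))                       # 50.0
print(redundancy(per_sequence))                          # 0.75
print(coverage_genomic([120, 300, 80], [1000, 1000]))    # 25.0
```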

To determine the optimal set of restriction enzymes it becomes highly desirable to have a computational procedure. Indeed, the assessment of an optimal set of restriction enzymes is a complex task because it depends on several factors which act concurrently, such as:

• species-specific base usage and DNA sequence composition;

• length of the DNA molecules in the studied sample;

• limits on the size range which can be experimentally detected: e.g. in a given experimental setup, detectable restriction fragments in the fingerprint may have a size range between 50 and 500 nucleotides.

The second factor (DNA length) is especially important in cDNA fingerprint analysis, where the average DNA length may be in the order of 1 kb. The length dependency becomes even more pronounced if, for example, one uses a technique in which for each original cDNA molecule only the 3'-most restriction fragment, if existing, is retained for the fingerprint analysis (enhanced AFLP).

REcomb - given herein below as a specific but non-limiting example of an algorithm of the invention - is a computational procedure that makes it possible to derive the set of restriction enzymes giving maximal coverage of a given sample under study. In addition, without this computational procedure it becomes very hard, if not impossible, to estimate the coverage that has been reached for a given choice of restriction enzymes. In other words, REcomb is also a computational procedure to estimate the coverage completeness for any user-defined combination of restriction enzymes, including deliberately chosen non-optimal combinations.

On the one hand, REcomb uses a database of restriction enzymes describing the recognition sequence and cutting pattern for an arbitrarily large collection of restriction enzymes. On the other hand, REcomb uses a database containing DNA sequences. For example, if one plans to design a fingerprint to maximally cover the cDNA molecules of, say, tomato, then this database should contain a representative number of cDNA sequences for this species. Using the REcomb procedure on either the whole of the about 6000 known yeast Orf sequences or on random subpopulations of varying size taken thereof, it is clear that as little as 5% of the total sequence is needed to carry out accurate coverage predictions. As a consequence, if only a very limited number of sequences is known from a much larger collection of DNA molecules, the REcomb procedure can still be usefully applied to design the fingerprint experiment. In the event that no sequence information whatsoever is available for the species of interest, the REcomb procedure operates on a population of random DNA molecules whose length distribution and base contents (%GC, %AT) can be tuned to match the statistical properties known or estimated for the given species of interest.

The REcomb procedure will compute the coverage for all pairs of restriction enzymes and for a given user-defined size window of restriction fragments. Using a large database of commercially available restriction enzymes, all pairs can be evaluated in interactive time even on a large sequence database containing up to 10 Mbase of DNA (e.g. all of the yeast Orfs). The REcomb procedure is neither theoretically nor practically limited by the size of the sequence database. The required computational time will be linearly proportional to the size of the database.

Upon completion of the coverage computations all studied restriction enzyme pairs are ordered by coverage. These values, together with additional statistical values, are saved for archiving or possible further analysis. A typical example of part of the output in the case of the "cDNA context" is shown in figure 1 using the current publicly available yeast Orf database (K. Heumann et al, The Yeast Genome CD-ROM from MIPS, the Munich Information Centre for Protein Sequences). The first row of this table contains labels for each of the columns of the table shown in figure 1. From left to right these labels have the following meaning:

(a) REI: name of restriction enzyme I in the restriction enzyme combination. Note that only under special circumstances, such as in the case of enhanced cDNA AFLP, the order of the restriction enzymes in the restriction enzyme combination is important.

(b) N_cut: selectivity of REI. N_cut is not necessarily an integer. For example, for Mse I, which recognizes TTAA, N_cut = 4, but for Ava II, which recognizes GGWCC (W = A or T), N_cut = 4.5.

(c) REII: name of restriction enzyme II in the restriction enzyme combination.

(d) N_cut: selectivity of REII.

(e) % coverage: as defined above.

(f) TotNrFragments = total number of fragments produced by the given restriction enzyme pair. The length of each fragment is contained within the user-specified size window.

(g) AvFragLength = mean fragment length.

(h) NrSeq_0_Frag = the number of sequences that yield exactly 0 fragments within the user-specified size range.

(i) NrSeq_1_Frag = the number of sequences that yield exactly 1 fragment within the user-specified size range.

(j) NrSeq_2_Frag = the number of sequences that yield exactly 2 fragments within the user-specified size range.

(k) NrSeq_3_Frag = the number of sequences that yield exactly 3 fragments within the user-specified size range.

(l) NrSeq_4_Frag = the number of sequences that yield exactly 4 fragments within the user-specified size range.

(m) NrSeq_GE5_Frag = the number of sequences that yield more than 4 fragments within the user-specified size range.

Note that columns (h) to (m) provide detailed redundancy information. The rows 2 to 12 of the table in figure 1 show typical data. Note that the rows are ordered by decreasing coverage (see the contents of column 4). Only the first 12 rows are shown in this table. This table may contain several hundreds of rows depending on the size of the database of restriction enzymes used.
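The text does not state how the selectivity N_cut is computed. One definition that reproduces both examples given under (b) counts each fully specified position as 1 and each degenerate position as the base-4 logarithm of 4 divided by the number of bases allowed at that position. The sketch below uses this assumed definition; the function name `n_cut` is hypothetical.

```python
import math

# Number of bases allowed by each IUPAC nucleotide code.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "W": 2, "S": 2, "M": 2, "K": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}

def n_cut(site):
    """Assumed selectivity: sum over positions of log4(4 / number of allowed bases)."""
    return sum(math.log2(4 / DEGENERACY[symbol]) / 2 for symbol in site)

print(n_cut("TTAA"))   # Mse I
print(n_cut("GGWCC"))  # Ava II
```

Under this definition a fully degenerate position (N) contributes nothing, so that, on a random sequence with equal base frequencies, an enzyme with selectivity n is expected to cut about once every 4^n nucleotides.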

A typical example of part of the output in the case of the "genomic DNA context" is shown in figure 2 using the current publicly available yeast genomic database (K. Heumann et al, The Yeast Genome CD-ROM from MIPS, the Munich Information Centre for Protein Sequences). The first row of this table contains labels for each of the columns of the table shown in figure 2. From left to right these labels have the following meaning:

(a) REI: name of restriction enzyme I in the restriction enzyme combination. Note that only under special circumstances, such as in the case of enhanced cDNA AFLP, the order of the restriction enzymes in the restriction enzyme combination is important.

(b) N_cut: selectivity of REI. N_cut is not necessarily an integer. For example, for Mse I, which recognizes TTAA, N_cut = 4, but for Ava II, which recognizes GGWCC (W = A or T), N_cut = 4.5.

(c) REII: name of restriction enzyme II in the restriction enzyme combination.

(d) N_cut: selectivity of REII.

(e) % coverage: as defined above.

(f) TotNrFragments = total number of fragments produced by the given restriction enzyme pair. The length of each fragment is contained within the user-specified size window.

(g) AvFragLength = mean fragment length.

The rows 2 to 12 of the table in figure 2 show typical data. Note that the rows are ordered by decreasing coverage (see the contents of column 4). Only the first 12 rows are shown in this table. This table may contain several hundreds of rows depending on the size of the database of restriction enzymes used.

To optimize and fine-tune the REcomb procedure, a number of options are included:

• by default, restriction enzyme pairs that have compatible overhangs, thereby possibly leading to scrambling of the adapters in the AFLP fingerprint, are discarded from the analysis. If desired, this option can be deactivated.

• sequences whose full length lies outside a given user-tunable size window are discarded from the coverage computation. This makes it possible to focus on a specific subpopulation of the sequence database.

• sequences that produce one or more restriction fragments with a user-defined collection of restriction enzyme pairs can be discarded from the analysis. This option makes it possible to apply the REcomb procedure in an iterative mode. First one may identify the pair that gives the highest coverage. Subsequently, one may identify the pair of enzymes that gives the highest coverage on the remaining, uncovered, sequences. This process can be re-applied as many times as desired. By following this procedure a suite of restriction enzyme pairs can be identified that, together, yield 100%, or very close to 100%, coverage.
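The iterative mode described above amounts to repeatedly choosing the pair with the highest coverage on the sequences not yet covered. A minimal sketch is given below; the function `iterative_cover`, its input layout (for each enzyme pair, the set of indices of the sequences it covers) and the example data are assumptions made for illustration and are not actual REcomb results.

```python
def iterative_cover(covered_by, n_sequences, max_rounds=10):
    """Pick enzyme pairs one at a time, each maximizing coverage of what remains.

    covered_by: dict mapping an enzyme pair to the set of sequence indices it covers.
    Returns the chosen (pair, % coverage of remaining sequences) list and the
    set of sequences left uncovered.
    """
    remaining = set(range(n_sequences))
    chosen = []
    for _ in range(max_rounds):
        if not remaining:
            break
        best = max(covered_by, key=lambda pair: len(covered_by[pair] & remaining))
        gain = covered_by[best] & remaining
        if not gain:
            break                                   # no pair covers anything further
        chosen.append((best, 100.0 * len(gain) / len(remaining)))
        remaining -= gain                           # discard the newly covered sequences
    return chosen, remaining

# Made-up coverage sets for ten sequences.
example = {
    ("EcoRI", "MseI"): {0, 1, 2, 3, 4},
    ("PstI", "TaqI"): {3, 4, 5, 6, 7},
    ("BstYI", "MseI"): {0, 7, 8},
}
suite, uncovered = iterative_cover(example, 10)
print(suite)      # [(('EcoRI', 'MseI'), 50.0), (('PstI', 'TaqI'), 60.0), (('BstYI', 'MseI'), 50.0)]
print(uncovered)  # {9}
```

Such a greedy selection does not guarantee the smallest possible suite of pairs, but it mirrors the stepwise procedure described in the text.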

REcomb can also be run in the enhanced cDNA AFLP mode. In this mode, as explained above, for each DNA molecule only the 3'-most restriction fragment, if existing, is retained for the subsequent computational process. In such a case, a DNA molecule is considered as covered by a given set of restriction enzymes if (i) a 3'-terminal restriction fragment exists and (ii) this fragment matches the user-defined length criterion. If the enhanced cDNA AFLP mode is activated, the REcomb procedure also allows one of the restriction enzymes to be itself a mixture of other restriction enzymes (that are meant to be applied simultaneously to a given DNA sample). Each such mixture is assigned a unique name and is treated by REcomb as a super restriction enzyme characterized by a collection of recognition and cutting specificities.
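A mixture treated as a super restriction enzyme can be pictured as cutting wherever any of its member enzymes cuts. The sketch below merges the per-enzyme cut lists accordingly; the function name and the cut positions are hypothetical and serve only as an example.

```python
def super_enzyme_cuts(cuts_by_enzyme, mixture):
    """Cut positions of a mixture: the sorted union of the cuts of its members."""
    return sorted(set().union(*(cuts_by_enzyme[name] for name in mixture)))

# Made-up cut positions on one sequence.
cuts = {"MseI": [17, 240], "TaqI": [90, 240], "EcoRI": [3]}
print(super_enzyme_cuts(cuts, ["MseI", "TaqI"]))  # [17, 90, 240]
```

The merged list can then be used in the coverage computation exactly like the cut list of a single enzyme.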

REcomb is not restricted to the coverage analysis of DNA molecules. RNA sequences can also be subjected to the REcomb procedure by first converting these into cDNA sequences.

Given a specific pair of restriction enzymes and a user-defined pattern of selective nucleotides, the expected AFLP fingerprint is computed and written to an export file, together with detailed sequence information for each of the restriction fragments. This computation of the expected fingerprint is important in two respects. Firstly, it will facilitate the interpretation of an experimentally determined fingerprint. Secondly, it is important for the efficient design of an AFLP experiment. For example, if too many fragments are expected in the fingerprint, or if too many fragments are expected to have about the same size (leading to overlapping fragments in the fingerprint), other AFLP primer combinations can be tested in silico to alleviate this problem. The present invention therefore provides a computational procedure, and an arrangement for carrying out that procedure, which can be used to efficiently design and interpret experiments that address the characterization of heterogeneous mixtures of DNA molecules by fingerprinting fragments derived by a restriction enzyme digestion of a given sample.

We conclude this example by giving a more detailed description of the REcomb program. Given a collection of DNA sequences, REcomb is a tool:

(a) to search for pairs of restriction enzymes (or groups of restriction enzymes) which give optimal sequence coverage in a given AFLP experiment;

(b) to predict the expected AFLP fingerprints for a specific primer-enzyme combination (PEC). REcomb addresses both standard and enhanced AFLP (cDNA imaging) experiments.

Upon entry of REcomb a form is shown which provides buttons with callbacks to each of these two possible modes of operation (coverage analysis and prediction of AFLP fingerprints). We will now discuss each of these two main modes of the program.

Mode 1. Coverage computations

Mode 1. General operation

The coverage option is activated upon clicking, in the main form, a button labeled "Compute Coverage". As a result, a new form is displayed. This form is used to specify the search conditions prior to the actual computation of the sequence coverage. A sequence is said to be covered by a given combination of restriction enzymes (REs) if it yields at least one fragment within a user-defined size range. This size range is set by the up-down dials labeled "Define window of fragment lengths".

The database that is accessed by REcomb can be defined by clicking the button "Load another data base". By mouse-clicking this button, a sequence database can be loaded. This form also has buttons labeled "First RE" and "Second RE", which are used to define pairs of restriction enzymes for the coverage computation. Upon mouse-clicking one of these buttons, a listbox is shown from which one can select an entry. This listbox shows, in addition to the name of each restriction enzyme (known to REcomb from the database of restriction enzymes), its selectivity "n-cut" (e.g. n-cut = 4 for Mse I, n-cut = 4.5 for Ava II, etc.). However, the user can also make a selection in a more general way. The listbox also provides the following entries:

• "any": take all restriction enzymes known to REcomb;

• "n-cutter": take all restriction enzymes having a selectivity of n (an entry in the listbox is shown for each possible value of n-cut for the enzymes in the restriction enzyme database used by REcomb).

For standard AFLP the definitions of "First RE" and "Second RE" may be interchanged without affecting the results. But if the user selects the "Enhanced AFLP" check-box, the meaning of these fields is as defined by the enhanced AFLP protocol. As a reminder of this procedure, an appropriate form is automatically shown upon mouse-clicking the "First RE" or "Second RE" buttons, thereby providing the user with adequate assistance to define the first and second restriction enzymes.

Since in enhanced cDNA AFLP it may sometimes be desirable to use mixtures of restriction enzymes as the 'primary' cutter, REcomb also makes it possible to define such mixtures, provided that the enhanced AFLP mode has been selected. Upon mouse-clicking an appropriately labeled button, a new form is displayed which guides the definition of mixtures of restriction enzymes. These mixtures are considered by REcomb as super-enzymes equipped with the appropriate semantics.

The actual coverage computation is started by mouse-clicking the button labeled "Compute". First, the user is asked, via a dialog box, to specify the name of the output file (which is an Excel-compatible file). Subsequently, the coverage is computed for all RE pairs that are compatible with the specifications previously entered by the user. Finally, the coverage results are ordered (descending coverage) and exported to the Excel output file. The computations can be done interactively even on a large sequence database (such as yeast, which contains about 8.9 Mb in the predicted ORFs) and even using a collection of about 40 restriction enzymes where both the first and second RE selections are set to "any".

A typical example of part of the output is shown in the tables of figures 1 and 2 for, respectively, the "cDNA context" (yeast ORFs) and the "genomic DNA context" (yeast genomic DNA). In both examples the first and second RE selections are set to "any"; the meaning of each of the columns is as described above.

Mode 1. Specific aspects

The following four options are available as buttons equipped with appropriate callback procedures.

(a) Elimination of sequences that are cut by restriction enzyme pairs (RE pairs).

Upon selecting this option, a form is displayed to specify a list of RE pairs. All sequences that give one or more fragments for any of these RE pairs are discarded from the coverage analysis. This option is very useful when the coverage statistics have to be computed in an iterative way. For example, suppose that a previous run of REcomb shows that BstY I / Mse I gives, say, 50% coverage of some collection of sequences. One may ask the question: which pair of REs maximally covers the remainder of the sequences? By first removing all sequences that produce BstY I / Mse I fragments (within a given size range) and subsequently applying the coverage algorithm, the required result can be obtained.

(b) Elimination of pairs of restriction enzymes (RE pairs).

When this option is selected, a form is displayed showing three dialog boxes. The first and second boxes list the available restriction enzymes for REI and REII.

The third box contains the pairs of REs that are discarded from the coverage analysis. REcomb automatically places all RE pairs that have REs with compatible overhangs in this box. By clicking on any of these entries, the corresponding RE pair is removed from the list of rejected RE pairs. Alternatively, new RE pairs can be added to the third listbox by selecting a pair of REs, one from each of the listboxes I and II.

(c) Elimination of sequences by a length criterion.

Upon selecting this option a window is displayed presenting two boxes with edit fields for numeric entry. The first box is used to specify lower and upper bounds on the lengths of the sequences that will be subjected to the coverage analysis. The defaults are set such that no sequences will be eliminated. The second box is used to select a random subset of the set of all sequences. To generate such a subset, a probability threshold has to be entered. For each sequence that satisfies the length criterion (defined in the first box), a random number in the interval 0 to 1 is taken. If this number is smaller than or equal to the one specified in the second box, the sequence will be retained for further analysis. This option is often useful to verify whether certain predictions remain valid if only a subset is known (as is often the case) of all sequences for some species of interest.
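The length filter and the random subset selection described above may be sketched as follows. The function name `filter_sequences` and its parameters are hypothetical; a random number is drawn only for sequences that satisfy the length criterion, as in the description.

```python
import random

def filter_sequences(sequences, min_len, max_len, keep_probability=1.0, rng=None):
    """Keep sequences inside the length bounds, then retain each with a given probability."""
    rng = rng or random.Random()
    kept = []
    for seq in sequences:
        # The random number is drawn only if the length criterion is satisfied.
        if min_len <= len(seq) <= max_len and rng.random() <= keep_probability:
            kept.append(seq)
    return kept

sequences = ["A" * 50, "A" * 500, "A" * 5000]
print(len(filter_sequences(sequences, 100, 1000)))   # 1: only the 500-base sequence remains

# Retain roughly 5% of 1000 sequences that all satisfy the length criterion.
subset = filter_sequences(["ACGT" * 50] * 1000, 100, 1000, keep_probability=0.05,
                          rng=random.Random(1))
print(len(subset))
```

Repeating the coverage analysis on several such subsets gives an impression of how stable a prediction is when only part of the sequences of a species is known.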

(d) Generation of a random sequence data set.

To guide the design of AFLP experiments, it is also possible to compute the coverage on a population of random DNA fragments. By mouse-clicking in the main form the button labeled "Load another data base", and subsequently selecting, from the listbox shown, the item named "RandomDNA", a form is displayed to define the parameters needed to generate the population of random DNA fragments. The user can specify: (i) the number of fragments, (ii) the mean fragment length and (iii) the standard deviation of the fragment length. The lengths of these random fragments follow a normal distribution. The sequence of each fragment is randomly determined and has a mean %GC content as set by the user. This population of random DNA molecules is placed in the "cDNA context". In the subsequent analysis these random DNA molecules are treated in exactly the same way as any other collection of nucleotide sequences in the "cDNA context".
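A population of random DNA molecules with these properties could, for example, be generated as sketched below. The function name and parameters are hypothetical, and two details are assumptions not stated in the text: G and C are drawn with equal probability (as are A and T), and lengths drawn from the normal distribution are rounded and clipped to a minimum of one base.

```python
import random

def random_dna_population(n, mean_len, sd_len, gc_fraction, seed=None):
    """Generate n random sequences with normally distributed lengths and a mean GC content."""
    rng = random.Random(seed)
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    weights = [at, gc, gc, at]                     # weights for A, C, G, T
    population = []
    for _ in range(n):
        length = max(1, round(rng.gauss(mean_len, sd_len)))
        population.append("".join(rng.choices("ACGT", weights=weights, k=length)))
    return population

population = random_dna_population(200, 1000, 200, 0.40, seed=42)
total = sum(len(s) for s in population)
gc_seen = sum(s.count("G") + s.count("C") for s in population) / total
print(len(population), round(total / len(population)), round(gc_seen, 2))
```

The resulting list can be passed to the same coverage computation as any other collection of sequences in the "cDNA context".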

Mode 2. Prediction of AFLP fingerprint

This mode of operation is activated by selecting the button, labeled accordingly, in the top form that is displayed upon activating REcomb. As a result, a new form is displayed. Through this form the user has to specify a number of input parameters before the computation of the expected AFLP fingerprint can be carried out:

(i) the name of the nucleotide sequence database. This database may be of any type ("cDNA context" or "genomic DNA context"). The user may also select a random sequence database. In that case the database is automatically considered to be of the type "cDNA context" and the user is requested to define the random sequence database by exactly the same procedure as described above (Mode 1. Specific aspects; (d) Generation of a random sequence data set).

(ii) the name of the first restriction enzyme (REI). The specification of this name is done in a user-friendly, mouse-controlled way (via a button which, upon clicking, produces a listbox with all restriction enzymes from which the user can select the appropriate entry). It is assumed that in the AFLP PCR process the primer which is specific for REI is labeled (e.g. ³P labeled in the case a radioactive label is used). This information is used to display, in the output of REcomb, the single-stranded nucleotide sequence of the labeled strand.

(iii) the name of the second restriction enzyme (REII). This is done in the same user-friendly way as for REI.

(iv) whether or not the user wants to work in the enhanced cDNA AFLP mode. If the input nucleic acid sequence database is of the type "genomic DNA context", this option is not activated.

(v) the total length of the AFLP adapters. This parameter is used to determine the actual total length of each of the AFLP fragments.

(vi) the window of fragment length, including adapters (the length of which is defined by (v)), within which the AFLP fingerprint has to be computed. The lower and upper limits of this size window are set in a user-friendly, mouse-controlled way.

(vii) the pattern of selective bases following the recognition site for REI. This is done in a user-friendly way. A sequence stretch of nucleotides is shown, highlighting the recognition pattern of REI followed by a number of nucleotide positions (in total 10). Each position can be assigned (via a simple mouse-clicking mechanism) to a specific nucleotide (A,C,G,T) or to a collection of degenerate nucleotides (N,R,Y), where N=[A,C,G,T]; W=[A,T]; Y=[C,T]; R=[A,G].

(viii) the pattern of selective bases following the recognition site for REII. This is done in a user-friendly way. A sequence stretch of nucleotides is shown, highlighting the recognition pattern of REII followed by a number of nucleotide positions (in total 10). Each position can be assigned (via a simple mouse-clicking mechanism) to a specific nucleotide (A,C,G,T) or to a collection of degenerate nucleotides (N,R,Y), where N=[A,C,G,T]; W=[A,T]; Y=[C,T]; R=[A,G].

(ix) the elimination of sequences based on a length criterion. Optionally, the user may remove sequences whose length falls outside a user-defined range. If the user selects this option, this range has to be specified by a simple mouse-driven mechanism.

Once all required input parameters have been specified, the expected AFLP fingerprint is computed and the results are exported to an output file. The computation of the AFLP fingerprint follows as described in step 7 of the "detailed description of the used method". Figure 3 shows part of the export file after a REI=BstY I, REII=Mse I digest on the yeast ORFs was made. The size window was set to 100-600 nucleotides; the total adapter length was set to 20; the enhanced cDNA option was not selected; the pattern of selective bases was set to TAC for REI and CC for REII. Note that in the pattern TAC for REI=BstY I the first nucleotide, T, specifies the degenerate Y in the BstY I recognition site. The columns shown in figure 3 have the following meaning:

(1) Nr: ordinal fragment number.

(2) Length: length in base pairs of the AFLP fragment (including adapters).

(3) %AT: percentage A+T in AFLP fragment.

(4) %CG: percentage C+G in AFLP fragment.

(5) NrSameL: number of fragments in the fingerprint that have exactly the same length.

(6) RecogSeqI: 5' end of the labeled fragment showing the sequence recognized by REI.

(7) MiniSeqI: small stretch of nucleotides succeeding the REI recognition sequence. This information is valuable for the user because it allows easy identification of additional selective bases that are specific for the given fragment.

(8) MiniSeqII: small stretch of nucleotides preceding the REII recognition sequence. This information is valuable for the user because it allows easy identification of additional selective bases that are specific for the given fragment.

(9) RecogSeqII: sequence pattern recognized by REII.
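The degenerate selective-base codes of items (vii) and (viii) and the first numeric columns of the export table can be illustrated with a short sketch. It is not REcomb's implementation: the function names, the dictionary layout, the choice to compute %AT and %CG on the fragment without adapters, and the convention that NrSameL counts the fragment itself are assumptions made for illustration. Sequences are expected in upper case.

```python
from collections import Counter

# Nucleotide codes as defined in the text: N, W, Y and R are degenerate.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "W": "AT", "Y": "CT", "R": "AG"}


def matches_selective_bases(bases, pattern):
    """True if the bases next to a recognition site satisfy the pattern."""
    if len(bases) < len(pattern):
        return False
    return all(base in CODES[code] for base, code in zip(bases, pattern))


def table_rows(fragments, adapter_length):
    """Compute Nr, Length, %AT, %CG and NrSameL for non-empty fragments."""
    lengths = [len(fragment) + adapter_length for fragment in fragments]
    same_length = Counter(lengths)
    rows = []
    for nr, (fragment, length) in enumerate(zip(fragments, lengths), start=1):
        at = fragment.count("A") + fragment.count("T")
        cg = fragment.count("C") + fragment.count("G")
        rows.append({
            "Nr": nr,
            "Length": length,                   # includes the adapters
            "%AT": 100.0 * at / len(fragment),  # assumption: adapters excluded
            "%CG": 100.0 * cg / len(fragment),  # assumption: adapters excluded
            "NrSameL": same_length[length],     # assumption: counts itself
        })
    return rows
```

A call such as `matches_selective_bases(bases, "TAC")` would then decide whether a fragment survives selective amplification with the REI pattern used for figure 3.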

In addition, the output file shows a number of further columns for each fragment. These are not shown in figure 3, since some of them may be very long and therefore cannot be rendered in a simple table form that is easily printable on paper. These additional columns are:

(10) Idtag: identification code, if available, of the full nucleotide sequence (from which the AFLP fragment is derived) in the sequence database.

(11) FctDescr: annotation description, if available, of the full nucleotide sequence (from which the AFLP fragment is derived) in the sequence database.

(12) NrFragWithSameID: number of fragments in the fingerprint derived from the same sequence. This column provides redundancy information.

(13) NrFragWithSameFct: number of fragments derived from sequences having the same functional descriptor as noted for the current fragment. This column is useful for subsequent data-mining questions in which fragments that share a common functional property are searched for in the output file.

(14) FullSequence: full nucleotide sequence of the labeled DNA strand of the AFLP fragment. This column is useful because it can be used in further bioinformatics handling processes, e.g. BLAST searches of the sequences shown in this column against public or private DNA sequence databases.

Example 2. Use of the program REcomb to assist a DNA sequencing project.

Example 2. The problem.

A number N of differentially expressed fragments have been identified using the standard cDNA AFLP procedure. These N fragments are derived from a number S of different DNA sequences. It is also given that, in view of the redundancy in the standard AFLP procedure, S < N (in other words, the number of fragments is larger than the number of different sequences). To identify these S sequences, one decides to start a sequencing project on the collection of N fragments. First, the fragments are ordered in descending length order and the sequencing work progresses along this order. The question is: what is the percentage P_S of detected sequences if a given percentage P_N of the fragments has been sequenced?

Example 2. Answer.

To answer this question, we assume that the N differentially expressed fragments are collected more or less homogeneously from the different primer-enzyme combinations that have been used in the AFLP experiment. A simulation experiment was carried out in the program REcomb to get an estimate of the relation between P_S and P_N. 10% of the yeast ORF sequences were randomly picked and subjected to standard AFLP using BstY I / Mse I. A small fraction (10%) was taken to mimic the fact that only a fraction of the sequences will show up as differentially expressed. We noted in subsequent experiments that, as expected, the results are not affected by taking a larger or smaller fraction of sequences. REcomb was working in the mode where it predicts the AFLP fingerprint. No selective bases were specified because we assume that the fragments that will be sequenced are derived homogeneously from the sequence space spanned by, e.g., the various selective bases used in the AFLP experiments. From the collection of fragments computed by REcomb and the collection of input sequences, the relation P_S versus P_N was determined as shown in figure 4. This plot was made using the program Excel operating on the output file produced by REcomb. The ordinate shows the percentage of sequences that are detected as a function of the percentage, shown on the abscissa, of AFLP fragments that have been sequenced. This plot can be used to estimate the sequencing effort that is required to hit a required percentage of the detectable sequences. About 85% of the sequences will have been detected by sequencing 50% of the fragments. As can be seen from this figure, the sequencing effort rapidly becomes more intensive if a higher percentage of the sequences needs to be detected.
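The relation between the percentage of fragments sequenced and the percentage of sequences detected can be derived from a list of predicted fragments along the following lines. The text states that the plot was made in Excel from the REcomb output file, so this Python sketch is only an equivalent illustration; the input layout (pairs of fragment length and source sequence identifier) and the function name are assumptions.

```python
def detection_curve(fragments):
    """Return (percent fragments sequenced, percent sequences detected) pairs.

    fragments is a list of (length, sequence_id) tuples. Fragments are
    sequenced in descending length order. The percentage of detected
    sequences is taken relative to the detectable sequences, i.e. those
    that yield at least one fragment.
    """
    ordered = sorted(fragments, key=lambda item: item[0], reverse=True)
    n_fragments = len(ordered)
    n_sequences = len({sequence_id for _, sequence_id in ordered})
    detected = set()
    curve = []
    for count, (_, sequence_id) in enumerate(ordered, start=1):
        detected.add(sequence_id)
        curve.append((100.0 * count / n_fragments,
                      100.0 * len(detected) / n_sequences))
    return curve
```

Reading the curve at a chosen percentage of detected sequences gives the corresponding sequencing effort as a percentage of fragments.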

Example 3. Use of the program REcomb in case only partial cDNA sequence information is available.

Example 3. The problem.

Often, the user has only a very limited number of cDNA-type sequences from a given species (maize, tomato, etc.) with which to carry out the coverage analysis. Hence the following question: is it possible to make reliable predictions if only a small fraction of the sequences is available?

Example 3. Answer.

To answer this question, REcomb was applied to a varying percentage P of the yeast ORFs. The values of P studied were 100, 50, 40, 30, 20, 10, and 5. At each value of P, sequences were randomly picked and the coverage percentages were computed in the enhanced AFLP mode. The frequent cutter was fixed at REII=Mse I and the rare cutter REI was set to 'any'. At each value of P, a REcomb simulation was carried out and the coverage results were saved in an output file. Subsequently, all seven output files, one for each value of P, were further processed using the Excel program, producing figure 5. This figure shows a table composed of 7 column-blocks. Each column-block contains the results for the percentage P shown at the top of the column-block. From left to right these column-blocks are labeled 100%, 50%, 40%, 30%, 20%, 10% and 5%, respectively. Each column-block contains two columns: the first denotes the name of REI and the second the corresponding percentage coverage obtained by using REcomb. It is seen that the coverage percentages are rather well conserved at varying P. The topmost 4 restriction enzymes (REs), seen at P=100, are also among the best predicted at lower P. This also applies at very low P (at P=5, only about 322 sequences were randomly taken from the ensemble of yeast ORFs). The top REs predicted at P=5 are among the top scores at P=100. In addition, the coverage percentages for these top scores are all about the same. In conclusion, REcomb can successfully be used to search for restriction enzyme combinations giving high coverage also when applied to partial sequence information in the "cDNA context".
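The comparison made in figure 5, whether the best enzymes found on the full data set are still among the best on a subset, amounts to a simple check on two coverage tables. The sketch below is an illustration under assumptions: the dictionary layout (enzyme name mapped to coverage percentage) and the function name are hypothetical, and no coverage values from figure 5 are reproduced.

```python
def shared_top_enzymes(full_coverage, subset_coverage, k=4):
    """Return the top-k enzymes of the full data set that are also top-k in the subset.

    Both arguments map an enzyme name to its coverage percentage.
    """
    def top(coverage):
        ranked = sorted(coverage.items(), key=lambda item: item[1], reverse=True)
        return [name for name, _ in ranked[:k]]

    top_subset = set(top(subset_coverage))
    return [name for name in top(full_coverage) if name in top_subset]
```

If the returned list has k entries, the ranking of the best enzymes is fully conserved for that subset.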

Example 4. Use of the program REcomb in case only partial genomic DNA sequence information is available.

Example 4. The problem.

Often, the user has only a very limited amount of genomic-type DNA sequence from a given species (maize, tomato, etc.) with which to carry out the coverage analysis. Hence the following question: is it possible to make reliable predictions if only a small fraction of the genome sequence is available?

Example 4. Answer.

To answer this question, REcomb was used to search for restriction enzyme combinations that yield high coverage. Both the first (REI) and second (REII) restriction enzymes were set to "any", hence all possible restriction enzyme pairs were searched. In one experiment, referred to as experiment A, REcomb was applied to the whole of the known DNA sequences for all yeast chromosomes. This implies a total of 12,146,943 nucleotides. In another experiment, referred to as experiment B, REcomb was applied to yeast chromosome I. Experiment B implies a total of 280,209 nucleotides. Hence, in this second experiment REcomb is operating on only 2.3% of the known DNA sequences for all yeast chromosomes. The highest coverage in experiment A is seen for the restriction enzyme pair Hinf I/Mse I, yielding a coverage of 42%. This combination also yields an excellent coverage, namely 39%, in experiment B, where only chromosome I is considered. Conversely, the highest coverage in experiment B is seen for Nla III/Mse I, yielding 44% coverage. Interestingly, this pair gives a 42% coverage in experiment A, equally high as found for the Hinf I/Mse I combination. In conclusion, REcomb can successfully be used to search for restriction enzyme combinations giving high coverage also when applied to partial sequence information in the "genomic DNA context".

Claims

1. A method for determining one or more restriction enzymes for analyzing a nucleic acid sequence of a sample belonging to a species, comprising the steps of: a) reading first data relating to a plurality of restriction enzymes specifying at least recognition sequence and cutting pattern per restriction enzyme from a first database; b) reading second data relating to at least a representative number of nucleic acid sequences of the species from a second database; c) determining one or more restriction enzymes that, if applied to the nucleic acid sequences, would result in restriction fragments having a size within a user defined window; d) presenting said one or more restriction enzymes to a user.

2. A method according to claim 1, wherein, in step c), one or more combinations of two or more of the restriction enzymes are determined.

3. A method according to claim 2, wherein, after step a), restriction enzyme combinations are identified such that adapters can ligate only to one type of DNA end, as produced by restriction enzymes of said restriction enzyme combination if applied to said nucleic acid sequence, in order to prevent adapter scrambling in a ligation process, and using in step c) only restriction enzyme combinations so identified.

4. A method according to claims 1, 2 or 3, wherein, after step a), for each restriction enzyme a longest contiguous part of its recognition sequence not containing any degenerate position is determined, and said longest contiguous part per restriction enzyme is stored in memory (5-11).

5. A method according to any of the preceding claims, wherein, in step b), a population of random DNA fragments is provided in said second database.

6. A method according to any of the preceding claims, wherein, after step b, data relating to said nucleic acid sequences is chopped in segments of a predetermined size, said segments are concurrently written into memory (5-11) and, each time such a segment is written into said memory (5-11), step c is carried out on such a segment and its preceding segment.

7. A method according to any of the preceding claims, wherein, coverage of restriction fragments relative to total nucleic acid content in said sample is computed, coverage being defined as ratio of a first quantity computed on said restriction fragments to a second quantity computed on total number of nucleic acid sequences in said sample.

8. A method according to claim 7, wherein said first and second quantities, respectively, used to compute said coverage are number of said restriction fragments and total number of nucleic acid sequences in said sample, respectively.

9. A method according to claim 7, wherein said first and second quantities, respectively, used to compute said coverage are total number of nucleotides contained in said restriction fragments and total number of nucleotides in said nucleic acid sequences of said sample, respectively.

10. A method according to any of the preceding claims, wherein, in step b), said representative number is at least 2 %, preferably at least 5 %, more preferably at least 10% of all nucleic acid sequences of said species.

11. A method according to any of the preceding claims, wherein said second data relates to at least one of the following set of features of said nucleic acid sequence to be analyzed: nucleotide sequence; percentage GC; percentage AT; number of open reading frames; introns/exons; known genes; known regulatory sequences.

12. A method according to any of the preceding claims, wherein the second data is either obtained by analyzing said nucleic acid sequence of said sample as to its nucleotide sequence structure or from a similar species.

13. A method according to any of the preceding claims, wherein either a combination of a frequent cutter and a rare cutter or a combination of two frequent cutters is determined.

14. A method according to any of the preceding claims, wherein at least one of the following output data is calculated: sizes of restriction fragments that would result if said one or more restriction enzymes would be applied to said nucleic acid sequence; number of times said nucleic acid sequence would be restricted if said one or more restriction enzymes would be applied to said nucleic acid sequence; a redundancy calculation, rendering, for said user defined window, an expected number of restriction fragments expected per restriction nucleic acid sequence if said one or more restriction enzymes would be applied to said nucleic acid sequence.

15. A method according to any of the preceding claims, comprising the further step of analyzing said nucleic acid sequence of said sample using one or more of said restriction enzymes determined in step c).

16. A method according to claim 15, wherein the further step comprises the following sub-steps: s1) restricting at least part of said sample with said one or more of said restriction enzymes to render restriction fragments; s2) amplifying said restriction fragments to render amplified restriction fragments; s3) analyzing said amplified restriction fragments by any one of the following set of analyzing techniques:

- genotyping;

- DNA fingerprinting;

- transcript imaging using AFLP;

- fingerprinting of messenger RNA.

17. A method according to any of the preceding claims, wherein, in step c), only those combinations of two or more of said restriction enzymes are determined that, if applied to said nucleic acid sequences, would result in 3'-most restriction fragments for each DNA molecule of said sequences.

18. An arrangement for determining one or more restriction enzymes for analyzing a nucleic acid sequence of a sample belonging to a species, comprising processor means (1), memory means (5-11) connected to said processor means (1) for storing data, input means (13, 15, 17, 25) connected to said processor means for inputting data and instructions to said processor means (1), the arrangement being programmed to carry out the steps of: a) reading first data relating to a plurality of restriction enzymes specifying at least recognition sequence and cutting pattern per restriction enzyme from a first database; b) reading second data relating to at least a representative number of nucleic acid sequences of said species from a second database; c) determining one or more of said restriction enzymes that, if applied to said nucleic acid sequences, would result in restriction fragments having a size within a user defined window; d) presenting said one or more restriction enzymes to a user.

19. An arrangement according to claim 18, arranged to determine, in step c), one or more combinations of two or more of the restriction enzymes.

20. An arrangement according to claim 19, arranged to identify, after step a), restriction enzyme combinations such that adapters can ligate only to one type of DNA end, as produced by restriction enzymes of said restriction enzyme combination if applied to said nucleic acid sequence, in order to prevent adapter scrambling in a ligation process, and using in step c) only restriction enzyme combinations so identified.

21. An arrangement according to claim 18, 19 or 20, arranged to determine, after step a), for each restriction enzyme a longest contiguous part of its recognition sequence not containing any degenerate position, and to store said longest contiguous part per restriction enzyme in memory (5-11).

22. An arrangement according to any of the claims 18 through 21, arranged to read, in step b), a population of random DNA fragments from said second database.

23. An arrangement according to any of the claims 18 through 22, arranged to chop, after step b, data relating to said nucleic acid sequences in segments of a predetermined size, to write said segments concurrently into memory (5-11) and, each time such a segment is written into said memory (5-11), to carry out step c) on such a segment and its preceding segment.

24. An arrangement according to any of the claims 18 through 23, arranged to compute coverage of restriction fragments relative to total nucleic acid content in said sample, coverage being defined as ratio of a first quantity computed on said restriction fragments to a second quantity computed on total number of nucleic acid sequences in said sample.

25. An arrangement according to any of the claims 18 through 24, wherein, in step b), said representative number is at least 2 %, preferably at least 5 %, more preferably at least 10% of all nucleic acid sequences of said species.

26. An arrangement according to any of the claims 18 through 25, wherein said second data relates to at least one of the following set of features of said nucleic acid sequence to be analyzed: nucleotide sequence; percentage GC; percentage AT; number of open reading frames; introns/exons; known genes; known regulatory sequences.

27. An arrangement according to any of the claims 18 through 26, wherein the second data is either obtained by analyzing said nucleic acid sequence of said sample as to its nucleotide sequence structure or from a similar species.

28. An arrangement according to any of the claims 18 through 27, wherein either a combination of a frequent cutter and a rare cutter or a combination of two frequent cutters is determined.

29. An arrangement according to any of the claims 18 through 28, arranged to calculate at least one of the following output data: sizes of restriction fragments that would result if said one or more restriction enzymes would be applied to said nucleic acid sequence; number of times said nucleic acid sequence would be restricted if said one or more restriction enzymes would be applied to said nucleic acid sequence; a redundancy calculation, rendering, for said user defined window, an expected number of restriction fragments expected per nucleic acid sequence if said one or more restriction enzymes would be applied to said nucleic acid sequence.

30. An arrangement according to any of the claims 18 through 29, comprising analyzing means to analyze said nucleic acid sequence of said sample using one or more of said restriction enzymes determined in step c).

31. An arrangement according to claim 30, wherein the analyzing means comprises means for carrying out the following sub-steps: s1) restricting at least part of said sample with said one or more of said restriction enzymes to render restriction fragments; s2) amplifying said restriction fragments to render amplified restriction fragments; s3) analyzing said amplified restriction fragments by any one of the following set of analyzing techniques:

- genotyping;

- DNA fingerprinting;

- transcript imaging using AFLP;

- fingerprinting of messenger RNA.

32. An arrangement according to any of the claims 18 through 31, arranged to determine, in step c), only those combinations of two or more of said restriction enzymes that, if applied to said nucleic acid sequences, would result in 3'-most restriction fragments for each DNA molecule of said sequences.

33. A computer program product to be loaded by a computer arrangement and comprising data and instructions for determining one or more restriction enzymes for analyzing a nucleic acid sequence of a sample belonging to a species, in accordance with the steps of: a) reading first data relating to a plurality of restriction enzymes specifying at least recognition sequence and cutting pattern per restriction enzyme from a first database; b) reading second data relating to at least a representative number of nucleic acid sequences of said species from a second database; c) determining one or more of said restriction enzymes that, if applied to said nucleic acid sequences, would result in restriction fragments having a size within a user defined window; d) presenting said one or more restriction enzymes to a user.

34. A computer readable data carrier provided with a computer program product according to claim 33.

35. A method of analyzing a cDNA sample, comprising the following steps:

(a) digesting said cDNA sample with a first restriction enzyme,

(b) capturing or isolating 3'-terminal cDNA restriction fragments and, preferably, removing all 5' restriction fragments,

(c) digesting said 3'-terminal cDNA restriction fragments with a second restriction enzyme,

(d) eliminating the 3'-terminal cDNA restriction fragments and

(e) subjecting at least those restriction fragments flanked by restriction site of the first restriction enzyme and flanked by restriction site of the second restriction enzyme of the remaining restriction fragments to an AFLP method.