WO2003065247A2 - Analysis of biochemical sequence data

Info

Publication number: WO2003065247A2
Application number: PCT/EP2003/001031
Authority: WO
Inventors: Wim Van Criekinge
Original assignee: Devgen Nv
Priority date: 2002-02-01
Filing date: 2003-01-31
Publication date: 2003-08-07
Also published as: WO2003065247A3; AU2003206820A1

Abstract

Methods and apparatus for analysing biochemical sequence data are disclosed. An iterative process is used in which result sequences from an alignment exercise are used as query sequences for subsequent alignments. A query sequence may be from a different organism to a dataset used for alignment, and the organism of the alignment dataset may vary or swap over between iterations. Apparatus for carrying out the methods may include a controlling head node, a number of parallel alignment nodes and a storage area network.

Description

ANALYSIS OF BIOCHEMICAL SEQUENCE DATA

The present invention relates to methods and apparatus for analysing biochemical sequence data, and in particular, but not exclusively to methods and apparatus for carrying out a recursive alignment exercise on polynucleotide or polypeptide data, to find result sequences related to a query sequence.

Biochemical sequence data may represent a variety of naturally occurring and synthetic chemical structures, including polynucleotides and polypeptides. The term polynucleotides includes different types and occurrences of DNA, including nuclear DNA and cDNA, as well as RNA types such as ribosomal and messenger RNA. The term polypeptides as used herein includes polypeptides and fragments thereof, proteins and fragments thereof, oligopeptides and other amino acid sequences. Biological sequence data may be relatively unprocessed, such as raw sequencing data (e.g. from the human genome project or other sequencing projects such as the C. elegans, Drosophila or mouse sequencing projects), or held in collections or databases of expressed sequence tags containing uncharacterised sequence fragments, each fragment containing errors and unknown residues.

Equally, it may be carefully processed and reviewed, such as in the case of consensus genome or protein data with associated annotations, chromosome location information and so on. In general, the biological sequence data used in the invention may be publicly available (e.g. as part of publicly available databases, such as those referred to hereinabove and below) and/or may be obtained by cloning/sequencing efforts carried out in a suitable manner known per se.

Over the past three decades a variety of techniques and tools have become available to assist in the task of analysing and making use of the ever increasing quantity of biochemical sequence data available both commercially and in the public domain. A particular task of such analysis is the searching of a database of sequence data for target sequences which are similar to, or align with, a query sequence to a given level of similarity. For example, a researcher may wish to search the genome data of a mouse for target DNA sequences which correspond closely to a query human DNA sequence which encodes a protein of pharmaceutical significance. Early endeavours to provide useful tools in this field resulted in the global alignment algorithm of Needleman and Wunsch (1970), Journal of Molecular Biology, 48, 443 - 453, and the local alignment algorithm of Smith and Waterman (1981), Journal of Molecular Biology, 147, 195 - 197. More recent tools for carrying out sequence alignment tend to make use of heuristics to speed the search for significant alignments within vast databases. The BLAST algorithm described by Altschul et al. (1990), Journal of Molecular Biology, 215, 403 - 410 searches for significant matching subsequences, termed segment pairs. To establish complete alignments the segment pairs are extended until certain threshold parameters are achieved. More recently available versions of the BLAST software and similar tools are better able to take account of gaps and missing segments in both the query sequence and target sequences in the database, for example see Altschul et al. (1997), Nucleic Acids Research, 25 (17), 3389 - 3402.

The sensitivity of sequence alignment methods can be adjusted by setting various parameters and thresholds. Penalties are generally applied in allowing the substitution of DNA residues or amino acids, the introduction and extension of gaps and the deletion of residues in target sequences, and these penalties may be included in a final measure of similarity between a query sequence and a target sequence. A final measure of similarity between a query sequence and a target sequence may typically be represented as a probability indicative of the likelihood that such an alignment could have arisen by chance, or as an expectation value indicative of the number of such alignments one would expect to have arisen by chance given the sizes and natures of the query sequence and target database. In both cases, a low probability or a low expectation value is indicative of an alignment having biological significance.
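The relationship between the two measures can be illustrated with a short sketch. It assumes, as in the published BLAST statistics, that the number of chance alignments follows a Poisson distribution, so that the probability P of at least one chance alignment is 1 - exp(-E) for an expectation value E. The function name is hypothetical and is not taken from any alignment tool.

```python
import math


def p_value_from_expectation(expectation: float) -> float:
    """Probability of at least one chance alignment, given the expected count.

    Assumes a Poisson model for the number of chance alignments.
    """
    if expectation < 0:
        raise ValueError("an expectation value cannot be negative")
    # -expm1(-E) equals 1 - exp(-E) but keeps precision for very small E.
    return -math.expm1(-expectation)
```

For small values the two measures are practically identical, which is why either may serve as the cut-off in the methods described here.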

By raising or lowering the threshold of the acceptable level of similarity, more or fewer "hits", that is sequences in the target database which have an acceptable similarity to the query sequence, are chosen. Therefore, to increase the sensitivity of a search and increase the number of alignment hits one may simply adjust the similarity threshold. Unfortunately, this generally has the effect of rapidly selecting a large number of spurious or irrelevant hits, without finding many new sequences of biological relevance in the database. Adjusting parameters relating to residue substitutions and sequence gaps may assist in increasing the proportion of biologically relevant hits, but requires considerable skill, judgement and expert time, and may increase further the already heavy computational requirements of the alignment exercise.

The present invention seeks to address problems and disadvantages of the related prior art. Accordingly, the invention provides a method of analysing a biochemical sequence database comprising the steps of:

(a) providing an initial query sequence;

(b) carrying out an alignment of the query sequence against the database to establish result sequences which resemble the query sequence according to a measure of similarity; and

(c) if any result sequences are established and unless a stop condition is met, automatically repeating steps (b) and (c) using each of one or more of said result sequences as a query sequence.
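Steps (a) to (c) can be sketched as a simple loop. This is a minimal illustration only: the align callable, its return type (a mapping from result identifier to expectation value) and all names are assumptions made for the example, and the stop condition used here is a maximum number of iteration levels.

```python
from typing import Callable, Dict, List, Set

# Hypothetical signature: query identifier -> {result identifier: expectation value}.
AlignFn = Callable[[str], Dict[str, float]]


def recursive_alignment(initial_query: str, align: AlignFn,
                        cutoff: float, max_levels: int) -> Set[str]:
    """Repeat step (b) on result sequences until none are new or max_levels is reached."""
    used_as_query: Set[str] = set()
    results: Set[str] = set()
    query_list: List[str] = [initial_query]   # step (a)
    level = 0
    while query_list and level < max_levels:  # stop condition of step (c)
        used_as_query.update(query_list)
        next_queries: List[str] = []
        for query in query_list:              # step (b)
            for hit, expectation in align(query).items():
                if expectation > cutoff:
                    continue                  # not similar enough
                results.add(hit)
                if hit not in used_as_query and hit not in next_queries:
                    next_queries.append(hit)
        query_list = next_queries             # step (c): results become queries
        level += 1
    return results
```

In practice the align callable would wrap a call to an external alignment tool; here it can be any function, which makes the loop easy to exercise on toy data.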

The invention also relates to a method for analysing an initial query sequence using a biochemical sequence database, which method comprises steps (a) to (c) above. This method can also be used for comparing the initial query sequence with the database, for example to determine if the database contains (a) similar sequence(s) (e.g. homologous or orthologous sequences).

As further described below, in the methods of the invention, the query sequence may also be part of a set of two or more query sequences. Usually, these query sequences will in some way be related - again as further described below - although the invention in its broadest sense is not limited thereto.

Thus, the invention also relates to a method for analysing a biochemical sequence database using a set of two or more sequences, which method comprises steps (a) to (c) above, in which - in step (a) - each sequence from the set of sequences is used as a query sequence; and in which steps (b) and (c) are carried out for/using each sequence from said set of sequences as a query sequence. The invention also relates to a method for analysing a set of sequences, using a biochemical sequence database, which method comprises steps (a) to (c) above, in which - in step (a) - each sequence from the set of sequences is used as a query sequence; and in which steps (b) and (c) are carried out for/using each sequence from said set of sequences as a query sequence.

Also, the invention relates to a method for comparing two or more sets of sequences/sequence data, which method comprises steps (a) to (c) above, in which - in step (a) - each sequence from at least one set of sequence data (the first set) is used as the initial query sequence; in which - in step (b) - (the sequences of) at least one other set of sequence data is used as the database; and in which steps (b) and (c) are carried out for/using each sequence from said first set as a query sequence. As further described below, the latter embodiment may in particular be used for comparing two databases of sequences/sequence data, such as databases containing part(s) and/or the whole of the genome of a species.

The biochemical sequence database is a collection or any part of a collection of biochemical sequence data of any suitable form, and preferably contains polynucleotide sequence data, polypeptide sequence data, or both. The sequence data may be fragmented, contain errors and ambiguities, and may be organised in a variety of ways as will be familiar to the person skilled in the art.

The initial and each subsequent query sequence is a biochemical sequence of a data type and form which can be matched against sequences in the database, each match being quantifiable by a measure of similarity. Generally, such a measure of similarity may be based on the degree of sequence identity (e.g. based on degree of homology and the length of the sequence) between the target sequence and (part of) the database, as is well known in the art. Generally, algorithms for sequence comparison will allow one or more relevant parameters to be (pre-)set. For example, when a version of BLAST is used in the invention, the measure of similarity may be (pre-)set as an expectation value or p-value, for which reference is again made to the article by Altschul et al. cited above, as well as the general manuals for such BLAST algorithms.

In general, for each (set of) query sequence(s) and each database, a suitable p-value may be established by a limited degree of trial-and-error, in order to establish the setting(s) that provide a suitable or desired number of result sequences. In the embodiments described below, the p-value is typically set at a value of less than 10^-3, and preferably less than 10^-10, and may for example be from a value of 10^-30 up to and including a value of 0.

It is also within the scope of the invention to (pre-set the algorithm used to) change the measure of similarity (e.g. the p-value) from iteration to iteration, for example to increase or decrease (and preferably decrease) the p-value with/for each subsequent iteration.
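One way of changing the measure of similarity from iteration to iteration is a geometric schedule, sketched below. The schedule, the factor parameter and the function name are assumptions chosen for illustration; the text does not prescribe any particular rule. The sketch only covers the preferred case in which the cut-off decreases (or stays constant) with each iteration.

```python
def cutoff_for_iteration(initial_cutoff: float, factor: float, level: int) -> float:
    """Return the p-value cut-off for a given iteration level (level 0 = initial query)."""
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in (0, 1] so that the cut-off never loosens")
    if level < 0:
        raise ValueError("level must be non-negative")
    # Each further iteration multiplies the cut-off by the same factor.
    return initial_cutoff * factor ** level
```

With an initial cut-off of 10^-3 and a factor of 0.1, for example, the cut-off applied at iteration 2 would be about 10^-5, so later iterations accept only closer matches.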

The biochemical sequence data (i.e. the query sequence) may be a single sequence or (may be part of) a set of two or more related sequences, for example two or more sequences that have been derived from the same species of plant, animal or micro-organism. Embodiments of the invention may be used to compare a sequence or a set of sequences from a first species (with each sequence being used as a query sequence) to a set of sequences from a different, second species (with said set of sequences being used as the database).

For example, such embodiments may be used to compare part(s) or essentially all of the genome - for example in the form of a set of genomic sequences, cDNA sequences, mRNA sequences and/or raw sequencing data; and being used as the query sequence(s) - from a first species with part(s) or essentially all of the genome of a second species - again for example in the form of a set of genomic sequences, cDNA sequences, mRNA sequences and/or raw sequencing data; and being used as the database - thus generating a set of result sequences that represents a genome/genomic comparison between the two species. To further improve such a comparison between such sets of sequences - and in particular to improve comparisons of (part(s) of) genomes - such an analysis may be repeated using sequence(s) from the second set as query sequences and using the first set of sequences as the database. Thereafter, the result sequences generated from each of these analyses may be compared, to determine which sequences are obtained as a result from both analyses.
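The comparison of the two analyses can be sketched as a set intersection. The example below is one possible reading, in which a pairing counts as confirmed when it was found in both directions; the data layout (a mapping from each query identifier to the set of result identifiers it produced) and the function name are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def reciprocal_pairs(first_vs_second: Dict[str, Set[str]],
                     second_vs_first: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Return (first_id, second_id) pairs that were found in both analyses."""
    # Pairs found when the first set supplied the queries.
    forward = {(a, b) for a, hits in first_vs_second.items() for b in hits}
    # Pairs found when the second set supplied the queries, re-ordered to match.
    backward = {(a, b) for b, hits in second_vs_first.items() for a in hits}
    return forward & backward
```

Pairs present in only one direction remain available in the forward and backward analyses and may still be reported separately if desired.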

Thus, the invention may for example be used to compare a sequence or set of sequences from a vertebrate organism (such as a fish, for example a zebrafish; a bird; and/or a mammal, for example a mouse, rabbit, rat, monkey or human) with a database of sequences from an invertebrate organism (such as an insect, for example Drosophila melanogaster; or a nematode such as C. elegans), or vice versa; to compare a sequence or set of sequences from a nematode (such as C. elegans) to a database of sequences from an insect (such as Drosophila or an insect pest such as the Tobacco Budworm Heliothis virescens or a species of aphid), or vice versa; to compare a sequence or set of sequences from one insect (such as Drosophila) to a database of sequences from another insect (such as an insect pest), or vice versa; and/or to compare a sequence or set of sequences from a first species of vertebrate animal (for example a mouse) to a database from a second species of vertebrate (for example a human), or vice versa.

Suitable databases are generally available (for example from EMBL, GenBank or the NCBI) and include, but are not limited to, WormBase, WormPD, FlyPep, GenPep, WormPep, and the well-known databases containing sequence data from the human genome.

Other possibilities will be clear to the skilled person and may for example include sequences, sets of sequences and/or databases that have been derived from plants and/or micro-organisms (for example yeast, fungi, bacteria, algae, viruses and/or (other) unicellular organisms).

Also, in embodiments of the invention, and before carrying out steps (b) and (c), the initial query sequence may be converted to a type of sequence that is present in, or at least compatible with, the database used. For example, when the query sequence is a DNA- or RNA-sequence, it may first be converted - on the basis of well-known codon usage - to a protein sequence or set of protein sequences, which may then be compared to a database of protein sequences. Computer algorithms for carrying out such conversions are well known in the art, and may for example be incorporated in the alignment algorithm used.

Starting with an alignment of an initial query sequence against a part or the whole of the database, each subsequent level of iteration of steps (b) and (c) involves alignments against a part or the whole of the database of all or some of the result sequences from the previous level of iteration. Using the method, sequences in the database having a likely biological relationship with the initial query sequence can be located more easily than in prior art methods, while reducing the proportion of biologically irrelevant result sequences, for a given measure/threshold of similarity.
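The conversion of a DNA or RNA query into a set of protein sequences, mentioned above, can be sketched as follows. The example uses the standard genetic code and translates the three forward reading frames; the function names, the use of 'X' for codons containing unknown residues and the restriction to forward frames are choices made for this illustration only.

```python
from typing import List

BASES = "TCAG"
# Standard genetic code, ordered by the BASES alphabet (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(sequence: str) -> str:
    """Translate one reading frame; unknown codons become 'X', stops become '*'."""
    dna = sequence.upper().replace("U", "T")  # accept RNA input as well
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def forward_frames(sequence: str) -> List[str]:
    """Return the translations of the three forward reading frames."""
    return [translate(sequence[offset:]) for offset in range(3)]
```

Each translated frame could then be used as a query against a protein database. Many alignment tools perform an equivalent translation internally, in which case no separate conversion step is needed.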

The database may contain target sequence data relating substantially or predominantly to only one organism type, such as species or other suitable grouping of biochemical sequence data. The biochemical sequence data may be compared to a database from one organism type, or to two or more databases from two or more different organism types. Alternatively, the database may contain substantial amounts of sequence data relating to two or more different organism types. Applying the method to such plural databases and/or plural organism type data has a significant beneficial effect in further improving the efficiency and scope of selection of biologically relevant result sequences. This is because relevant sequences not found in searching sequence data relating to a first organism type may be found indirectly by identifying relevant sequences in a second organism type and using these sequences as query sequences in alignments against the first organism type.

The invention may also be used to establish a set of result sequences that are shared between two or more sets of biological sequence data (e.g. databases) - i.e. above a certain measure/threshold of similarity - or conversely to establish a set of result sequences that are unique to each set (i.e. that do not have equivalents in the other set(s)). As will be clear to the skilled person, this may be particularly useful when sequence data from two or more different organisms is to be compared (and in particular for genome comparisons), i.e. to establish sequences (i.e. genes) that are conserved between these species, or alternatively sequences (i.e. genes) that are unique to each species.

Advantageously, therefore, each of some or all of the alignments of steps (b) and (c) may be carried out on database sequence data relating to two or more organism types. Alternatively, each of some or all of the alignments of steps (b) and (c) may be carried out on database sequence data relating, in each case, to only a single organism type. In this alternative, it is advantageous for some or all alignments to be carried out on database sequence data relating to a different organism type to that of the query sequence in each case. Various mixes of each of these different arrangements may be used to advantage to improve the results of the analysis.

It is also within the invention to compare a query sequence with two or more databases of biochemical sequence data from the same species, for example with a database of protein sequences and a database of nucleotide sequences (where necessary upon suitable conversion of the query sequence as described above).

Preferably, steps (b) and (c) of the above method are only repeated using as query sequences, in respect of database sequence data relating to a particular organism type, those result sequences which have not previously been used as query sequences in respect of that organism type. In this way, the initial query sequence and all subsequent query sequences may be considered to form a non-cyclical tree structure with a minimum of redundancy, repetition and use of computer processing time. It may be useful, however, to record and output result sequences which have also been previously identified by searching with different query sequences.
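The bookkeeping needed to avoid re-using a query against the same organism type can be sketched with a set of (sequence, organism type) pairs. The function name and data layout are assumptions made for this illustration.

```python
from typing import Iterable, List, Set, Tuple


def select_new_queries(result_ids: Iterable[str], organism: str,
                       already_used: Set[Tuple[str, str]]) -> List[str]:
    """Keep results not yet used as queries against `organism`, and mark them as used."""
    selected: List[str] = []
    for seq_id in result_ids:
        key = (seq_id, organism)
        if key not in already_used:
            already_used.add(key)  # also filters duplicates within this batch
            selected.append(seq_id)
    return selected
```

Because the check is made per organism type, a sequence already used against one organism's data remains eligible as a query against another's, while no sequence is ever aligned twice against the same data.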

Preferably, the stop condition includes the number of levels of iteration of steps (b) and (c) being equal to a preset threshold, to control the number of query and result sequences. Alternatively, an upper limit could be set on the total number of query sequences, or number of actual or different result sequences produced or output. If the query sequence, the alignment parameters and the measure of similarity are well formulated then the method may typically terminate through not finding any further new result sequences after only a few levels of iteration. Therefore, a suitable preset threshold for the number of iterations may be between 2 and 20, preferably between 5 and 15, such as about 10.

Alternatively, the stop condition may be that the result sequences obtained have become convergent, by which is meant that a further iteration of steps (b) and (c), using the set of result sequences from the last iteration as query sequences, does not provide any additional result sequences (i.e. compared to the set of result sequences from the last iteration).
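The two stop conditions just described, a preset number of iteration levels and convergence, can be combined in a single test, as sketched below. The function name, argument layout and the default of 10 levels (taken from the "about 10" mentioned above) are assumptions made for illustration.

```python
from typing import Optional, Set


def should_stop(level: int, new_results: Set[str], known_results: Set[str],
                max_levels: Optional[int] = 10) -> bool:
    """True when the run has converged or the preset iteration threshold is reached."""
    # Converged: the latest iteration produced nothing beyond what was already known.
    converged = new_results <= known_results
    # A max_levels of None disables the threshold ("until convergence").
    limit_reached = max_levels is not None and level >= max_levels
    return converged or limit_reached
```

An upper limit on the total number of query sequences or distinct result sequences could be added in the same way as a further clause.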

When the stop condition is reached, the result sequences may be stored on a suitable data carrier (e.g. as further described below) and/or may be displayed and/or represented (e.g. again as further described below), for example on a computer screen or on paper.

Preferably, and independent of the stop condition(s) used, at least two, and preferably at least three iterations of steps (b) and (c) are used.

The invention also provides apparatus adapted to carry out the steps of the above-mentioned methods, a computer program product comprising computer program instructions to control a computer to carry out the steps of the above-mentioned methods and a computer readable medium, such as a floppy disk or CD-ROM carrying such a computer program product. The computer program product need not include the software required to carry out the actual alignment processes, as it is envisaged that these will frequently be carried out using third party software. Consequently, the step of carrying out an alignment should be understood to mean, where necessary, the step of instructing software or hardware to carry out such an alignment.

The invention also provides a computer system for carrying out analysis of a biochemical sequence database, comprising: a storage area network adapted to store the database; a plurality of alignment nodes, each alignment node being operable, in response to an instruction, to carry out the alignment of a query sequence against at least a part of said database to obtain one or more result sequences which resemble the query sequence according to a measure of similarity; a file server connected to said storage area network and to said plurality of alignment nodes, the file server being operable to obtain from the storage area network and to forward to an alignment node sequence data from said database in response to a request made by an alignment node; and a head node connected to said file server and to each of said alignment nodes; the head node being operable to receive an initial query sequence, to instruct each of one or more of the alignment nodes to carry out an alignment of said initial query sequence against a part or the whole of said database to obtain one or more result sequences which resemble the initial query sequence according to a measure of similarity; the head node also being operable to receive result sequences from the alignment nodes, and to instruct each of one or more of the alignment nodes to carry out an alignment of received result sequences against a part or the whole of the database to obtain one or more further result sequences.

The invention also relates to the result sequences obtained using the method, computer algorithm and/or apparatus of the invention. These result sequences may be in analog, digital, alpha-numerical or any other suitable form (such as in the form of a database); and may be present on a suitable data carrier, such as paper or a floppy disk, disk drive, CD-ROM or another computer readable/writeable medium. The invention also relates to representations of the result sequences, such as alpha-numerical and/or graphic presentations (e.g. in the form of a list, table, graph or figure), which may again be present on a suitable data carrier as described above. Data carriers comprising the result sequences of the invention (and/or representations thereof) are also within the scope of the invention.

Embodiments of the invention will now be described, by way of example only, and with reference to the accompanying drawings of which:

Figure 1 is a flow diagram illustrating a recursive alignment method embodying the invention;

Figure 2 illustrates a data entry screen for controlling a computer program carrying out the method illustrated in figure 1;

Figure 3 illustrates data output by a computer program carrying out the method illustrated in figure 1, during and after completion of analysis of a biochemical sequence database;

Figure 4 illustrates data output in tabular form by a computer program carrying out the method illustrated in figure 1; and

Figures 5 and 6 illustrate computer systems for carrying out analysis of a biochemical sequence database using methods such as that illustrated in figure 1.

Figure 1 is a flow diagram illustrating a method for seeking, in a biochemical sequence database 10, result sequences 12 which resemble an initial query sequence 14 according to a measure of similarity 16. An alignment process 18 is provided which is operable to compare each of one or more biochemical query sequences contained in or referenced by a query list 20 with target sequence data contained in the database 10. Each segment of the target sequence data which resembles a query sequence according to the measure of similarity 16 is identified as a result sequence 12. Each such result sequence is recorded in the form of a reference or index into the database 10. Result sequences are output (22), along with relevant associated data such as corresponding query sequences and iteration levels at which they were found, and the expectation or probability measures of the alignments. Those result sequences which have not previously been used as query sequences are used to form a new query list 20 which is passed to the alignment process. The method iterates, generating new result sequences and carrying out alignment of the new result sequences against the database, until either no new result sequences are found, or another stop condition is met. A simple and robust stop condition is the number of iterations, or the total number of query sequences, reaching a preset maximum. Equally, the stop condition could be implemented as a threshold number of distinct result sequences being reached.

Although illustrated as a single process element, the alignment process 18 will typically be provided by a plurality of instantiations of the same or similar software, preferably executing on multiple processors. The required number of instantiations can be started as required, each instantiation typically processing one query sequence only, and if required, one sequence only against only a part of the database 10. The alignment process may also be carried out, in whole or in part, by appropriate computer hardware, such as programmable logic arrays or dedicated logic.
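The use of several instantiations, each handling one query sequence against one part of the database, can be sketched with a worker pool. The example below uses Python threads purely for illustration; the align_part callable (standing in for one instantiation of the alignment software) and all other names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, Dict, List, Sequence


def align_in_parallel(queries: Sequence[str], database_parts: Sequence[str],
                      align_part: Callable[[str, str], List[str]],
                      workers: int = 4) -> Dict[str, List[str]]:
    """Run one alignment task per (query, database part) pair and merge hits per query."""
    tasks = list(product(queries, database_parts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns outputs in the same order as the submitted tasks.
        outputs = list(pool.map(lambda task: align_part(*task), tasks))
    merged: Dict[str, List[str]] = {query: [] for query in queries}
    for (query, _part), hits in zip(tasks, outputs):
        merged[query].extend(hits)
    return merged
```

Threads are adequate when each task merely launches and waits for an external alignment program; on a cluster the same pattern would apply with tasks dispatched to separate machines.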

In preferred embodiments, the alignment process is provided by a well-known alignment tool, such as a BLAST-program, for example the versions of BLAST available from the NCBI (NCBI-BLAST see: http://www.ncbi.nlm.nih.gov/BLAST/, for example version 2.x) or from Washington University (WU-BLAST). A variety of other biochemical sequence alignment tools are known and could easily be used by the person skilled in the art. Such tools may usually be invoked in computer software using well known scripting languages such as PERL, and the location of the relevant target database and query sequence files passed as command line arguments. Of course, the target database and query sequence files must contain data in a format understood by the alignment tool. "WU-BLAST" version 2.0 accepts data in the biochemical "extended database format" (XDF) and in NCBI BLAST database formats, among others.

The method illustrated in figure 1 is preferably implemented using a script language, such as PERL, implemented on a UNIX workstation or group of UNIX workstations on which the alignment tool is installed for execution, and which have access to a data store or stores containing the biochemical target sequence data 10. Software written to carry out the method preferably also provides suitable graphical and/or textual interfaces to allow a user to control the software, as will now be discussed with reference to figures 2, 3 and 4.
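Passing the database and query file locations as command line arguments might look like the sketch below. It is written in Python rather than PERL for illustration, and it assumes the option letters of the legacy NCBI blastall program (-p program, -d database, -i query file, -e expectation cut-off); other tools, including WU-BLAST, use different conventions, and the file names shown in any call would be hypothetical.

```python
from typing import List


def build_blast_command(database: str, query_file: str, expectation: float,
                        program: str = "blastp") -> List[str]:
    """Build an argument list in the style of the legacy NCBI blastall program."""
    return ["blastall",
            "-p", program,           # search program, e.g. blastp for protein queries
            "-d", database,          # formatted target database
            "-i", query_file,        # file holding the query sequence
            "-e", str(expectation)]  # expectation value cut-off
```

The returned list could be handed to subprocess.run on a machine where the tool is installed. Building the list separately keeps the script testable without the alignment software being present.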

Figure 2 illustrates an input screen generated and controlled by a suitable PERL script executing on a computer workstation. Three "Select Databases" tick boxes 30 allow the user to select combinations of three predefined databases. The "worm" database contains a comprehensive set of polypeptide data relating to the nematode worm C. elegans. Some of this polypeptide data has been carefully reviewed, checked and annotated, while some has been automatically derived from available polynucleotide data with less complete review. The "fly" and "man" databases are similar in nature to the "worm" database, but contain sequence data relating to Drosophila melanogaster and Homo sapiens respectively. The user, in this example, has selected both the "worm" and "fly" databases.

The input screen also provides an initial query name input field 32. The expression "ace-1" which has been typed by a user selects as an initial query sequence the protein acetyl cholinesterase-1, which is set out at a known location in the C. elegans "worm" database. Instead of typing a query sequence name in the input field 32, a user can type or paste an explicit initial query sequence into a paste box 34.

An input cut-off field 36 accepts an expectation parameter for use as a similarity measure 16 in assessing whether a target sequence in the database resembles a query sequence. Equally, a probability parameter, or other parameter or combination of parameters could be used. Facilities to edit parameters for controlling the mode of operation of the alignment process 18 in comparing query sequences with target sequences in the database, such as gap penalty parameters, could also be provided by the input screen of figure 2, if required, instead of being controlled solely by defaults set in the controlling PERL scripts.

An iterations field 38 allows the user to modify the stop condition used to prevent the recursive alignment process from proceeding too far. In figure 2, this field has been set to "until convergence", indicating that the method should continue to iterate until no new result sequences are generated. A number 5 entered in this field would instruct the software to carry out five iterations of the recursive alignment process before stopping, unless the method stopped before this because no new result sequences were available as query sequences.
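The values collected by such an input screen could be validated and converted as in the sketch below. The class and function names, and the exact text forms accepted, are assumptions made for this illustration; only the fields themselves (selected databases, query, cut-off and iterations) come from the description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class RunSettings:
    databases: Tuple[str, ...]
    query: str
    cutoff: float
    max_iterations: Optional[int]  # None means "until convergence"


def parse_settings(databases: Sequence[str], query: str,
                   cutoff_text: str, iterations_text: str) -> RunSettings:
    """Validate the raw form values and convert them to typed settings."""
    if not databases:
        raise ValueError("select at least one database")
    cutoff = float(cutoff_text)
    if cutoff < 0:
        raise ValueError("the expectation cut-off cannot be negative")
    text = iterations_text.strip().lower()
    max_iterations = None if text == "until convergence" else int(text)
    if max_iterations is not None and max_iterations < 1:
        raise ValueError("the number of iterations must be at least 1")
    return RunSettings(tuple(databases), query.strip(), cutoff, max_iterations)
```

Representing "until convergence" as the absence of a limit keeps the stop condition logic in one place, since the convergence test applies in either case.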

A submit button 40 and a reset button 42 allow the user to begin the recursive alignment process illustrated in figure 1 using the selected parameters, or to clear the fields of the input screen.

Figure 3 shows data output by a PERL script using an alignment tool such as WU-BLAST 2.0 to carry out the recursive alignment process of figure 1. The alignment of the initial query sequence against the specified target database or databases is set out in the section headed "Iteration 0" (50). In the present exemplary embodiment the databases selected by the user in the input screen of figure 2 are effectively combined for the purposes of the alignment process, so that each query sequence is aligned against all selected databases. The query name "ace-1" input by a user as shown in figure 2 is represented in the output of figure 3 by an accession number W09B12.1 into the database. The output of iteration 0 (50) indicates that the sequence indexed by W09B12.1 is fetched and compared to the target sequence data in the selected databases ("blasting W09B12.1"). The output of iteration 1 (52) of the recursive alignment process indicates that two result sequences found to resemble W09B12.1 in iteration 0 (50), according to the expectation value set in the input screen of figure 2, were each compared to the target sequence data in the selected databases. These two new query sequences are indexed by AAF54915.1 and Y48B6A.8.

The output of iteration 2 (54) of the recursive alignment process indicates that two new result sequences found to resemble either or both of AAF54915.1 and Y48B6A.8 in iteration 1 (52) were used as query sequences in iteration 2 (54). Although not apparent from this figure, sequence AAF54915.1 was also established as a result sequence, resembling Y48B6A.8, in iteration 1. However, AAF54915.1 was automatically excluded from the query sequences of iteration 2 because it had already been used as a query sequence in iteration 1.

No iterations are indicated beyond iteration 2 (54) and the next block of output is a "hitlist" (56) headed by a statement indicating that the process has "converged", that is, no new result sequences, and therefore no potential new query sequences, were found during iteration 2 (54). The accession numbers of all four result sequences and the original query sequence are listed in the hitlist (56), each appended by a bracketed tag, "(Dm)" indicating a sequence from the "fly" database and "(Ce)" indicating a sequence from the "worm" database.

In figure 4 the results of the iterative alignment process are set out in a tabular form. The first column (60) lists the initial query sequence and the result sequences, each appended by a bracketed integer indicating in which iteration of the alignment process the sequence was first found. Each subsequent column is headed by a relevant query sequence, underneath which is listed the expectation numbers of all alignments found of the particular query sequence with any of the result sequences listed in column 1. It will be seen that AAF54915.1 was found twice, once using query sequence W09B12.1 and once using query sequence Y48B6A.8. The shading of each box of the table containing an expectation number reflects the magnitude of the expectation number. A smaller magnitude indicates a greater chance of an alignment having biological relevance.
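A tabular arrangement of this kind might be assembled as in the following sketch. The function name `tabulate`, the input layout and all expectation values shown are assumptions made for illustration; they are not taken from figure 4.

```python
def tabulate(first_found, alignments):
    """Arrange alignment results as a result-by-query table.

    first_found: dict mapping sequence id -> iteration in which it was first found
    alignments:  list of (query, result, expectation) tuples
    Returns the column headers (query sequences) and one row per sequence.
    """
    # Column headers: each query sequence, in order of first use.
    queries = []
    for query, _result, _expectation in alignments:
        if query not in queries:
            queries.append(query)

    lookup = {(q, r): e for q, r, e in alignments}
    rows = []
    for seq, iteration in first_found.items():
        label = f"{seq} ({iteration})"
        # None marks a cell where the query did not find this sequence.
        rows.append([label] + [lookup.get((q, seq)) for q in queries])
    return queries, rows


# Hypothetical identifiers and expectation values.
first_found = {"Q": 0, "A": 0, "B": 0}
alignments = [("Q", "A", 1e-50), ("Q", "B", 1e-30), ("B", "A", 1e-12)]

headers, rows = tabulate(first_found, alignments)
for row in rows:
    print(row)
```

Here sequence "A" has two entries in its row, one under each query that found it, in the same way that AAF54915.1 appears under two query columns in the table described above.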

A number of variations in the recursive alignment process described above will now be discussed. In the foregoing example each alignment of a query sequence was carried out on any and all databases selected by the user in the input screen of figure 2.

Effectively, all selected databases were combined into a single database although, of course, any one alignment exercise could easily be split up into a number of separate alignment exercises, for example to be carried out by a number of parallel computer processors, each alignment exercise being carried out on a different part of the search data.
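A minimal sketch of splitting one alignment exercise over several parts of the search data is given below. The toy `identity` measure is only a stand-in for a real alignment tool such as WU-BLAST and is not an expectation value; the names `align_part` and `align_split`, the `min_identity` parameter and the sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def identity(a, b):
    """Toy similarity: fraction of positions that match, without gaps."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def align_part(query, part, min_identity):
    """Align the query against one part of the search data."""
    return [name for name, seq in part.items()
            if identity(query, seq) >= min_identity]


def align_split(query, parts, min_identity=0.75):
    """Run one alignment exercise as several parallel sub-exercises."""
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        per_part = pool.map(lambda p: align_part(query, p, min_identity), parts)
        merged = []
        for hits in per_part:   # results arrive in the order of `parts`
            merged.extend(hits)
    return merged


# Hypothetical search data held in two parts.
parts = [
    {"s1": "ACGTACGT", "s2": "TTTTTTTT"},
    {"s3": "ACGTACGA", "s4": "ACGT"},
]
print(align_split("ACGTACGT", parts))
```

The merged result is the same as would be obtained by aligning against the combined data in a single exercise, since each target sequence is scored independently in this toy measure.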

In an alternative embodiment, a query sequence is used in alignment only against target sequence data relating to a different organism type to that of the query sequence. This can either be effected for particular chosen parts of the recursive alignment process, or can be carried out through the whole process with all alignments of each iteration being carried out on sequence data of a single organism type, and all query sequences of that iteration relating to another organism type. If only two organism types or species are represented in the target sequence database or databases, then the organism type of the target data will alternate from iteration to iteration. Although in the described embodiment a single measure of similarity was used throughout a whole recursive alignment process, the measure of similarity could vary, depending on, for example, the iteration level and the organism type of the target sequence data.
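The alternation of target organism type, together with a measure of similarity that varies with the iteration level, may be sketched as follows for the two-organism case. All names (`cross_species_alignment`, `toy_align`, `toy_threshold`), the identifiers and the expectation values are hypothetical, and the thresholds are chosen purely for illustration.

```python
def other_organism(organism):
    """With only two organism types, the target is always the other one."""
    return "fly" if organism == "worm" else "worm"


def cross_species_alignment(initial_query, organism_of, align, threshold_for,
                            max_iterations=10):
    """Align each query only against data of the other organism type."""
    found = [initial_query]
    queries = [initial_query]
    for iteration in range(max_iterations):
        if not queries:
            break
        next_queries = []
        for query in queries:
            target = other_organism(organism_of[query])
            # The similarity threshold may depend on iteration and target.
            for hit in align(query, target, threshold_for(iteration, target)):
                if hit not in found:
                    found.append(hit)
                    next_queries.append(hit)
        queries = next_queries
    return found


# Hypothetical toy data: (query, target organism) -> [(hit, expectation)].
TOY_HITS = {
    ("w1", "fly"): [("f1", 1e-40), ("f2", 1e-3)],
    ("f1", "worm"): [("w1", 1e-40), ("w2", 1e-20)],
    ("w2", "fly"): [("f1", 1e-18)],
}
ORGANISM_OF = {"w1": "worm", "w2": "worm", "f1": "fly", "f2": "fly"}


def toy_align(query, target, max_expectation):
    return [hit for hit, e in TOY_HITS.get((query, target), [])
            if e <= max_expectation]


def toy_threshold(iteration, target):
    # Looser threshold in iteration 0, stricter afterwards (an assumption).
    return 1e-5 if iteration == 0 else 1e-10


print(cross_species_alignment("w1", ORGANISM_OF, toy_align, toy_threshold))
```

In this toy run the worm query is aligned against fly data, the fly result is then aligned against worm data, and so on, so that the organism type of the target data alternates from iteration to iteration.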

An apparatus suitable for carrying out the recursive alignment process, and similar alignment processes, will now be described with reference to figure 5. An input element 102 provides one or more query sequences 103, 104 to control logic 105. The control logic 105 arranges for an alignment engine 106 to carry out sequence alignment of the one or more query sequences against sequence data held in a data store 108. Result sequences are stored in a results store 110. The control logic 105 accesses the results store 110 to select new query sequences to pass to the alignment engine 106, and to pass results to the output 112. The alignment engine 106 may be implemented as a number of parallel processes.

The described apparatus will typically, although not necessarily, be implemented on one or more computer nodes, for example as shown in figure 6 and as discussed below. The input 102 may comprise a user interface enabling a user to type in one or more initial query sequences, or to select such initial query sequences from a data store. The output 112 may comprise a user interface or display, and may also provide means for writing the results to a removable medium such as a CD ROM 114, and means for further analysis of the results.

The control logic 105 implements control of what data from the data store and which query sequences are used for each alignment instance or iteration, and may, for example, select data from one or both of a first organism database 116 and a second organism database 118 held in the data store 108 as required. The control logic also implements a stop condition to control the number of iterations, logic to prevent the same query sequence being used more than once in a single iterative process, any required variations in the measure of similarity, and selection of result data to pass to the output 112.

A further computer system suitable for carrying out the recursive alignment process described above, and similar alignment processes, will now be described with reference to figure 6. This system may be combined with the system shown in figure 5, for example by implementing instances of the alignment engine 106 in the various nodes, and the data store 108 in the file server and storage area network. A storage area network 170 comprising a number of computer units each provided with a large amount of non-volatile data storage capacity is connected to a node network 172 by one or more file servers 174. Biochemical sequence data is stored by the storage area network 170. Subsets of the data are made available to the node network 172 on request. Such subsets may represent the whole or a part of the stored data relating to a particular organism type or other biochemical grouping, or to data relating to more than one organism type.

A head node 176 located on the node network 172 runs a job scheduling system enabling it to allocate alignment exercises to each of a plurality of subnodes including heavy nodes 178 and light nodes 180. Heavy nodes are more powerful in terms of execution speed and throughput of alignment tasks, and consequently more expensive than light nodes. The provision of two or more different types of subnode in a heterogeneous cluster allows cost and performance to be balanced as required, and reduces the cost of achieving a given level of performance when carrying out a recursive alignment. The job scheduling system may be implemented using a variety of known software packages such as Open PBS ("Portable Batch System"). PERL scripts or similar software may be used on the head node 176 to direct the recursive alignment process largely as described above. Jobs involving the alignment of one or more particular query sequences against a selected part of the sequence data stored in the storage area network are passed to the subnodes by the scheduling system. The subnodes are able to request data from the file server 174 or file servers without referring back to the head node 176. Alignment results obtained by the subnodes are passed back to the head node 176 which may then use some or all of these result sequences as query sequences for further alignment jobs passed to subnodes. In this way, the recursive alignment exercise described above with reference to figures 1 to 4 is carried out by a plurality of computer nodes working in parallel.
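The allocation of alignment jobs to heavy and light subnodes may be illustrated by the following sketch. It is not Open PBS and does not reproduce any scheduling policy of the embodiment: it assumes a simple greedy rule in which each job goes to the subnode that becomes free earliest, and the node names, relative speeds and job sizes are hypothetical.

```python
import heapq


def schedule(jobs, node_speeds):
    """Greedily assign each job to the subnode that becomes free earliest.

    jobs:        list of (job_name, work_units) in submission order
    node_speeds: dict mapping node name -> work units completed per time unit
    Returns the per-node job lists and the time at which all jobs are done.
    """
    # Heap of (time at which the node becomes free, node name).
    heap = [(0.0, node) for node in sorted(node_speeds)]
    heapq.heapify(heap)
    assignment = {node: [] for node in node_speeds}
    for name, work in jobs:
        free_at, node = heapq.heappop(heap)
        assignment[node].append(name)
        heapq.heappush(heap, (free_at + work / node_speeds[node], node))
    finish_time = max(t for t, _node in heap)
    return assignment, finish_time


# Hypothetical cluster: one heavy node four times as fast as each light node.
node_speeds = {"heavy1": 4.0, "light1": 1.0, "light2": 1.0}
jobs = [(f"j{i}", 4.0) for i in range(1, 7)]

assignment, finish_time = schedule(jobs, node_speeds)
print(assignment, finish_time)
```

With these assumed figures the heavy node takes four of the six jobs and each light node takes one, and all subnodes finish at the same time, which illustrates how a mix of node types can be kept evenly loaded.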

Claims

1. A method of analysing a biochemical sequence database comprising the steps of: (a) providing an initial query sequence; (b) carrying out an alignment of the query sequence against the database to establish result sequences which resemble the query sequence according to a measure of similarity; and (c) if any result sequences are established and unless a stop condition is met, automatically repeating steps (b) and (c) using each of one or more of said result sequences as a query sequence.

2. The method of claim 1 wherein the database comprises sequence data relating substantially to only a single organism type.

3. The method of claim 1 wherein the database comprises substantial proportions of sequence data relating to each of two or more organism types.

4. The method of claim 3 wherein one or more of the alignments of steps (b) and (c) are carried out on sequence data relating to all of the two or more organism types.

5. The method of claim 3 or 4 wherein one or more of the alignments of steps (b) and (c) are carried out on database sequence data relating to only a single organism type.

6. The method of claim 5 wherein one or more of the alignments of steps (b) and (c) are carried out on database sequence data relating to a different organism type to that of the query sequence for the particular alignment.

7. The method of any preceding claim wherein steps (b) and (c) are repeated using as query sequences, in respect of target database sequence data relating to a particular organism type, only those result sequences which have not previously been used as query sequences in respect of target database sequence data relating to that organism type.

8. The method of any preceding claim wherein the stop condition includes the number of levels of repeats of steps (b) and (c) of the method being equal to or exceeding a threshold.

9. The method of any preceding claim wherein the database comprises at least one of polynucleotide sequence data and polypeptide sequence data.

10. A computer program product comprising computer program instructions to control a computer to carry out the method steps of any of claims 1 to 9.

11. A computer readable medium carrying a computer program product according to claim 10.

12. Apparatus comprising elements adapted to carry out the method steps of any of claims 1 to 9.

13. Apparatus for analysing a biochemical sequence database comprising: a data store holding said database; an input arranged to provide an initial query sequence; an alignment engine arranged to carry out an alignment of a query sequence against the database to establish result sequences which resemble the query sequence according to a measure of similarity; and control logic arranged to pass said initial query sequence to said alignment engine, and to subsequently and iteratively pass selected ones of said result sequences to said alignment engine, if any such result sequences are established and until a stop condition is met.

14. The apparatus of claim 13 wherein the database comprises sequence data relating substantially to only a single organism type.

15. The apparatus of claim 13 wherein the database comprises substantial proportions of sequence data relating to each of two or more organism types.

16. The apparatus of claim 15 arranged such that one or more of the alignments are carried out by the alignment engine on sequence data relating to all of the two or more organism types.

17. The apparatus of claim 15 or 16 arranged such that the alignment engine carries out one or more alignments on database sequence data relating to only a single organism type.

18. The apparatus of claim 17 arranged such that the alignment engine carries out one or more alignments on database sequence data relating to a different organism type to that of a query sequence for the particular alignment or alignments.

19. The apparatus of any of claims 13 to 18, wherein the control logic is arranged to pass a query sequence to the alignment engine only if that query sequence has not previously been used as a query sequence within the same iterative alignment process.

20. The apparatus of any of claims 13 to 19 wherein the control logic is adapted to apply a stop condition which includes a number of iterations of an alignment process being equal to or exceeding a threshold.

21. The apparatus of any of claims 13 to 20 wherein the database comprises at least one of polynucleotide sequence data and polypeptide sequence data.

22. A computer program product comprising computer program instructions to control a computer to implement the control logic of any of claims 13 to 20.

23. A computer system for carrying out analysis of a biochemical sequence database, comprising: a storage area network adapted to store the database; a plurality of alignment nodes, each alignment node being operable, in response to an instruction, to carry out the alignment of a query sequence against at least a part of said database to obtain one or more result sequences which resemble the query sequence according to a measure of similarity; a file server connected to said storage area network and to said plurality of alignment nodes, the file server being operable to obtain from the storage area network and to forward to an alignment node sequence data from said database in response to a request made by an alignment node; and a head node connected to said file server and to each of said alignment nodes; the head node being operable to receive an initial query sequence, to instruct each of one or more of the alignment nodes to carry out an alignment of said initial query sequence against a part or the whole of said database to obtain one or more result sequences which resemble the initial query sequence according to a measure of similarity; the head node also being operable to receive result sequences from the alignment nodes, and to instruct each of one or more of the alignment nodes to carry out an alignment of a received result sequence against a part or the whole of the database to obtain one or more further result sequences.