WO2002014872A2

WO2002014872A2 - Sequence data preparation method and apparatus

Info

Publication number: WO2002014872A2
Application number: PCT/BE2001/000135
Authority: WO
Inventors: Joël VANDEKERCKHOVE; Kris Gevaert; Gregoire Thomas; Nikos Berdenis
Original assignee: Vlaams Interuniversitair Instituut Voor Biotechnologie Vzw
Priority date: 2000-08-14
Filing date: 2001-08-14
Publication date: 2002-02-21
Also published as: AU2001285618A1; WO2002014872A3; GB0019817D0

Abstract

A method for analyzing a cleaved unidentified biomolecule, the method using a library of known sequences of biomolecules, the method comprising the steps of: performing a simulated cleavage on a library of known sequences of biomolecules; generating a second set of data structures representing the cleaved subsequences; applying a function to the generated subsequences in the second set of data structures; receiving experimental results of the operation of the cleaving agent on the unidentified biomolecules; comparing the one or more values with the received results; and selecting and outputting at least one known sequence of the first set of data structures in response to the results of the comparison step. Suitable apparatus for carrying out the method is also described.

Description

SEQUENCE DATA PREPARATION METHOD AND APPARATUS

FIELD OF THE INVENTION

The present invention relates to the preparation of data structures useful in the analysis and identification of unknown molecules such as biomolecules, in particular biopolymers. In particular, the present invention is related to the preparation of data structures derived from sequence information of peptides, proteins, oligonucleotides and oligosaccharides as part of the correlation of peptide fragments or peptide fragmentation patterns obtained experimentally, with those derived theoretically from the sequence information available in a database. The present invention also includes the use of a programmable, general purpose or dedicated computer or workstation to analyze and identify biomolecules based upon experimentally determined spectra of the biomolecules.

TECHNICAL BACKGROUND

The introduction of soft ionization techniques has opened the possibility of analyzing large intact biomolecules, such as peptides, proteins, oligonucleotides and oligosaccharides, by mass spectrometry (Macfarlane and Torgerson in Science 1976, 191:920-925; Barber et al. in J. Chem. Soc. Chem. Comm. 1981, 7:325-327; Karas and Hillenkamp in Anal. Chem. 1988, 60:2299-2301 and Fenn et al. in Science 1989, 246:64-71). This applies in particular, but not exclusively, to proteomic research (Wasinger et al. in Electrophoresis 1995, 16:1090-1094).

These mass spectrometric techniques have proven to be essential tools for fast, sensitive and reliable identification of proteins, of which the sequences are fully (or partly) available in sequence databases (reviewed by Roepstorff in Curr. Opin. Biotechnol. 1997, 8:6-13 and Yates in J. Mass Spectrom. 1998, 33:1-19). Such sequence databases are typically composed of protein sequences, protein fragment sequences and DNA sequences (genomic or cDNA), the latter of which can be translated into amino acid sequences.

Two different types of mass spectrometric data are generally used for protein identification. In the first approach, the isolated protein of interest is enzymatically or chemically cleaved at specific and predictable positions of its sequence (e.g., hydrolyzed by trypsin at the COOH terminal side of Lysine and Arginine residues). Then a mass spectrum, representative of the masses of the peptides present in the mixture, is generated. Such mass spectra are usually described by a histogram displaying a number of peaks, each characterized by its mass to charge (m/z) ratio and intensity value. Using the same cleavage specificity, a list of peptide masses is then generated either from sequences present in a protein database or from those of a DNA (EST or genomic) database. The list of the experimentally derived peptide masses is subsequently compared with the generated list and the protein attaining the highest matching score is considered as the best available identified candidate. This approach, namely using the set of peptides obtained by specific cleavage of a protein to query the available databases, is referred to as peptide mass fingerprinting (PMF) (Cottrell in Pept. Res. 1994, 7:115-124).
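The fingerprinting procedure described above can be illustrated with a minimal Python sketch. The function names (`tryptic_digest`, `peptide_mass`, `count_matches`), the default tolerance and the simple match count used as a score are illustrative assumptions and are not taken from the cited work. Cleavage is simplified to every Lys or Arg without exceptions, and standard monoisotopic residue masses are assumed.

```python
# Monoisotopic residue masses in daltons (standard values). WATER is added
# once per peptide to account for the free N- and C-termini.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence):
    """Split a sequence after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # remaining C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a linear peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def count_matches(sequence, observed_masses, tolerance=0.5):
    """Count observed masses that match any theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed_masses
    )
```

In such a scheme, `count_matches` would be evaluated for every database sequence and the sequence with the highest count reported; practical scoring schemes are generally more elaborate than a plain count.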

In the second approach, protein identification proceeds with information from a single selected peptide. This peptide is either isolated by conventional chromatographic techniques or selected in the mass spectrometer on the basis of its specific m/z value. The latter is achieved by using specific devices (such as ion gates) or by using quadrupoles, which are commonly used in the field of mass spectrometry.

The results of further fragmentation, spontaneous or induced by controlled collision with atoms of a noble gas, can be analyzed employing, for instance, a reflectron field, a second quadrupole or an ion trap. This technique, known as tandem MS or MS-MS, enjoys widespread application and is used in conjunction with both the electrospray ionization mode and the matrix-assisted laser desorption ionization mode (Hunt et al. in Proc. Natl. Acad. Sci. USA 1986, 83:6233-6237; Spengler et al. in Rapid Commun. Mass Spectrom. 1992, 6:105-108). Again, such fragmentation spectra can be described by a histogram with a number of peaks, characterized by mass to charge ratios (m/z) and intensity values. However, these spectra, in contrast to the spectra mentioned earlier (containing peptide masses derived from enzymatic or chemical cleavages), contain masses derived from fragments of selected peptides. Here too, the theoretical fragmentation patterns of each of the candidate peptides are compared to the peptide spectrum obtained experimentally.

US 6,017,693 and 5,538,897 describe the use of peptide fragmentation patterns obtained by mass spectrometric techniques to identify amino acid sequences in databases. A peptide is analyzed in a tandem mass spectrometer to yield an experimental peptide fragmentation mass spectrum. Sequences available in protein or oligonucleotide databases are then used to select one or more peptides with a mass substantially close to the experimentally obtained peptide mass. Theoretical fragmentation patterns from each of these candidate peptides are generated and then compared to the experimental peptide fragmentation spectrum.

Candidate peptides are determined starting from a target mass and the sequences available in databases. The masses of linear stretches of amino acids (as present in proteins or translated oligonucleotide sequences) are summed until the mass of the generated peptide is within tolerance of the target mass or has exceeded it. If the calculated mass is within tolerance, the sequence is marked as a candidate sequence. If the calculated mass is outside the tolerance, the calculation procedure is started again beginning at the next amino acid position in the sequence.
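The summation procedure just described can be sketched as follows. This is only an illustrative reading of the prior-art search, not the implementation of the cited patents: the function name `find_candidates`, its parameters and the optional `end_mass` term for terminal groups are assumptions.

```python
def find_candidates(sequence, target_mass, tolerance, residue_mass, end_mass=0.0):
    """Return (start, end) pairs such that sequence[start:end] matches the target.

    residue_mass maps each residue letter to its mass; end_mass is an
    optional constant for the terminal groups (assumed, default 0).
    """
    candidates = []
    for start in range(len(sequence)):
        mass = end_mass
        for end in range(start, len(sequence)):
            mass += residue_mass[sequence[end]]
            if abs(mass - target_mass) <= tolerance:
                candidates.append((start, end + 1))  # within tolerance
                break
            if mass > target_mass + tolerance:
                break  # target exceeded: restart at the next position
    return candidates


# Toy example with nominal (integer) residue masses for three residues.
nominal = {"G": 57, "A": 71, "S": 87}
hits = find_candidates("GASGA", target_mass=158, tolerance=0.5, residue_mass=nominal)
```

Because every start position in every database sequence is visited for each query, the cost grows with the size of the database, which is the computational load discussed in the following paragraphs.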

The known search algorithms are computationally expensive, particularly when the database is large. Proposals for reducing the computational load (and mainly increasing the speed of execution) include:

1) A directed search, whereby additional information about the protein from which the peptide comes, or the organism from which the protein was isolated, may be used to reduce the search space in the database used.

2) Filtering the database: e.g., an initial search with a protein molecular weight window and a reduced set of fragment ions.

3) Multiplexing the analysis of multiple MS/MS spectra in one pass through the database.

4) Calculating all the protein fragments (peptides) and storing these in a database. Comparison with test values then only requires a comparison rather than calculations.

All of the known techniques have disadvantages. Reducing the search space, e.g., by limiting the search to specific organisms, may fail to identify homologous proteins in other species (cross-species identification).

Since the introduction of the World Wide Web, services have been offered on the Internet which run on computing systems which can be optimized for the application. However, for reasons of privacy, security and convenience, it would be advantageous to be able to carry out protein or gene identification local to the experimental laboratory, especially on a personal computer. The main protein or gene databases are increasing in size rapidly and processing the complete database each time on a personal computer can take a long time.

It is an object of the present invention to provide a method and apparatus for identifying proteins, peptides, oligonucleotides or other molecules, especially biomolecules or biopolymers from mass spectra or other data with a reduced computational load and a high specificity.

It is also an object of the present invention to provide a database residing as a stored data structure which can be used to identify proteins, peptides or other molecules, especially biomolecules or biopolymers from mass spectra or other data, with a reduced computational load and a high specificity.

It is still a further object of the present invention to provide a method for identifying proteins, peptides, oligonucleotides or other molecules, especially biomolecules or biopolymers from mass spectra or other data which can be economically executed on a personal computer.

SUMMARY OF THE PRESENT INVENTION

The present invention provides in one aspect a computer executed method for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the method using a library of known sequences of biomolecules, the method comprising the step of: obtaining from the library a first set of data structures representing known sequences of biomolecules; generating a second set of data structures representing subsequences of at least selected ones of the known sequences by simulated action of the cleaving agent on at least selected ones of the known sequences in the first set of data structures; applying a function to the generated subsequences in the second set of data structures, the function predicting a value of a physical attribute of each subsequence; generating an identifier for each subsequence, sorting the identifiers of the subsequences in accordance with the value of the physical attribute associated therewith, receiving experimental results of values of the physical attribute of molecules generated by an operation of the cleaving agent on the biomolecule; using the received results to select out a subset of the subsequence identifiers based on the values of the physical attribute of the received results; and selecting and outputting at least one known sequence of the first set of data structures by comparing values of the physical attribute of the experimental results with the values of the physical attribute of the subset.

The present invention provides in another aspect a method for preparing a computer system having a processor and a data storage device for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the method using a library of known sequences of biomolecules, the method comprising the steps of: generating a plurality of subsequences from the known sequences in the library by simulated action of the cleaving agent on at least selected ones of the known sequences in the library, and associating each subsequence with an identifier, for each subsequence calculating a value of a corresponding physical attribute of that subsequence, sorting all the plurality of subsequence identifiers using the value of the corresponding physical attribute as the sorting criterion, and storing all the plurality of subsequence identifiers and values of the corresponding physical attributes in the sorted condition, including storing identifiers of subsequences and the corresponding values of the physical attribute which have the same value of the corresponding physical attribute if these are present.
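The sorting and storing steps of this aspect may be pictured with the following Python sketch. The record layout (peptide identifier, parent sequence identifier, mass) and the names `RECORD`, `build_sorted_index` and `load_index` are hypothetical choices made for illustration; the text does not prescribe them.

```python
import struct

# Hypothetical record layout: peptide ID (uint32), parent sequence ID (uint32),
# predicted mass (float64). Little-endian without padding: 16 bytes per record.
RECORD = struct.Struct("<IId")


def build_sorted_index(peptides):
    """Pack (peptide_id, sequence_id, mass) tuples in ascending mass order."""
    # Sorting uses the mass only; entries with equal mass are all retained.
    ordered = sorted(peptides, key=lambda rec: rec[2])
    return b"".join(RECORD.pack(*rec) for rec in ordered)


def load_index(blob):
    """Unpack the stored records, preserving their sorted order."""
    return [RECORD.unpack_from(blob, offset)
            for offset in range(0, len(blob), RECORD.size)]


# Two peptides share the mass 500.25 and both remain in the index.
blob = build_sorted_index([(0, 0, 500.25), (1, 0, 300.5), (2, 1, 500.25)])
records = load_index(blob)
```

Fixed-size records are used here so that the position of the n-th entry can be computed directly, which suits a sorted structure that is later searched by value.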

The biomolecule may be one of a peptide, a protein, an oligonucleotide and an oligosaccharide. The computer or computer system may be a personal computer, e.g. a computer having a motherboard on which is mounted a microprocessor. The microprocessor may be a 64 bit or less, e.g. a 32 bit processor having Random Access Memory and preferably non-volatile storage, e.g. a hard disc or a writable CD-ROM drive.

The small and efficient data structures of entities in databases of the present invention determine the efficiency of the use of the databases on computers and in transmission of the databases over bandwidth-limited media such as the Internet. The method may comprise the step of storing the indicators of the plurality of generated subsequences and attributes of the subsequences as data structures. In particular, the indicators of the plurality of generated subsequences and their attributes may be stored in the mass storage unit of a computational device such as a general purpose or dedicated personal computer, server or workstation or an intermediate node of a telecommunications network. Preferably, taxonomic and/or lexicographic information is stored with each subsequence as an additional information data structure, this additional data structure relating to the sequence from which the subsequence has been derived. The subsequences themselves may be stored as well but the present invention is not limited to this.

The method may also comprise the step of analyzing at least one attribute of the subsequences of the selected ones of the sequences in order to rank the selected sequences based on a matching criterion with respect to the unidentified molecule. The analyzing step may include comparison of predicted masses of the molecules represented by at least selected ones from the plurality of subsequences with the masses determined from results of the experimental cleaving of the biomolecule. Such a step may be used to generate a subset of subsequences, namely those subsequences which match on mass. This subset is more compact than the complete set of subsequences derived from the sequence data in the sequence database. Such a step may be part of peptide MS fingerprinting. The analyzing step may also include calculating predicted mass spectra from at least selected ones of the plurality of subsequences and comparing the predicted mass spectra with mass spectra of the molecules resulting from the experimental cleaving of the biomolecule. Such a step may be part of MS/MS.

The selected ones of the known sequences in the library may be determined in accordance with at least one selection parameter, for example a taxonomic or a lexicographic parameter. The selected ones of the plurality of subsequences may be determined in accordance with at least one selection parameter, for example, a taxonomic or a lexicographic parameter of the sequence from which the relevant subsequence is derived, or C-terminal amino acid information or amino acid content of any one of the plurality of subsequences.

The simulated cleavage step can also allow at least one missed cleavage.

The present invention also includes a program product, executable on a computer, for executing any of its methods. The computer program product may be stored on a computer readable data carrier such as a CD-ROM or similar. The computer may be a personal computer, e.g. a computer having a motherboard on which is mounted a microprocessor. The microprocessor may be a 64 bit or less, e.g. a 32 bit processor.

The present invention may also provide a computer based system for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the computer based system having a processor and a data storage device and access to a library of known sequences of biomolecules, the apparatus comprising: means for generating a plurality of subsequences from the known sequences in the library by simulated action of the cleaving agent on at least selected ones of the known sequences in the library, means for associating each subsequence with an identifier, means for calculating for each subsequence a value of a corresponding physical attribute of that subsequence, means for sorting all the plurality of subsequence identifiers using the value of the corresponding physical attribute as the sorting criterion, and means for storing all the plurality of subsequence identifiers and corresponding values of the physical attribute in the sorted condition, including storing identifiers of subsequences and the corresponding values of the physical attribute which have the same value of the corresponding physical attribute if present.

The present invention also includes a computer based system for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the computer based system having access to a library of known sequences of biomolecules, the system comprising: means for obtaining from the library a first set of data structures representing known sequences of biomolecules; means for generating a second set of data structures representing subsequences of at least selected ones of the known sequences by simulated action of the cleaving agent on at least selected ones of the known sequences in the first set of data structures; means for applying a function to the generated subsequences in the second set of data structures, the function calculating a value of a physical attribute of each subsequence; means for generating an identifier for each subsequence, means for sorting the identifiers of the subsequences in accordance with the values of the physical attribute associated therewith, means for receiving experimental results of values of the physical attribute of the molecules generated by an operation of the cleaving agent on the biomolecule; means for using the received results to select out a subset of the subsequence identifiers based on the values of the physical attribute of the received results; and means for selecting and outputting at least one known sequence of the first set of data structures by comparing the values of the physical attribute of the experimental results with the values of the physical attribute of the subset.

The present invention may also provide a method for analyzing spectra of an unidentified biomolecule determined experimentally from cleavage of the molecule by a cleaving agent, the method using a library of known biomolecules, the method comprising the steps of: inputting at a near location the spectra relating to the unidentified molecule; transmitting the spectra to a remote location having a processing engine for executing the method in accordance with the present invention; receiving at a near location a result of the method in accordance with the present invention. The remote location may be a server and the method of transmitting the spectra may be the use of a telecommunications network such as the Internet.

The present invention may provide a biomolecule analysis tool implemented on a computational device including a memory, the tool being of the type which accesses a library of known sequences of biomolecules, the tool comprising: a data structure within the memory which comprises a plurality of identifiers of subsequences, the subsequences being those obtained by simulated action of the cleaving agent on at least selected ones of the known sequences in the library, each subsequence identifier being associated in the data structure with a corresponding value of a physical attribute of that subsequence, the data structure including identifiers of subsequences which have the same value of the corresponding physical attribute if these are present, and the data structure being in a sorted condition having been sorted using the corresponding physical attribute as the sorting criterion.

The tool preferably comprises means for calculating predicted masses of molecules represented by the plurality of subsequences. The computational device may, for instance, be a general purpose or dedicated personal computer, a workstation or a server. Such devices generally comprise a microprocessor, which can access the memory. The microprocessor may be located on a motherboard.

The present invention may provide a method for processing the sequence content (as well as its accompanying information, e.g., headers) of a first database and then using the results, formatted in a data structure that dramatically increases the performance of identification or search procedures. The processed data may be stored in the form of a data structure, which represents a second database. The processing of the first database to generate the second database need only be performed when the first database has been updated. The first database may include sequences or data concerning chemical molecules such as peptides, proteins, oligonucleotides and oligosaccharides. Simulating the action of a cleaving agent generates the fragments used to generate the data structures of the second database. This database represents all possible peptide fragments that would have been obtained had the molecules (represented by the sequences) been actually cleaved by the cleaving agent. The fact that two fragments have the same physical attribute such as mass does not mean that a fragment is discarded. The data structures representing the fragments may be compared to data on fragments obtained by the experimental application of the same cleaving agent to an unknown molecule. For instance, all possible peptide fragments may be obtained from the protein sequences in the first database by simulated application of a cleavage protein to the sequences in the first database. Representations of these fragments in the form of attributes may then be stored in a second database, indexed in accordance with at least one of a variety of attributes, e.g., based on at least their mass. Optionally, the sequences of these fragments may also be stored.

The whole of the first database may be treated in this way; hence, there is no loss of information. A plurality of second databases may be formed, each one generated by the simulated application of a different cleaving agent.

The improvement in the speed when analyzing a biopolymer, compared with existing methods, relies mainly on the indexing of the peptides. Indeed, selections may be made from the second database in order to speed up identification without loss of accuracy. For example, an identification method may automatically operate only on all the viable (with respect to mass or another attribute) peptide candidates present in the second database, which are likely to be produced by the actual application of the cleavage protein to the unknown protein. This is possible because of the mass-based indexing performed earlier. The method assumes that the protein or a homologue thereof is present in the first database.
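The benefit of mass-based indexing can be illustrated with a binary search over a sorted list of peptide masses. The following Python sketch assumes two parallel lists (ascending masses and the identifiers of the parent sequences); the function names, the default tolerance and the hit-count ranking are illustrative assumptions rather than the scoring defined by the invention.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def peptides_in_window(masses, sequence_ids, target, tolerance):
    """Return parent sequence IDs of peptides within target +/- tolerance.

    masses must be sorted ascending; sequence_ids is a parallel list.
    """
    lo = bisect_left(masses, target - tolerance)
    hi = bisect_right(masses, target + tolerance)
    return sequence_ids[lo:hi]


def rank_sequences(masses, sequence_ids, observed, tolerance=0.5):
    """Rank parent sequences by the number of observed masses they explain."""
    hits = Counter()
    for mass in observed:
        # A sequence is counted at most once per observed mass.
        hits.update(set(peptides_in_window(masses, sequence_ids, mass, tolerance)))
    return hits.most_common()


masses = [300.5, 500.25, 500.25, 812.0, 1200.75]
parents = [7, 3, 7, 7, 3]
ranking = rank_sequences(masses, parents, observed=[500.3, 812.1])
```

Each lookup inspects only the peptides inside the mass window, so its cost depends on the logarithm of the index size plus the number of hits rather than on the size of the whole database.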

Furthermore, use may be made of the fact that each peptide may be linked to additional information accompanying the original sequence, e.g., source organism, source protein. This additional information allows further peptide filtering (resulting in faster processing) through incorporation of additional criteria. A non-limiting list of admissible criteria may include: prior user knowledge about the sequence under examination and/or specific features of the experimental spectrum. The employment of such a variety of criteria may lead to a significant improvement in the accuracy of the predictions as well as the speed of reaching them. Both MS fingerprinting and MS/MS fragmentation spectrum analysis and a combination of the two, may be used with the database scheme in accordance with the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a schematic representation of the first and second databases in accordance with an embodiment of the present invention.

Fig. 2 is a flow diagram of the use of taxonomic and lexicographic information for sequence selection in accordance with an embodiment of the present invention.

Fig. 3 is a flow diagram of MS fingerprinting in accordance with an embodiment of the present invention.

Fig. 4 is a flow diagram of MS/MS in accordance with an embodiment of the present invention.

Fig. 5 is a schematic representation of a computer system for use with the present invention.

DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS OF THE PRESENT INVENTION

The present invention will be described with reference to certain embodiments and to certain figures but the present invention is not limited thereto but only to the claims.

The application of mass spectrometry in protein identification studies deals essentially with specific peptide or peptide fragment masses observed in an experimental mass spectrum.

Cleavage of proteins results in the formation of peptides. Accurate mass analysis of a sufficient number of these peptides can lead to the identification of the starting protein provided that its sequence is available in sequence databases (a technique known as peptide mass fingerprinting). In many cases, experimental peptide mass fingerprints do not lead to unambiguous protein identification. For this reason, fragments from a selected peptide are generated in a mass spectrometer and the obtained fragmentation spectrum may be used to identify its parent peptide (and thus protein) in a protein or oligonucleotide sequence database.

The peptides mentioned above are derived, in accordance with the present invention, after treatment of the protein under investigation with a cleaving agent. The peptide sequences, which are stored in a second database, are calculated in silico from the protein sequences available in the first database using the same cleaving agent (and thus cleavage rules) employed in the experiment.

A non-limiting list of examples of cleaving agents suitable for use with the present invention is: enzymatic cleaving agents, for example proteases such as trypsin, chymotrypsin, endoproteinase Lys-C, or similar; and chemical cleaving agents, for example protein cleaving agents such as cyanogen bromide or similar. The present invention allows both complete as well as incomplete protein cleavage.

Methods used in protein identification studies, in accordance with the present invention, have as input information obtained from experimental mass spectra; for example the m/z-values of (protein or peptide) fragments as well as their intensity values. In general, the methods of the present invention compare the experimental mass spectrum with computed (theoretical) mass spectra of a restricted number of candidates (originating from protein or oligonucleotide sequences stored in sequence databases) determined by the simulated application of the relevant cleaving agent to the sequences stored in the first database. The representations of the cleaved fragments are stored in the second database. The most likely candidate is selected based on a match-scoring algorithm or a similar fitness criterion.

Methods in accordance with the present invention run preferably on a processing engine such as a server, a workstation or a personal computer. These may be general-purpose devices or dedicated to the task concerned. The computational device may have standard hardware and software components including memory and an arithmetic logic unit. The memory may be provided in various forms e.g., a mass storage unit such as a hard disc, a CD-ROM, RAM. The databases, in accordance with the present invention, may be stored on a suitable data carrier such as a CD-ROM, a hard disk or similar in the form of a data structure. Each stored protein/oligonucleotide in the first database is examined by the methods in accordance with the present invention and in one embodiment each sequence is operated on by a simulated cleaving agent in order to generate a protein fragmentation pattern, derived at the runtime of the method.

In a further embodiment of the present invention, the identification procedure is freed from the generation of the cleaved peptides at runtime. Simulated protein cleavage into peptides is performed globally on the first sequence database, and the resulting peptide sequences and/or attributes of these sequences are then stored in the second database in the form of data structures. Recalculation is only necessary for updates to the first sequence database and can be performed during periods of low computational load. The data structures, in accordance with the present invention, may be enriched in that the derived peptides may be indexed or sorted in the second database according to a suitable set of one or more parameters, e.g., by their mass, which is the most significant characteristic in mass spectrometric analyses. The mass can be at least one of average, monoisotopic, negative ion mode and positive ion mode mass. Multiple second databases may be generated by simulating, in silico, the application of other cleaving agents to the sequences stored in the first database, thus forming a set of data structures.
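How a neutral peptide mass relates to the value observed in positive or negative ion mode is not spelled out in the text. The short Python sketch below assumes the common convention that ions are formed by gain or loss of protons; the function name `ion_mz` and its parameters are illustrative, and the neutral mass supplied may be either an average or a monoisotopic mass.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def ion_mz(neutral_mass, charge=1, mode="positive"):
    """m/z of a peptide ion, assuming (de)protonation as the charging process."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    if mode == "positive":   # [M + zH]z+
        return (neutral_mass + charge * PROTON_MASS) / charge
    if mode == "negative":   # [M - zH]z-
        return (neutral_mass - charge * PROTON_MASS) / charge
    raise ValueError("mode must be 'positive' or 'negative'")


singly_protonated = ion_mz(1000.0)             # approximately 1001.007276
doubly_protonated = ion_mz(1000.0, charge=2)   # approximately 501.007276
```

Storing the neutral mass and deriving the ion-mode values on demand is one option; sorting a second database directly on a precomputed ion-mode mass, as the paragraph above permits, is another.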

In this embodiment, unnecessary examination (and fragmentation) of each stored sequence is avoided and the method focuses, instead, directly on the cleaved peptides. Information with respect to the originating protein is stored, preferably alongside the peptide sequences, in a specific format, such that going back from the peptides to the original sequence is straightforward.

The peptide-based data structures of the second database, of the above embodiments, facilitate the inclusion in search algorithms of selection criteria other than standard spectrum-derived information. In particular, such selection criteria may include one or more of the following: molecular weight, isoelectric point of the protein under investigation, species from which it was isolated and possible keywords e.g., referring to the (known) subcellular location of the protein. Use of these selection criteria may help to speed up significantly the search methods in accordance with the present invention and may significantly increase their specificity.
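A possible way of applying such selection criteria before any spectrum comparison is sketched below in Python. The `SequenceInfo` record, its field names and the `select_sequences` function are hypothetical and serve only to illustrate the filtering idea; the example values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceInfo:
    sequence_id: int
    molecular_weight: float      # daltons
    isoelectric_point: float
    species: str
    keywords: frozenset = frozenset()


def select_sequences(records, species=None, mw_range=None,
                     pi_range=None, keyword=None):
    """Return the IDs of sequences passing every criterion that was supplied."""
    selected = set()
    for rec in records:
        if species is not None and rec.species != species:
            continue
        if mw_range is not None and not mw_range[0] <= rec.molecular_weight <= mw_range[1]:
            continue
        if pi_range is not None and not pi_range[0] <= rec.isoelectric_point <= pi_range[1]:
            continue
        if keyword is not None and keyword not in rec.keywords:
            continue
        selected.add(rec.sequence_id)
    return selected
```

Peptides whose parent sequence identifier is absent from the returned set can then be skipped, which reduces the number of candidates that need to be scored.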

1. DESCRIPTION OF THE GENERIC FORMAT OF PROTEIN/DNA SEQUENCE DATABASES.

These (first) databases are generally available as a data structure in a text format such as the ASCII file format. However, the present invention is not limited thereto. The first database may be in any suitable form, e.g., three-dimensional representations of proteins or nucleotides, provided these can be manipulated and searched as required by the present invention.

Typically, each protein sequence (or oligonucleotide) is preceded by its header. The header of each sequence follows immediately after termination of the previous sequence (see Fig. 1A and Table 0). The header provides information about the sequence that follows. Its length varies. It usually starts with the name, a sequence of characters uniquely identifying the protein (or the oligonucleotide). It may then include optional information such as empirical names of the protein, the name of the organism from which it was isolated and sequenced, names of possible homologues, a brief description of its biological function, its mass or similar. The actual sequence that follows is given in lines of fixed length, which is usually file specific.

2. DESCRIPTION OF THE SECOND DATABASE FORMAT IN ACCORDANCE WITH AN EMBODIMENT OF THE PRESENT INVENTION

The new format in accordance with an embodiment of the present invention comprises one or more files or data structures. For example, starting from a known input file (Fig. 1A) from a first database as described above, a second database is created as a data structure or set of data structures. This second database may comprise a sequence information file (File 1, Fig. 1B), a first binary file, and two additional binary files (Files 2 and 3, Fig. 1B) derived per considered cleaving agent. These additional files are the peptide file index, File 3, and the peptide information file, File 2. An additional part of the second database structure (see Tables 4, 5 for a description of the 7 corresponding files) handles the information in the header of the sequences.

In what follows, the data structure of the second database is described in the case where the initial sequence file consists of protein sequences. When the initial file contains oligonucleotide sequences, such as genomic DNA sequences, an extension of the proposed structure may be used (described at the end of the document).

The data structures stored in all these files are presented in Tables 0 to 5. Figs. 1A and 1B show the main database file structures.

2.1 Sequence information file (File 1)

In this file, information about the sequences is stored - see Table 1. Each sequence of the original file (Table 0, Fig 1A) obtained or created from the first database is assigned an indexing number (sequence ID), which uniquely identifies the sequence. The sequence ID is just the order in which the sequence is stored in the original file. Information about each sequence is preferably standardized and is of equal byte size for each sequence. If this is not the case, it is preferred that the original file is first prepared with a suitable data preparation method. Saved information per sequence is preferably: sequence ID, and a set of attributes such as molecular weight and isoelectric point of the protein, offset of the sequence header in the input file and the length of the header.

This file will be accessed many times by any of the methods of the present invention. Its relatively small size allows it to reside in random access memory (RAM) of a processing engine throughout execution. In this way, numerous time-consuming I/O (input/output) processes (disk accesses) are avoided. The advantage of using this file lies in the fact that by knowing the sequence ID at most two I/O operations are needed to read the header or the actual sequence from the input file.
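By way of illustration only, the equal-size records of the sequence information file might be sketched as follows in Python. The field order, the field widths, the 0-based sequence ID and the names `SEQ_RECORD`, `pack_sequences` and `read_sequence_record` are assumptions made for this example; the actual layout is that of Table 1.

```python
import struct

# Hypothetical fixed-size record (assumed layout, not taken from Table 1):
# sequence ID, molecular weight, isoelectric point,
# offset of the header in the input file, length of the header.
SEQ_RECORD = struct.Struct("<IddQI")

def pack_sequences(rows):
    """Serialize rows, ordered by sequence ID, into one bytes object."""
    return b"".join(SEQ_RECORD.pack(*row) for row in rows)

def read_sequence_record(file1, seq_id):
    """Direct access: the record of `seq_id` starts at seq_id * record size."""
    return SEQ_RECORD.unpack_from(file1, seq_id * SEQ_RECORD.size)

# Illustrative values only.
rows = [
    (0, 15867.2, 6.8, 0, 42),
    (1, 66432.9, 5.6, 310, 57),
]
file1 = pack_sequences(rows)          # small enough to stay in RAM
record = read_sequence_record(file1, 1)
```

Because every record has the same byte size, the sequence ID alone locates a record without any search, which is what allows the header or the sequence to be read from the input file with very few I/O operations.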

2.2 Peptide information file (File 2)

In this file, information about the results of the simulated cleaving process on the original sequences is stored - see Table 2. Hence, this file includes definitions of cleaved peptides generated by a simulated cleavage process. The definitions preferably describe attributes of the derived sequences. A different file is derived for each cleaving agent. Information about some or all the possible cleaved peptides, generated by simulated cleavage by the assumed agent on the sequences stored in the input file, is stored in this file. This information is preferably standardized in equal size byte "structures". In experimental practice, the use of a cleaving agent may fail to cleave at certain positions. This results in peptides longer in length than might be expected. By design, this has been incorporated in the second database by allowing up to a given number of missed cleavages when calculating the peptides, which result from the simulated action of the cleaving agent on the protein sequences in the first database. Because this leads to an increased number of calculated peptides, the number of these missed cleavage sites per generated peptide is preferably limited. For instance, missed cleavage sites may be allowed in the calculations depending on certain parameters: e.g., provided that the mass of the generated peptide lies within a predefined mass window and/or residue length.
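A minimal sketch of a simulated cleavage with a limited number of missed cleavages is given below. The cleavage rule used (cut after K or R unless the next residue is P) is a trypsin-like rule assumed purely for illustration, and the function names and the residue-length limits are hypothetical.

```python
def cleavage_sites(sequence, cleave_after="KR", blocked_by="P"):
    """Return cut positions (index just after a cleaved residue) plus both ends."""
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in cleave_after and sequence[i + 1] not in blocked_by:
            sites.append(i + 1)
    sites.append(len(sequence))
    return sites

def simulate_cleavage(sequence, max_missed=1, min_len=1, max_len=50):
    """Yield (offset, length, missed_cleavages) for every allowed peptide."""
    sites = cleavage_sites(sequence)
    for i in range(len(sites) - 1):
        for missed in range(max_missed + 1):
            j = i + 1 + missed
            if j >= len(sites):
                break
            length = sites[j] - sites[i]
            # Limit the number of generated peptides by residue length.
            if min_len <= length <= max_len:
                yield sites[i], length, missed

# Example: one missed cleavage allowed on a short, made-up sequence.
peptides = list(simulate_cleavage("AKRPGKC", max_missed=1))
```

Only the offset and the length of each peptide are produced here, in line with the idea that the peptide sequence itself can be read back from the input file when needed.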

Preferably, additional information, associated with the results of the simulated cleavage process, is stored. This additional information may include one or more of the following: The sequence ID of the originating sequence, the molecular weight of the peptide, the offset of the first residue of the peptide (or of its corresponding first nucleotide) within the originating sequence in the input file and the length of the peptide. Extra information may also be stored. This may include information referring to the presence of specific amino acids within the peptide sequence (e.g., the C-terminal amino acid and/or the presence of amino acids susceptible to any known post-translational modification) and, when necessary, a flag indicating the reading frame used for oligonucleotide translation.

These stored data structures, generated for each peptide, are preferably sorted or indexed according to at least one parameter or attribute. The preferred parameter is the molecular weight of the generated peptide. As a consequence, given a mass window for the peptides to be searched, a search will start by loading immediately into the RAM only the set of all suitable candidate peptides, the ones within the provided mass window. This keeps the I/O operations to a minimum.

It should be noted that given a molecular weight of a peptide it is likely that many peptides have that molecular weight. The presence of the sequence ID links the peptide to its "parent" sequence (in the input file) and to all the information associated with it (in the sequence information file). Also, the offset of the peptide in the sequence and its length allow the direct reading of the actual peptide sequence from the input sequence file.
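The following sketch illustrates, under stated assumptions, how peptide records sorted by mass may be searched for a mass window and how the peptide sequence may be recovered from the sequence ID, offset and length. The record fields, the in-memory lists standing in for the files, and the names `PeptideRecord`, `peptides_in_window` and `peptide_sequence` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple

class PeptideRecord(NamedTuple):
    mass: float    # molecular weight of the peptide
    seq_id: int    # ID of the parent sequence
    offset: int    # position of the first residue in the parent sequence
    length: int    # number of residues

def peptides_in_window(records, low, high):
    """Return the records with low <= mass <= high; `records` is sorted by mass."""
    masses = [r.mass for r in records]   # a real system would use the index file
    return records[bisect_left(masses, low):bisect_right(masses, high)]

def peptide_sequence(parents, record):
    """Recover the residues from the parent sequence; no peptide text is stored."""
    return parents[record.seq_id][record.offset:record.offset + record.length]

# Two made-up parent sequences and approximate monoisotopic peptide masses.
parents = ["AKRPGKC", "MKTAYIAK"]
records = sorted([
    PeptideRecord(217.14, 0, 0, 2),
    PeptideRecord(456.28, 0, 2, 4),
    PeptideRecord(277.15, 1, 0, 2),
    PeptideRecord(217.14, 1, 6, 2),
])
hits = peptides_in_window(records, 217.0, 218.0)
```

In this example two peptides from different parent sequences share the same mass; the sequence ID in each record is what ties each of them back to its own parent.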

2.3 Peptide index (File 3)

This file, see Table 3, allows fast retrieval of all the peptides from the peptide information file (File 2), which have a specific characteristic within a given window of values. A different file is derived for each cleaving agent and corresponds to a single peptide information file. The characteristic is, preferably, the molecular weight of the peptides. The current file also consists of data structures of standardized information. As mentioned earlier, it is advantageous to have the ability to identify all the derived peptides having a common characteristic or attribute, e.g., mass within a predefined window. A sufficiently small mass step (STEP) can be predefined as a default value, which allows the partitioning of the peptide mass range in (MAX-MIN)/STEP subunits, where MAX and MIN are the predefined maximum and minimum allowed masses.

What is stored in the file, for each mass subunit, is preferably the molecular weight of the peptide and the offsets in the peptide information file (File 2) of the first and last occurrences of cleaved peptides having this particular molecular weight.
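As a sketch only, the partitioning of the mass range into (MAX-MIN)/STEP subunits and the storage of first and last positions per subunit might look as follows. The use of record positions instead of byte offsets, the marker -1 for an empty subunit and the name `build_mass_index` are assumptions of this example.

```python
def build_mass_index(sorted_masses, min_mass, max_mass, step):
    """Return one (subunit_start_mass, first, last) entry per mass subunit.

    `first` and `last` are inclusive positions in the mass-sorted peptide
    records; an empty subunit is marked with (-1, -1).
    """
    n_bins = int((max_mass - min_mass) / step)
    index = [(min_mass + b * step, -1, -1) for b in range(n_bins)]
    for pos, mass in enumerate(sorted_masses):
        if mass < min_mass:
            continue                      # below the allowed mass range
        b = int((mass - min_mass) / step)
        if b >= n_bins:
            continue                      # above the allowed mass range
        start, first, _ = index[b]
        index[b] = (start, pos if first < 0 else first, pos)
    return index

masses = [500.2, 500.7, 501.1, 503.9]
index = build_mass_index(masses, min_mass=500.0, max_mass=504.0, step=1.0)
```

A mass window query then reduces to computing the subunits that overlap the window and reading the contiguous block of peptide records between the first position of the lowest subunit and the last position of the highest one.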

2.4 Taxonomic name / taxonomic ID table (Table 4)

Given a taxonomic name, Files 401 and 402 provide a corresponding taxonomic ID. Note that a single taxonomic ID can correspond to several names, for instance a single ID is associated with "human", "man" and "Homo sapiens". The table may be indexed with a hash table using names to allow rapid query.

2.5 Sequence ID / taxonomic ID table and index (Table 4)

Files 403 and 404 allow rapid access to the array of taxonomic IDs corresponding to a sequence ID (see sequence information file 14 above). The table stores a list of species and the corresponding lineage for each sequence. For instance, if a sequence has the word "human" in its corresponding header, the IDs for eukaryotes, Metazoa, Chordata, Craniata, mammals, primates, Catarrhini, Hominidae and Homo sapiens will be stored for that sequence. The table may be indexed with a hash table on the sequence ID.
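The two taxonomic tables may be pictured with the following minimal sketch, in which Python dictionaries play the role of the hash tables. The taxonomic ID numbers are placeholders, the lineages are shortened, and the names `NAME_TO_TAXID`, `SEQ_LINEAGE`, `taxonomic_id` and `sequences_in_taxon` are hypothetical.

```python
# Several names may map to one taxonomic ID (placeholder numbers).
NAME_TO_TAXID = {
    "human": 9, "man": 9, "homo sapiens": 9,
    "primates": 6,
    "mammals": 5,
}

# For each sequence ID, the taxonomic IDs of its species and lineage.
SEQ_LINEAGE = {
    0: (5, 6, 9),   # a human sequence
    1: (5, 6),      # a non-human primate sequence
    2: (5,),        # another mammalian sequence
}

def taxonomic_id(name):
    """Translate a taxonomic name into its ID, or None if unknown."""
    return NAME_TO_TAXID.get(name.strip().lower())

def sequences_in_taxon(name):
    """Return the sequence IDs whose stored lineage contains the named taxon."""
    taxid = taxonomic_id(name)
    return sorted(s for s, lineage in SEQ_LINEAGE.items() if taxid in lineage)
```

Storing the whole lineage with each sequence means that a query for a higher taxon such as "primates" also selects the sequences annotated only with a species name.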

2.6 Keyword / offset table and index (Table 5)

Files 501 and 502 provide, given a keyword, a corresponding offset in the sequence ID/keyword table. The table may be indexed with a hash table using keywords to allow rapid query.

2.7 Sequence ID / keyword table (Table 5)

File 503 provides the list of sequence IDs corresponding to a keyword. It contains a list of sequence IDs for each keyword. This table is sorted by keyword and is accessed at the offset given by the keyword / offset File 502.
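A possible arrangement of the keyword / offset table and of the sequence ID / keyword table is sketched below. Storing a count next to each offset is an assumption of this example (the description only mentions the offset), and the function names are hypothetical.

```python
def build_keyword_tables(keyword_to_ids):
    """Return (offsets, flat_ids).

    `flat_ids` is one flat list of sequence IDs grouped by keyword, sorted by
    keyword; `offsets` maps each keyword to (start, count) within `flat_ids`.
    """
    offsets, flat_ids = {}, []
    for keyword in sorted(keyword_to_ids):
        ids = sorted(keyword_to_ids[keyword])
        offsets[keyword] = (len(flat_ids), len(ids))
        flat_ids.extend(ids)
    return offsets, flat_ids

def sequence_ids_for(keyword, offsets, flat_ids):
    """Look up the keyword's offset, then read its block of sequence IDs."""
    start, count = offsets.get(keyword, (0, 0))
    return flat_ids[start:start + count]

offsets, flat_ids = build_keyword_tables(
    {"kinase": [4, 1], "actin": [2], "membrane": [1, 3]}
)
kinase_ids = sequence_ids_for("kinase", offsets, flat_ids)
```

Keyword lists obtained in this way can be combined with set operations, for example `set(kinase_ids) & set(sequence_ids_for("membrane", offsets, flat_ids))` for a query joining two keywords with AND.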

3. APPLICATIONS

The installation and use of the second database structure in accordance with the present invention may run on a processing engine such as a personal computer system, a workstation or a server. The size of the derived files is well within the capacity of conventional storage devices.

The present invention provides a variety of methods, executable on a processing engine, suitable to address specific problems in protein analysis and/or identification by accessing the second database. These methods include routines recognizing specific data structures, which will be described as selection methods in analogy with those methods used to generate the peptide databases, which may be called peptide-generating methods. The generation of the second database requires access only to the desired first protein (or oligonucleotide sequence) database. This access may be provided by any suitable means, e.g. via a LAN, a WAN, the Internet or locally from mass storage devices. The present invention may be implemented as a relational database environment instead of a flat file scheme.

3.1 Use of selection criteria

Selection criteria, mirroring prior knowledge about the sequence being investigated, may be used in any of the selection methods of the present invention. They allow the early rejection of many candidates thus increasing both execution speed and selectivity. The second database structure, in accordance with the present invention, preferably incorporates such knowledge.

The logical steps involved in the implementation of the selection criteria preferably correspond to a minimum number of time-consuming I/O operations. This constitutes one of the major advantages of the database scheme in accordance with the present invention. Generally, two distinct sets of selection criteria are defined: those referring to the protein itself and those referring to a particular peptide.

3.1.1 Sequence selection criteria

The present invention includes the storing and use of a selection criterion that relates to each protein or oligonucleotide sequence used for generation of the second database. Suitable selection criteria include one or more of the following: the molecular weight of the parent protein, its isoelectric point and any information contained in the sequence header (species and keywords). For example, the sequence information file, file 1, may store the molecular weight and the isoelectric point of each protein. Thus, a method in accordance with the present invention may approve/reject a protein just by knowing its sequence ID, whenever information about the mass or isoelectric point of the protein under investigation is available.

It needs to be emphasized that no access to the original data file (Fig 1A) is required. The structure of the sequence information file 1 allows direct access to the criteria-related information once the sequence ID is known. This is particularly useful in methods that go through the analysis of individual peptides, like MS/MS-based protein identification algorithms. The sequence ID of the parent sequence of a candidate peptide enables the algorithm to approve/reject it, with respect to the above selection criteria, before any retrieval (and/or further algorithmic processing) of the actual peptide sequence. This achieves a gain in speed and selectivity and the design of the second database may be adapted in accordance with the present invention to achieve this advantage.
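The early approval or rejection of a candidate from its sequence ID alone may be sketched as follows. The in-memory list `seq_info` stands in for the sequence information file held in RAM; its layout, the numeric values and the name `passes_sequence_criteria` are assumptions made for illustration.

```python
def passes_sequence_criteria(seq_info, seq_id, mass_window=None, pi_window=None):
    """Approve or reject a parent sequence using only attributes held in RAM."""
    mass, pi = seq_info[seq_id]
    if mass_window and not (mass_window[0] <= mass <= mass_window[1]):
        return False
    if pi_window and not (pi_window[0] <= pi <= pi_window[1]):
        return False
    return True

# (molecular weight, isoelectric point) per sequence ID; illustrative values.
seq_info = [(15867.2, 6.8), (66432.9, 5.6), (42051.0, 5.3)]
kept = [
    seq_id
    for seq_id in range(len(seq_info))
    if passes_sequence_criteria(
        seq_info, seq_id, mass_window=(40000, 70000), pi_window=(5.0, 5.5)
    )
]
```

No peptide sequence and no part of the original input file is read during this test, so rejected candidates cost no disk access.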

3.1.2 Sequence selection criteria based on taxonomic and lexicographic restrictions

The present invention includes the reduction in the number of selected peptides by storing taxonomic and/or keyword information. A series of taxonomic names and keywords associated with logical operators (AND, OR, NOT) may be used to restrict selection. Only the peptides matching these criteria will be retained. The list of peptides derived from the selection procedure is reduced progressively by matching it to different user-defined criteria. Such a screening method may operate as follows (Fig. 2):

(1) Store the taxonomic IDs for each of the selected sequence IDs using the sequence ID / taxonomic ID Files 403, 404 (step 101).

(2) Translate a user-defined taxonomic query (series of taxonomic names and Boolean operators) into a series of taxonomic IDs and Boolean operators by using the taxonomic name / taxonomic ID Files 401, 402 (step 102).

(3) Match the translated query with the taxonomic IDs for each of the selected sequence IDs (step 103). Only the subset of the sequence IDs that fulfils the selection criteria is kept (step 104 - first subset).

(4) From a user-defined keyword query, a list of items (keyword + Boolean operator) is extracted (step 106).

(5) Each item is then processed sequentially with the following steps:

(a) Locate the list of sequence IDs corresponding to the keyword by using the keyword / offset Files 501, 502 (step 107).

(b) Match this list with the list of sequence IDs by using the sequence ID/ keyword, File 503 (step 108).

(c) Retain only the matching subset of sequence IDs - second subset.

(6) Return the subset of the peptide list that corresponds to the list of sequence IDs returned (second subset - step 109).

3.1.3 Peptide selection criteria

A method in accordance with the present invention may analyze fragmentation spectra of individual peptides (e.g., obtained by MS/MS or by PSD) and may include a straightforward use of selection criteria. In the peptide information file, file 2, the information associated with specific amino acid content can be used to decide the relevance of a candidate peptide before any spectrum analysis. For instance, following (automated) inspection of peptide fragmentation mass spectra, C-terminal amino acid information and specific amino acid content can be used as selection criteria.
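One way of storing the amino acid content information with each peptide record is a small set of bit flags, sketched below. The particular flags, their bit values and the names `peptide_flags` and `matches` are hypothetical and chosen only to illustrate the test made before any spectrum analysis.

```python
# Hypothetical bit flags stored with each peptide record.
HAS_MET = 1    # peptide contains methionine
HAS_CYS = 2    # peptide contains cysteine
CTERM_K = 4    # C-terminal residue is lysine
CTERM_R = 8    # C-terminal residue is arginine

def peptide_flags(peptide):
    """Compute the flags once, when the peptide information file is built."""
    flags = 0
    if "M" in peptide:
        flags |= HAS_MET
    if "C" in peptide:
        flags |= HAS_CYS
    if peptide.endswith("K"):
        flags |= CTERM_K
    if peptide.endswith("R"):
        flags |= CTERM_R
    return flags

def matches(flags, required):
    """True if every required flag is set for the candidate peptide."""
    return (flags & required) == required
```

For example, if inspection of the spectrum suggests a C-terminal lysine and the presence of methionine, a candidate is kept only when `matches(flags, CTERM_K | HAS_MET)` holds, and the peptide sequence of a rejected candidate is never read.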

3.2 Peptide mass fingerprinting and peptide fragment spectra analysis

The database-processing scheme in accordance with the present invention solves problems encountered in biological mass spectrometry. The present invention includes implementations of both peptide mass fingerprinting and MS/MS (or PSD) spectral analysis, based on the use of the second database (applicable for both protein and oligonucleotide sequence databases).

A combination of these two methods is also included in the present invention, which brings MS/MS to the aid of peptide mass fingerprinting if and when the selectivity of the latter is not satisfactory enough.

3.2.1 Peptide mass fingerprinting

The problem addressed by peptide mass fingerprinting is: given the masses of protein fragments (peptides) present in a mass spectrum of an experimentally cleaved protein, identify the protein from which the peptides originate and/or derive a scoring list of the most likely protein candidates by matching the experimental peptide mass pattern to expected corresponding patterns of protein/DNA sequences stored in a first database. A method in accordance with an embodiment of the present invention includes the following steps (Fig. 3):

(1) Create an array of integers (called a scoring array) whose length is equal to the number of sequences stored in the first database (step 201). Initialize the elements of this array by setting them to zero. Each integer element of that array is uniquely associated with a single stored sequence by equating its index to the sequence ID. The value of each array element will serve as a score for the corresponding sequence.

(2) For each observed peak (generated by protein fragmentation) in the experimental spectrum, load into memory (RAM) all the associated peptide information structures from the peptide information file, file 2 (step 202) which lie within the mass error. With reference to a predefined set of post-translational modifications, the previous method may be modified to allow the consideration of modified peptides because information about the specific amino acids involved is already stored in the sequence information structures. Inclusion of such peptides is achieved as follows (an outline): To consider all possible peptides of a given mass and with one oxidized methionine, the method loads additionally all the sequences with masses smaller than the given mass by exactly the effect of oxidization, and which contain at least one methionine.

(3) For each of the loaded peptides, read the sequence ID (step 203). If selection criteria have been imposed, then use the information in the peptide structure as well as in the sequence structure (read directly from the sequence information file 1 using the sequence ID) to reject/accept the particular peptide (step 204). If it is not rejected then increase by one the corresponding integer count in the scoring array (step 205).

(4) After completion of the above, for all experimentally observed peaks and for all loaded peptide structures, read from the scoring array the top scoring sequences (up to a predefined value). The score of each one of them signifies the number of actual matched experimental fragments. Experience teaches that the actual sequence can be identified to be the highest scoring one when the latter's score sufficiently separates it from the immediate followers (step 206).

(5) After having selected the top-scoring protein, information about the pattern matching (actual peptides matching, etc) is extracted.

(6) Display the results (step 207).
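The scoring steps 201 to 206 may be sketched as follows. The list of (mass, sequence ID) pairs stands in for the peptide information file, the `accept` callback stands in for the selection criteria, and all names and numeric values are assumptions of this example; post-translational modifications are not handled.

```python
from bisect import bisect_left, bisect_right

def fingerprint_scores(peaks, tolerance, peptides, n_sequences, accept=None):
    """Count, for each sequence, the calculated peptides matching a peak.

    `peptides` is a list of (mass, seq_id) pairs sorted by mass;
    `accept` is an optional predicate on the sequence ID.
    """
    scores = [0] * n_sequences                       # step 201: scoring array
    masses = [mass for mass, _ in peptides]
    for peak in peaks:                               # step 202: mass window
        lo = bisect_left(masses, peak - tolerance)
        hi = bisect_right(masses, peak + tolerance)
        for _, seq_id in peptides[lo:hi]:            # step 203: read sequence ID
            if accept is None or accept(seq_id):     # step 204: selection criteria
                scores[seq_id] += 1                  # step 205: increase the score
    return scores

def top_sequences(scores, limit=3):
    """Step 206: (sequence ID, score) pairs ordered by decreasing score."""
    ranked = sorted(range(len(scores)), key=lambda s: (-scores[s], s))
    return [(s, scores[s]) for s in ranked[:limit] if scores[s] > 0]

peptides = sorted([
    (500.3, 0), (612.4, 0), (733.5, 0),
    (500.3, 1), (850.1, 1),
    (612.4, 2),
])
scores = fingerprint_scores([500.3, 612.4, 733.5], 0.2, peptides, n_sequences=3)
ranking = top_sequences(scores)
```

In this made-up example sequence 0 explains all three peaks and is clearly separated from the two followers, which is the situation in which the top-scoring sequence can be taken as the identification.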

The user input may include:

i. The list of experimentally determined protein fragment masses (mandatory) and the estimated mass error (depending upon sample/machine), also mandatory.

ii. The cleaving agent used (optional if there is only one cleaving agent specified for the second database).

iii. The protein database in which the search will take place (again optional if there is only one possibility).

iv. The number of accepted missed cleavages to be used in the calculations (optional).

v. The charge of the peptide (optional).

vi. The supported post-translational modifications (induced in vivo or in vitro (e.g., during gel electrophoresis)) (optional).

vii. Monoisotopic-to-average mass threshold (optional). The method, depending on the input, uses either monoisotopic or average masses for the calculation of masses.

viii. Species and/or keyword selection (optional).

ix. Protein mass window (optional).

x. Protein isoelectric point window (optional).

3.2.2 MS/MS

The problem addressed by MS/MS is: given the mass and the experimental fragmentation spectrum of a selected peptide (MS/MS or PSD spectrum), predict which peptide, whose parent sequence is present in a first database, gives the closest matching calculated fragmentation pattern. A method in accordance with an embodiment of the present invention includes the following steps (Fig. 4):

(1) Pre-process the experimental input spectrum and extract possible information about its underlying peptide sequence (step 301). Exhaustive examination of the fragmentation spectra can provide information about the actual sequence, thereby reducing the number of candidates. This information may be extracted, depending on the implementation, either by a specifically designed routine that pre-processes the input spectrum or by direct user inspection. The following sequence characteristics can be inferred:

- The presence of certain amino acids for example based on the existence of their immonium ions (Cordero et al. in Anal. Chem. 1993, 65:1594-1601.).

- Substrings of amino acids together with their actual location within the peptide sequence; a so-called sequence tag, allowing the calculation of the masses of the two remaining peptide parts (left and right) (Mann and Wilm in Anal. Chem. 1994, 66:4390-4399.).

- The C-terminal amino acid of the peptide, based on knowledge of the cleaving agent and the presence or absence of certain peptide fragment ions.

The following spectral characteristics can be inferred:

- The presence of Serine or Threonine residues indicated by their specific neutral loss pattern.

- Characterization of specific fragment ion series following chemical modification of the peptide (e.g., b and y series) (see for instance Takao et al. in Rapid Commun. Mass Spectrom. 1991, 5:312-315.).

(2) Using an estimate of a physical attribute of the unidentified protein such as molecular weight or isoelectric point, determine from the sequence information file 1 a list of candidate proteins. The list may be further reduced by using keywords and/or other attributes of the unidentified protein such as taxonomic or lexicographic attributes to eliminate irrelevant protein sequences from the list. From this final list, load in RAM all the relevant peptides, the ones within the allowed mass window; considerations concerning post-translational modifications being handled as in the peptide mass fingerprinting method above (step 302).

(3) Reject peptides that do not meet the selection criteria (if any) or that do not conform to the specific sequence characteristics extracted at step 1.

(4) For each one of the remaining sequences generate the expected theoretical fragmentation spectrum using well-known rules, established in literature (Roepstorff and Fohlman in Biomed. Mass Spectrom. 1984, 11:601.) (step 303).

(5) Reject peptides whose spectra do not conform to the spectral characteristics, as extracted in step 1.

(6) Compare each one of these spectra against the experimental one and sort them according to the number of matched peaks (always within the given mass tolerance) (step 304).

(7) Store the sorted list of the highest scoring ones (higher than a predefined value) (step 305).

(8) Optional: Perform an additional comparison for each of these highest scoring spectra, against the observed one, utilizing a suitable statistical similarity algorithm (step 306).

(9) Optional: Derive a final scoring list based on this second comparison routine (step 307).

The criteria for successful selectivity may be similar to the ones of the MS fingerprinting method described above.

(10) Display the top scoring sequences (step 308).
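Steps 303 and 304 may be illustrated with the following sketch, which generates singly charged b- and y-ions only and ranks candidates by the number of matched peaks. The restriction to these two ion series, the single charge state, the residue mass table (standard monoisotopic values, rounded) and the function names are assumptions of this example; the optional statistical comparison of steps 306 and 307 is not shown.

```python
# Monoisotopic residue masses in Da (rounded standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

def theoretical_fragments(peptide):
    """Singly charged b- and y-ion m/z values of a peptide (step 303)."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(MONO[r] for r in peptide[:i]) + PROTON)
        y_ions.append(sum(MONO[r] for r in peptide[i:]) + WATER + PROTON)
    return b_ions + y_ions

def count_matches(observed, theoretical, tolerance):
    """Number of theoretical peaks found in the observed spectrum."""
    return sum(
        1 for t in theoretical if any(abs(t - o) <= tolerance for o in observed)
    )

def rank_candidates(observed, candidates, tolerance=0.5):
    """Sort candidates by decreasing number of matched peaks (step 304)."""
    scored = [
        (count_matches(observed, theoretical_fragments(p), tolerance), p)
        for p in candidates
    ]
    return sorted(scored, key=lambda item: (-item[0], item[1]))

# Made-up spectrum resembling the fragments of the tripeptide AGK.
observed = [72.04, 129.07, 147.11, 204.13]
ranking = rank_candidates(observed, ["GAK", "AGK", "GKA"])
```

The three candidates have the same composition and therefore the same parent mass, so only the fragment comparison distinguishes them.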

User input may be:

a. The input spectrum data (peptide fragment masses and their intensities) (mandatory).

b. The estimated mass tolerance (mass accuracy) (mandatory).

c. The cleaving agent used (optional if the second database only has one).

d. The protein database in which the search will take place (again optional if there is only one).

e. The mass "window" representing the maximum allowed mass difference within which two fragment peaks are considered identical (optional).

f. The charge of the peptide (optional).

g. The supported post-translational modifications (optional).

h. The ion series (optional). It instructs the method to consider only certain series of fragment ions during derivation of the theoretical spectrum.

i. Monoisotopic-to-average mass threshold (optional).

j. Protein and/or keyword selection (optional).

k. Protein mass window (optional).

l. Protein isoelectric point (optional).

m. Any further selection criteria (optional).

3.3 Linking peptide mass fingerprinting and MS/MS

In many situations, the peptide mass fingerprinting scoring list is not selective enough (the top-scoring sequences are too close to uniquely identify a protein). It can also happen that the actual sequence appears high up in the scoring list but not on the top. Such a situation may arise when, for example, two or more proteins are present in the sample tested experimentally (e.g., 1-D gel separated proteins). One then expects both proteins to appear high up in the scoring list but none of them necessarily will be the top-scoring one.

In such cases an MS/MS experiment may be performed, after selecting a particular protein fragment, hoping that it will clarify the situation. The scoring lists of a peptide mass fingerprint experiment can be combined with the scoring list of one (or more if needed) MS/MS spectral analysis, so as to unambiguously identify the actual protein.

Such a method in accordance with an embodiment of the present invention may consist of a combination of a peptide mass fingerprinting method (as described above with the same inputs) and a MS/MS method (as described above with the same inputs), accompanied by a routine parsing the scoring lists and reporting all candidate sequences occurring in both lists. Then the sequences are sorted according to the sum of the individual scores (or according to a similar function containing the different sorting factors). The resulting scoring list will exhibit selectivity higher than that of the original ones.
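The routine that parses both scoring lists may be sketched as below, using the sum of the individual scores as the sorting function. The dictionaries mapping sequence ID to score, the example values and the name `combine_scores` are assumptions made for illustration.

```python
def combine_scores(fingerprint, msms):
    """Keep sequence IDs present in both lists, sorted by summed score."""
    common = fingerprint.keys() & msms.keys()
    combined = {s: fingerprint[s] + msms[s] for s in common}
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))

# Sequence ID -> score; illustrative values only.
fingerprint = {7: 5, 12: 5, 3: 4}
msms = {12: 9, 3: 2, 40: 8}
combined = combine_scores(fingerprint, msms)
```

In this example sequences 7 and 12 are tied in the fingerprint list, and the MS/MS list resolves the tie because only sequence 12 occurs in both lists.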

3.4 Handling of oligonucleotides such as DNA sequences

When the first database contains oligonucleotides such as DNA sequences an extension of the database structure in accordance with the present invention is preferred. The format of the output files remains the same. The following additional steps are preferably taken:

1. Immediately after the retrieval of each DNA sequence from the initial file (Fig 1A), a software routine is invoked performing oligonucleotide translation for all six reading frames. The viability of all the derived polypeptides is decided upon predefined criteria (e.g., number of translatable residues).

2. When the peptide information file is created by the application of the simulated cleavage step, a flag, indicating the reading frame of the current peptide, is also stored. When the file is of protein sequences the value of this flag is, of course, of no significance. The offset now refers to the offset of the first nucleotide (of the first codon to be translated) in the oligonucleotide subsequence and the length measures the total number of nucleotides (to be translated). No amino acid sequence is stored. The stored information is sufficient to recover its oligonucleotide counterpart during runtime and the increase in execution time of the algorithm is insignificant.
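A minimal sketch of the six-frame translation and of the recovery of a peptide from its stored frame flag, nucleotide offset and nucleotide length is given below. The standard genetic code is used; the numbering of the frames (0 to 2 forward, 3 to 5 on the reverse complement), the convention that offsets on reverse frames are counted along the reverse-complement strand, and the function names are assumptions of this example.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop.
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Return {frame: translation}; frames 3-5 use the reverse complement."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for shift in range(3):
        frames[shift] = translate(dna[shift:])
        frames[shift + 3] = translate(reverse[shift:])
    return frames

def peptide_from_record(dna, frame, offset, length):
    """Recover a peptide from (frame flag, nucleotide offset, nucleotide length)."""
    strand = dna if frame < 3 else dna.translate(COMPLEMENT)[::-1]
    return translate(strand[offset:offset + length])

dna = "ATGGCCAAATAA"
frames = six_frames(dna)
peptide = peptide_from_record(dna, frame=0, offset=0, length=9)
```

Since translation of a short nucleotide stretch is a table lookup per codon, recovering the amino acid sequence at run time adds little to the cost of reading the stored record.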

4. Implementation

Fig. 5 is a schematic representation of a computer system which may be used with the present invention. A computer 10 is depicted which may include a video display terminal 14, a data input means such as a keyboard 16, and a graphic user interface indicating means such as a mouse 18. Computer 10 may be implemented as a general purpose computer, e.g. a UNIX workstation or preferably a personal computer.

Computer 10 includes a Central Processing Unit ("CPU") 15, such as a conventional microprocessor of which a Pentium III processor supplied by Intel Corp. USA is only an example, and a number of other units interconnected via system bus 22. The computer 10 includes at least one memory. Memory may include any of a variety of data storage devices known to the skilled person such as random-access memory ("RAM"), read-only memory ("ROM"), and nonvolatile read/write memory such as a hard disc. For example, computer 10 may further include random-access memory ("RAM") 24, read-only memory ("ROM") 26, as well as an optional display adapter 27 for connecting system bus 22 to an optional video display terminal 14, and an optional input/output (I/O) adapter 29 for connecting peripheral devices (e.g., disk and tape drives 23) to system bus 22. Video display terminal 14 can be the visual output of computer 10, which can be any suitable display device such as a CRT-based video display well-known in the art of computer hardware. However, with a portable or notebook-based computer, video display terminal 14 can be replaced with an LCD-based or a gas plasma-based flat-panel display. Computer 10 further includes user interface adapter 19 for connecting a keyboard 16, mouse 18, optional speaker 36, as well as allowing optional physical value inputs from physical value capture devices such as analysers 40 in an external system 20. The analysers 40 may be any suitable analysers for capturing physical parameters of biomolecules. These analysers may include devices for carrying out cleavage experiments on biomolecules. Additional or alternative analysers 41 for capturing additional physical parameters of biomolecules in an additional or alternative physical system 21 may also be connected to bus 22 via a communication adapter 39 connecting computer 10 to a data network such as the Internet, an Intranet, a Local or Wide Area Network (LAN or WAN) or a CAN.
This allows transmission of physical values, e.g. the results of cleavage experiments or MS/MS experiments, from a near location to a far location, e.g. via the Internet, where a processor carries out a method in accordance with the present invention and returns a parameter relating to the unidentified biomolecule to the near location. The present invention also includes within its scope that the relevant physical values are input directly into the computer using the keyboard 16 or from storage devices such as 23.

Computer 10 also includes a graphical user interface that resides within machine-readable media to direct the operation of computer 10. Any suitable machine-readable media may retain the graphical user interface, such as a random access memory (RAM) 24, a read-only memory (ROM) 26, a magnetic diskette, magnetic tape, or optical disk (the last three being located in disk and tape drives 23). Any suitable operating system and associated graphical user interface (e.g., Microsoft Windows) may direct CPU 15. In addition, computer 10 includes a control program 51 which resides within computer memory storage 52. Control program 51 contains instructions that when executed on CPU 15 carry out the operations described with respect to any of the methods of the present invention.

Those skilled in the art will appreciate that the hardware represented in FIG. 5 may vary for specific applications. The mechanisms of the present invention are capable of being distributed as a program product in a variety of forms. Examples of computer readable signal bearing media include: recordable type media such as floppy disks and CD ROMs and transmission type media such as digital and analogue communication links.

TABLES:

Table 0: (Original input database, Fig 1A)

Table 1: SEQUENCE STRUCTURE (FILE 1, Fig 1B)

Table 2: PEPTIDE STRUCTURE (FILE 2, Fig 1B)

Table 3: PEPTIDE INDEX STRUCTURE (FILE 3, Fig 1B)

Table 4: TAXONOMIC FILES

Table 5: KEYWORD FILES

Claims

1. A method for preparing a computer system having a processor and a data storage device for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the method using a library of known sequences of biomolecules, the method comprising the steps of: generating a plurality of subsequences from the known sequences in the library by simulated action of the cleaving agent on at least selected ones of the known sequences in the library, and associating each subsequence with an identifier, calculating a value for each subsequence of a corresponding physical attribute of that subsequence, sorting all the plurality of subsequence identifiers using the value of the corresponding physical attribute as the sorting criterion, and storing all the plurality of subsequence identifiers and values of the corresponding physical attributes in the sorted condition, including storing identifiers of subsequences and the corresponding values of the physical attribute which have the same value of the corresponding physical attribute if these are present.

2. The method according to claim 1, wherein the corresponding physical attribute is a predicted mass of the subsequence.

3. The method according to claim 2, wherein the mass is at least one of average, monoisotopic, negative ion mode and positive ion mode mass.

4. The method according to any previous claim, further comprising the steps of: comparing a physical attribute of the subsequences with the same physical attribute of molecules produced experimentally using the cleaving agent, and analyzing the subsequences of the selected ones of the sequences in order to rank the selected ones of the sequences based on a matching criterion with respect to the unidentified biomolecule.

5. The method according to claim 4, further comprising the step of generating a subset of the subsequences using the values of the physical attribute of the experimentally determined molecules to select subsequences which match the values of the physical attribute, and then analyzing the subset.

6. The method according to any previous claim, wherein the biomolecule is one of a peptide, a protein, an oligonucleotide and an oligosaccharide.

7. The method according to any previous claim, further comprising the step of associating and storing taxonomic and/or lexicographic information with each subsequence identifier, the taxonomic and/or lexicographic information relating to the sequence from which the subsequence has been derived.

8. The method according to any of the claims 4 to 7, wherein the analyzing step includes comparison of predicted masses of the molecules represented by at least selected ones of the plurality of subsequences with the masses determined from results of the experimental cleaving of the biomolecule.

9. The method according to any of the claims 4 to 8, wherein the analyzing step includes calculating predicted mass spectra from at least selected ones of the plurality of subsequences and comparing the predicted mass spectra with mass spectra of the molecules resulting from the experimental cleaving of the biomolecule.

10. The method according to any previous claim, wherein the at least selected ones of the known sequences in the library are determined in accordance with at least one selection parameter.

11. The method according to claim 10, wherein the at least one selection parameter is one of a taxonomic and a lexicographic parameter.

12. The method according to claim 8 or 9, wherein the at least selected ones of the plurality of subsequences are determined in accordance with at least one selection parameter.

13. The method according to claim 12, wherein the selection parameter is at least one of a taxonomic and a lexicographic parameter of the sequence from which the relevant subsequence is derived, C-terminal amino acid information and amino acid content of any one of the plurality of subsequences.
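
A selection of subsequences by the parameters listed in claim 13 might look like the following sketch. The record layout `Entry`, the function `select_entries` and the exact-match treatment of the taxonomic label are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    peptide: str    # subsequence produced by the simulated cleavage
    source_id: str  # identifier of the sequence the subsequence came from
    taxon: str      # taxonomic label stored for that source sequence


def select_entries(entries: Iterable[Entry],
                   taxon: Optional[str] = None,
                   c_terminal: Optional[str] = None,
                   must_contain: str = "") -> List[Entry]:
    """Keep entries that satisfy every selection parameter that is supplied."""
    selected = []
    for entry in entries:
        if taxon is not None and entry.taxon != taxon:
            continue  # taxonomic parameter of the source sequence
        if c_terminal is not None and not entry.peptide.endswith(c_terminal):
            continue  # C-terminal amino acid information
        if any(residue not in entry.peptide for residue in must_contain):
            continue  # amino acid content of the subsequence
        selected.append(entry)
    return selected


entries = [
    Entry("AK", "P1", "Homo sapiens"),
    Entry("GR", "P1", "Homo sapiens"),
    Entry("MCK", "P2", "Mus musculus"),
]
human_k = select_entries(entries, taxon="Homo sapiens", c_terminal="K")
```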

14. The method according to any previous claim, wherein the simulated cleavage step comprises allowing for at least one missed cleavage.
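
Allowing for missed cleavages, as in claim 14, can be sketched by joining runs of adjacent fully cleaved fragments. The function name `with_missed_cleavages` and its `max_missed` parameter are hypothetical, and the input is assumed to be the ordered list of fragments from a complete simulated digest of one sequence.

```python
from typing import List, Sequence


def with_missed_cleavages(pieces: Sequence[str], max_missed: int = 1) -> List[str]:
    """Return every peptide spanning up to max_missed uncut cleavage sites."""
    peptides = []
    for start in range(len(pieces)):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > len(pieces):
                break  # no further fragments to join at the end of the sequence
            peptides.append("".join(pieces[start:end]))
    return peptides


peptides = with_missed_cleavages(["AK", "GR", "LM"], max_missed=1)
```

With one missed cleavage allowed, the three fragments above yield AK, AKGR, GR, GRLM and LM.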

15. A computer based system for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the computer based system having a processor and a data storage device and access to a library of known sequences of biomolecules, the apparatus comprising: means for generating a plurality of subsequences from the known sequences in the library by simulated action of the cleaving agent on at least selected ones of the known sequences in the library, means for associating each subsequence with an identifier, means for calculating for each subsequence a value of a corresponding physical attribute of that subsequence, means for sorting all the plurality of subsequence identifiers using the value of the corresponding physical attribute as the sorting criterion, and means for storing all the plurality of subsequence identifiers and corresponding values of the physical attribute in the sorted condition, including storing identifiers of subsequences and the corresponding values of the physical attribute which have the same value of the corresponding physical attribute if present.

16. The system according to claim 15, wherein the corresponding physical attribute is a predicted mass of the subsequence.

17. The system according to claim 16, wherein the mass is at least one of average, monoisotopic, negative ion mode and positive ion mode mass.

18. The system according to any of claims 15 to 17, further comprising: means for comparing a physical attribute of the subsequences with the same physical attribute of molecules produced experimentally using the cleaving agent, and means for analyzing the subsequences of the selected ones of the sequences in order to rank the selected sequences based on a matching criterion with respect to the unidentified biomolecule.

19. The system according to claim 18, wherein the means for comparing generates a subset of the subsequences using the values of the physical attribute of the experimentally determined molecules to select subsequences which match the values of the physical attribute, and the means for analyzing analyzes the subset.

20. The system according to any of claims 15 to 19, wherein the biomolecule is one of a peptide, a protein, an oligonucleotide and an oligosaccharide.

21. The system according to any of claims 15 to 20, further comprising means for associating and storing taxonomic and/or lexicographic information with each subsequence identifier, the taxonomic and/or lexicographic information relating to the sequence from which the subsequence has been derived.

22. The system according to any of the claims 18 to 21, wherein the means for analyzing includes means for comparison of predicted masses of the molecules represented by at least selected ones of the plurality of subsequences with the masses determined from results of the experimental cleaving of the biomolecule.

23. The system according to any of the claims 18 to 22, wherein the means for analyzing includes means for calculating predicted mass spectra from at least selected ones of the plurality of subsequences and means for comparing the predicted mass spectra with mass spectra of the molecules resulting from the experimental cleaving of the biomolecule.

24. The system according to any of the claims 15 to 23, wherein the means for generating a plurality of subsequences determines the at least selected ones of the known sequences in the library in accordance with at least one selection parameter.

25. The system according to claim 24, wherein the at least one selection parameter is one of a taxonomic and a lexicographic parameter.

26. The system according to claim 22 or 23, wherein the means for analyzing determines at least selected ones of the plurality of subsequences in accordance with at least one selection parameter.

27. The system according to claim 26, wherein the selection parameter is at least one of a taxonomic and a lexicographic parameter of the sequence from which the relevant subsequence is derived, C-terminal amino acid information and amino acid content of any one of the plurality of subsequences.

28. The system according to any of claims 15 to 27, wherein the means for generating a plurality of subsequences carries out the simulated cleavage allowing for at least one missed cleavage.

29. A computer program product executable on a computer for executing any of the methods of claims 1 to 14.

30. A data carrier having stored thereupon a computer program product executable on a computer for executing any of the methods of claims 1 to 14.

31. A method for analyzing mass spectra of an unidentified biomolecule determined experimentally from cleavage of the molecule with a cleaving agent, the method using a library of known biomolecules, the method comprising the steps of: inputting at a near location the mass spectra of the unidentified molecule; transmitting the mass spectra to a remote location having a processing engine for executing the method in accordance with any of the claims 1 to 14; and receiving at the near location a result of the method in accordance with any of the claims 1 to 14.

32. A biomolecule analysis tool implemented on a computational device including a memory, the tool being of the type which accesses a library of known sequences of biomolecules, the tool comprising: a data structure within the memory which comprises a plurality of identifiers of subsequences, the subsequences being those obtained by simulated action of the cleaving agent on at least selected ones of the known sequences in the library, each subsequence identifier being associated in the data structure with a corresponding value of a physical attribute of that subsequence, the data structure including identifiers of subsequences which have the same value of the corresponding physical attribute if these are present, and the data structure being in a sorted condition having been sorted using the corresponding physical attribute as the sorting criterion.

32. The analysis tool according to claim 31, wherein the memory is one of a random access memory, non-volatile memory, a hard disk, solid state memory, a data carrier.

33. The analysis tool according to claim 31 or 32, further comprising means for calculating predicted masses of molecules represented by the plurality of subsequences.

34. The analysis tool according to any of claims 31 to 33, wherein the selected ones of the known sequences in the library have been selected by each known sequence having a subsequence whose corresponding physical attribute matches the physical attribute of one of the molecules derived by applying the same cleaving agent to the biomolecule in a physical experiment.

35. A computer executed method for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the method using a library of known sequences of biomolecules, the method comprising the steps of: obtaining from the library a first set of data structures representing known sequences of biomolecules; generating a second set of data structures representing subsequences of at least selected ones of the known sequences by simulated action of the cleaving agent on at least selected ones of the known sequences in the first set of data structures; applying a function to the generated subsequences in the second set of data structures, the function predicting a value of a physical attribute of each subsequence; generating an identifier for each subsequence; sorting the identifiers of the subsequences in accordance with the value of the physical attribute associated therewith; receiving experimental results of values of the physical attribute of molecules generated by an operation of the cleaving agent on the biomolecule; using the received results to select out a subset of the subsequence identifiers based on the values of the physical attribute of the received results; and selecting and outputting at least one known sequence of the first set of data structures by comparing values of the physical attribute of the experimental results with the values of the physical attribute of the subset.

36. The method according to claim 35, wherein the comparing step includes the step of analyzing the subset of subsequences of the second set of data structures in order to rank the selected known sequences based on a matching criterion with the selected sequences of the first data structure.
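
The ranking of claim 36 leaves the matching criterion open. One simple possibility, shown here purely as an assumption, is to count how many distinct experimental masses are matched by subsequences of each known sequence and to order the sequences by that count. The name `rank_sequences` and the tolerance value are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Entry = Tuple[float, Tuple[str, int]]  # (mass, (sequence id, fragment number))


def rank_sequences(subset: Sequence[Entry],
                   observed_masses: Sequence[float],
                   tolerance: float = 0.5) -> List[Tuple[str, int]]:
    """Rank known sequences by the number of distinct observed masses matched."""
    matched: Dict[str, Set[int]] = defaultdict(set)
    for mass, (seq_id, _fragment) in subset:
        for k, observed in enumerate(observed_masses):
            if abs(mass - observed) <= tolerance:
                matched[seq_id].add(k)  # record which observation was explained
    scores = [(seq_id, len(hits)) for seq_id, hits in matched.items()]
    # Highest count first; ties are broken by sequence identifier.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


subset = [(217.14, ("P1", 0)), (217.14, ("P2", 1)), (231.13, ("P1", 1))]
ranking = rank_sequences(subset, [217.1, 231.2])
```

Here the sequence P1 accounts for both experimental masses and P2 for only one, so P1 is ranked first.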

37. The method according to claim 35 or 36, wherein the biomolecule is one of a peptide, a protein, an oligonucleotide and an oligosaccharide.

38. The method according to any of claims 35 to 37, further comprising the step of storing the second set of data structures.

39. The method according to any of claims 35 to 38, further comprising the step of generating the second set of data structures by combining taxonomic and/or lexicographic information with each generated subsequence, the information relating to the known sequence from which the subsequence has been derived.

40. The method according to any of the claims 35 to 38, wherein the function includes calculating predicted masses of the molecules represented by at least selected ones of the plurality of subsequences of the second set of data structures and the experimental results of the operation of the cleaving agent on the biomolecule include the masses determined from the experimental cleaving of the biomolecule.

41. The method according to any of the claims 35 to 38, wherein the function includes calculating predicted mass spectra from at least selected ones of the plurality of subsequences of the second set of data structures, and the experimental results of the operation of the cleaving agent on the biomolecule include mass spectra of the molecules resulting from the experimental cleaving of the biomolecule.

42. The method according to any of claims 35 to 41, wherein the at least selected ones of the known sequences in the library are selected in accordance with at least one selection parameter.

43. The method according to claim 42, wherein the at least one selection parameter is one of a taxonomic and a lexicographic parameter.

44. The method according to claim 40 or 41, wherein the at least selected ones of the plurality of subsequences of the second set of data structures are selected in accordance with at least one selection parameter.

45. The method according to claim 44, wherein the selection parameter is at least one of a taxonomic and a lexicographic parameter of the sequence from which the relevant subsequence is derived, C-terminal amino acid information and amino acid content of any one of the plurality of subsequences.

46. The method according to any of the claims 35 to 45, wherein the simulated action of the cleavage agent comprises allowing for at least one missed cleavage.

47. A method for analyzing results determined experimentally after cleavage of a biomolecule with a cleaving agent, the method using a library of known biomolecules, the method comprising the steps of: inputting at a near location the experimental results of the action of the cleaving agent on the unidentified biomolecule; transmitting the results to a remote location having a processing engine for executing the method in accordance with any of the claims 35 to 46; and receiving at the near location a result of the method in accordance with any of the claims 35 to 46.

48. A data structure recorded on a computer readable data carrier, the data structure comprising a plurality of identifiers of subsequences, the subsequences being those obtained by simulated action of the cleaving agent on at least selected ones of the known sequences in a library of biomolecules, each subsequence identifier being associated in the data structure with a corresponding value of a physical attribute of that subsequence, the data structure including identifiers of subsequences which have the same corresponding value of the physical attribute if these are present, and the data structure being in a sorted condition, the corresponding physical attribute having been used as the sorting criterion.

49. A computer based system for analyzing an unidentified biomolecule after cleavage of the biomolecule with a cleaving agent, the computer based system having access to a library of known sequences of biomolecules, the system comprising: means for obtaining from the library a first set of data structures representing known sequences of biomolecules; means for generating a second set of data structures representing subsequences of at least selected ones of the known sequences by simulated action of the cleaving agent on at least selected ones of the known sequences in the first set of data structures; means for applying a function to the generated subsequences in the second set of data structures, the function calculating a value of a physical attribute of each subsequence; means for generating an identifier for each subsequence; means for sorting the identifiers of the subsequences in accordance with the values of the physical attribute associated therewith; means for receiving experimental results of values of the physical attribute of the molecules generated by an operation of the cleaving agent on the biomolecule; means for using the received results to select out a subset of the subsequence identifiers based on the values of the physical attribute of the received results; and means for selecting and outputting at least one known sequence of the first set of data structures by comparing the values of the physical attribute of the experimental results with the values of the physical attribute of the subset.

50. The system in accordance with claim 49, wherein the means for selecting and outputting includes means for analyzing the subset of subsequences of the second set of data structures in order to rank the selected known sequences of the first set of data structures based on a matching criterion with the selected sequences of the first data structure.

51. A computer program product executable on a computer for executing any of the methods of claims 35 to 46.