WO2008104746A1

WO2008104746A1 - Signature peptide identification

Info

Publication number: WO2008104746A1
Application number: PCT/GB2008/000607
Authority: WO
Inventors: Ian Humphery-Smith
Original assignee: Biosystems Informatics Institute
Priority date: 2007-02-27
Filing date: 2008-02-21
Publication date: 2008-09-04
Also published as: GB0703793D0

Abstract

The present invention relates to a computer implemented method for identifying unknown signature peptides in a database, each signature peptide having a number of residues, comprising the steps of (i) producing a look-up table containing identification numbers to locate every occurrence of a known signature peptide from the database; (ii) identifying identical signature peptides by matching each residue of an unknown peptide with a corresponding residue of a peptide from the database; (iii) storing the identified signature peptides in a storage device; and (iv) displaying the identified signature peptides that are within a predetermined range of residues. The invention also relates to a data processing apparatus for use in such a method, a computer program code executable by a data processing device to carry out the method, to a subunit vaccine containing peptides identified as signature peptides using the computer implemented method and to a method of producing a subunit vaccine using the methodology.

Description

SIGNATURE PEPTIDE IDENTIFICATION

The present invention is concerned with a computer-implemented method for identifying signature peptides and which signature peptides may be utilised in the development of multi-subunit vaccines.

The advent of large-scale genomic sequencing in the past decade, combined with the use of bioinformatics, has resulted in the accumulation of huge quantities of data. In order to derive benefit from this data, it needs to be maximally exploited. The current invention relates to a computer-implemented method for the identification and analysis of small amino acid strings of both known and unknown functional significance, called signature peptides. In particular, the method of the current invention can provide a program that is able to rapidly and accurately identify relevant signature peptides from a wide variety of sources. Instead of merely characterising protein samples, the current invention can provide value-added information in the form of highlighted regions of relevance to protein function which can then be used, for example, to select proteins for inclusion in multi-subunit vaccines.

Existing bioinformatics technologies identify proteins by virtue of subdomains known as domains and modules. Just as the use of molecular biology techniques has revealed that a eukaryotic gene is not continuous, but is divided into subregions, known as introns and exons, proteins are themselves composed of domains and modules. These individual subunits, rather than the entire open reading frame (ORF), are known to contain, and essentially determine, the biological functions of the gene or protein. Modules usually consist of between 40 and 100 amino acid residues and maintain a stable structure when isolated from the native protein. They are usually recognised as contiguous sequence motifs defined by a common subset of conserved residues that are necessary to form the domain core and preserve a given fold.

Domains are elements of overall structure that are self-stabilising and often fold independently of the rest of the peptide. Many domains are not unique to a particular protein but appear in a variety of proteins and are often so-named because they have a significant role in the biological function of the protein that they are part of. An example of a domain is the WD40 domain which can be found in at least 50 different eukaryotic proteins. Proteins may contain up to a few hundred modules, which may occur in long tandem repeats, but may contain only one or two or more domains. Modules and domains can be further defined by virtue of their constituent motifs, which are small conserved signature regions with important functional roles. Motifs are characterised by a specific arrangement of evolutionary conserved amino acids in a protein.

The occurrence of similar domains in different proteins may be as a result of either divergent or convergent evolution. However, duplication of ancestral genes is thought to play an important role in the divergent evolution of genes, as it enables one copy to retain the original function, whilst the second copy differentiates and acquires an alternative function through point mutations, homologous exchange or chromosome rearrangements. This suggests that the domains may have been used as building blocks in the construction of new large proteins, facilitated by exon shuffling or intron re-arrangement at the genetic level.

Proteins with no known homology are likely to be divergent proteins too distantly related to known sequences in databases to have retained similarity. All proteins, however, probably share some common ancestry if one goes far enough back in evolution. Therefore, given the huge accumulation of protein sequences in current databases, it may be expected that some proteins, with no obvious sequence resemblances to any other, share some residues that could represent footprints of ancient common ancestries or recombination with self by intent or by error through evolutionary time.

Instead of using domains and modules as the distinguishing characteristic of proteins, the present invention characterises protein samples by virtue of their signature peptides. Signature peptides are specific short peptide fragments, for example of 6 to 20 amino acids in length, that occur multiple times in proteins, and which characterise a particular type of protein. Signature peptides occur more often than a random string of amino acids, and this higher than expected frequency may be indicative of the importance of a particular function. Furthermore, their occurrence is probably demonstrative of building blocks with inherent structural and/or functional significance within protein molecules which have been used time and time again by evolution to undertake molecular work.

The appearance of these signature peptides in non-related proteins could be due to either sequence convergence, descent from a common ancestor, or a result of gene duplication and/or a variety of other copying errors during chromosomal replication, just as for the evolution of modules and domains. Whatever their origin, signature peptides facilitate the identification of relationships between proteins with patterns of likely biological significance. These patterns may help cast light on the origin and/or function of proteins, including those with no known motifs or homologs.

There already exist bioinformatics search tools for identifying particular regions of peptides, including the BLAST and ClustalW methods. BLAST, a local similarity search tool, works by pre-screening a database for regions of conserved but not identical sequence, and then extending any regions above a threshold to produce alignments. The statistics of BLAST are geared towards long alignments, and so BLAST does not work optimally with short sequences. It supports substitution matrices and, if necessary, includes gaps within the peptide sequence to enhance the sensitivity of the match. However, to enhance performance, it includes filters to remove areas of low complexity and so offers no guarantee to report all hits.

ClustalW is a tool used to align a number of sequences. The program works by grouping similar sequences into clusters, treating them as a single entity and then aligning the clusters against each other to produce a final alignment. ClustalW is based on pairwise alignment using Needleman-Wunsch, and when aligning the clusters, it aims to optimise the overall alignment. The sequence order is very important, as any out-of-order regions will appear as mismatches.

According to a first aspect of the present invention, there is, however, provided a computer implemented method for identifying unknown signature peptides in a database, each signature peptide having a number of residues, comprising the steps of: i) producing a look-up table containing identification numbers to locate every occurrence of a known signature peptide from the database; ii) identifying identical signature peptides by matching each residue of an unknown peptide with a corresponding residue of a peptide from the database; iii) storing any identified signature peptides in a storage device; and iv) displaying the identified signature peptides that are within a predetermined range of residues.

In addition there is also provided a data processing apparatus for identifying unknown signature peptides in a database, each signature peptide having a number of residues, comprising: i) a look-up table containing identification numbers to locate every occurrence of a known signature peptide from the database; ii) identifying means for identifying identical signature peptides by matching each residue of an unknown peptide with a corresponding residue of a peptide from the database; iii) a storage device for storing any identified signature peptides; and iv) display means for displaying the identified signature peptides that are within a predetermined range of residues.

A polypeptide is a polymer of amino acid residues joined by peptide bonds, whether produced naturally or synthetically. Polypeptides of fewer than 10 amino acid residues are commonly referred to as peptides. The present invention identifies signature peptides of, for example, 6 to 20 amino acid residues, which may also be referred to as polypeptides. A protein is a macromolecule comprising one or more polypeptide chains.

The identification of particular combinations of signature peptides may advantageously allow the function of hypothetical proteins to be discovered, even in proteins with low levels of sequence similarity to known proteins. Significantly, these signature peptides may be indicative of protein function and because individual proteins may contain several signature peptides, each possibly containing different ancestral origins, this information can be used to create an image of the functional potential of a given gene product. The similarity of signature peptides in various genes within a genome can be used to rank genes on a sliding scale from those displaying a high similarity of signature peptides to those displaying a low similarity of signature peptides. Genes with a high similarity of signature peptides may be used to intelligently select genes for inclusion in multiple subunit vaccines or for selecting targets for intervention strategies, according to the current invention. The present invention is therefore relevant to next generation vaccination strategies.

Therefore, according to a further aspect of the present invention, there is provided a multi-subunit vaccine containing polypeptides that have been identified by virtue of their possession of signature peptides using the computer implemented method of the invention.

Multiple subunit vaccines may be defined as those containing one or more semi-pure antigens. Advantages of using multiple subunit vaccines rather than whole protein vaccines include increased safety, ability to target the vaccine to the site where immunity is required and the ability to differentiate vaccinated animals from infected animals (through inclusion of marker antigens). Furthermore, in whole protein vaccines, although there are multiple immunogens, the majority of these immunogens are poorly expressed and thereby poorly immunogenic. Therefore, the humoral and cellular response caused by whole protein vaccines is not always to the same immunogens. In the development of multiple subunit vaccines, the critical step is the identification, from a myriad of proteins of the pathogen, of the particular individual components that are involved in inducing a sufficient level of protection. However, since the inclusion of a variety of antigens in the vaccine may result in a number of undesired effects, from immuno-suppression to disease enhancement, it is critical to identify those particular antigens that are necessary for inducing protection and to eliminate others. Combining the identification of signature peptides according to the current invention with our understanding of pathogenesis therefore enables the identification of specific antigens from most pathogens that are critical for inducing the appropriate immune responses. The targeting of genes rich in signature peptides offers a means of simultaneously targeting multiple genes. Genes having multiple shared signature peptides suggest that an intervention directed against such genes is also likely to have an effect on genes sharing similar vulnerable regions within other gene products, such as the site being targeted by the immune defences of a suitably primed host.

In addition to vaccine development, there are other potential applications of the method of the present invention, including identifying the function of genes, exon calling and ORF detection, confident prediction of intergenic and intron sequences of incomplete genomes, ranking of single nucleotide polymorphisms (SNPs) with respect to likelihood of functional consequences, interaction patch prediction of relevance to protein-protein docking, and informed site-directed mutagenesis and gene perturbation studies.

Advantageously, according to the tool of the current invention, the order of hits is unimportant unlike in ClustalW, and the statistics are optimised for short peptide hits so where peptides pass through the filters every occurrence of a peptide will be reported, unlike in BLAST. The tool according to the current invention has an efficient algorithm for finding identical regions, is scalable, is able to run on an arbitrary number of CPUs in a cluster and is exact.

The invention has many benefits, which include searches that are conducted independently of:

• gap penalties

• low complexity sequences (which can be selectively ignored or handled separately)

• size of query sequence

• order

• direction of information (as both DNA strands are scanned).

The current invention ignores long strings of high similarity information by homology filtering. Therefore, highly similar sequence information, such as genomic sequences from chimpanzees or humans, is of little help to gene detection as the open reading frames, intergenic and intronic sequences do not differ substantially from one another. The current invention is able to detect conserved sequence information with high selectivity, but since it works in a different way to existing techniques, it can also be complementary to traditional bioinformatics.

An embodiment of the invention will now be described in detail, by way of example only, and with reference to the accompanying drawings, in which:

Figure 1 illustrates the tool of the invention recognising conserved functional elements. Gene ii according to Figure 1 is functionally similar to gene i, and although it will be detected by BLAST, it will score badly. However, the tool scores conservation of information and so gene ii would score highly using the tool.

Figure 2 is a graphical comparison of the use of the tool of the invention in tomato EST contig analysis compared to the prior art tool BLAST.

Figure 3 is a process flow chart illustrating the functionality of a main process carried out by the tool of the present invention.

Figure 4 is a process flow chart illustrating a sub process used by the main process illustrated in Figure 3.

Figure 5 is a process flow chart illustrating a further sub process used by the main process illustrated in Figure 3.

The present invention provides a method and apparatus (in combination sometimes referred to herein as a tool) that uses an advanced proprietary algorithm, enabling users to delve deeper into homology mapping than previous bioinformatics tools. The tool, sometimes referred to herein as TSS, has been developed to deliver information-based biological relationships that provide the most comprehensive homology screen of functional annotations for DNA and proteins. Moreover, the tool of the invention allows a detailed analysis of regionalised functional potential within a given gene or protein and, unlike existing tools, does not depend on BLAST, FastA or Smith-Waterman to detect homologies during first-pass analyses. The TSS tool is integrated with non-proprietary data sources, including secondary and tertiary structure and Prosite and EBI's UniProt annotations.

The tool of the invention works by scanning a database to find every occurrence of identical signature peptides that match between an unknown peptide sequence and the database sequences. It looks only for common signature peptides that are perfect matches and are between a minimum and maximum length (the default range is set to 6 to 20 amino acids). However, the program can also be configured to report sequences ranging from 4 protein residues to any arbitrary length. If the minimum value is set to 4 protein residues, the program will report a huge number of matches and will be very slow on large data sets.

As mentioned above, the program reports all the peptides with a maximum of 20 protein residues as a default. However, the program is not set to have an upper limit of protein residues so if desired the maximum length of an unknown sequence can be set as a maximum value and the program will report 100% aligned runs as well as shorter peptides.

Any matching region outside the set values will not be reported in the results but it will be stored in a file. For instance, a matching region that is 21 protein residues long has to be stored to prevent a 20 protein residues piece of it being reported as unseen previously.

The nature of biological sequences means that it is sometimes necessary to filter some sequences for complexity and significance. Accordingly, the tool of the present invention implements such features as options in order to improve performance on large data sets. Advantages of the current invention are as follows:

• Comprehensive function-centric gene annotations

• Enables exon calling and ORF detection

• Confident prediction of intergenic and intron sequences of incomplete genomes

• Enhanced proteomic outcomes

• Consistently providing reliable functional annotations to Hypothetical Proteins

• Ranking of SNPs with respect to likelihood of functional consequences

• Interaction patch prediction of relevance to protein-protein docking

• Allows informed Site-Directed Mutagenesis and gene perturbation studies

• Superior resolution and selectivity when benchmarked against best-in-class bioinformatics tools for homology and gene detection.

The invention relates to identifying signature peptides by comparing an unknown protein sequence against the known protein sequences in the database to find the areas of commonality. The sequences need to be identical and gap-free, which means it is possible to solve the problem using fast lookup methods such as suffix trees, indexes or hashes. The result is a list of known peptides and their location in the database.

It should be noted, when using or implementing the above-described tool, that different computer architectures hold their data in different ways and for the most part this is not a problem. However, when loading binary data, it is possible that the program may need to deal with data that was stored on a different machine. For example, the PowerPC processor (as found in the Apple Macintosh) is big-endian and the Intel x86 (as found in IBM PC Compatibles) is little-endian. This means that a 32-bit integer on a PowerPC has its four bytes arranged in the expected order, most significant byte first, whereas on an x86 processor the byte order is reversed, as can be seen in the table below. If the program encounters data packed on a machine of a different endianness, it rearranges the bytes appropriately; for example, an x86 loading data packed by a PPC would flip the bytes as shown in this table.
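The byte rearrangement described above can be illustrated with a minimal Python sketch. It is not the code of the tool; the function names `read_u32` and `swap_u32` are hypothetical, and the sketch assumes the byte order of the stored data is known when it is loaded.

```python
import struct

def read_u32(raw: bytes, stored_big_endian: bool) -> int:
    """Decode a 4-byte unsigned integer written with a known byte order."""
    fmt = ">I" if stored_big_endian else "<I"
    return struct.unpack(fmt, raw)[0]

def swap_u32(value: int) -> int:
    """Reverse the four bytes of a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")

# Bytes as a big-endian machine would write the value 0x0A0B0C0D.
packed_big_endian = struct.pack(">I", 0x0A0B0C0D)
value = read_u32(packed_big_endian, stored_big_endian=True)
```

Decoding with an explicit byte order, rather than the byte order of the loading machine, gives the same value on either architecture.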

In order to achieve the level of performance required to make this process practical for regular use, fast peptide matching is used. A traditional fast dynamic programming method would take decades to perform a comparison of all currently known sequences, whereas ideally the solution would take weeks or days. To achieve this, the invention uses a combination of fast peptide matching and parallel processing. Parallel processing can be performed by using a number of CPUs together or a network of computers that run simultaneously.

In this case, a cross-platform parallel processing library, MPI, is used which allows programs to be distributed across a cluster of CPUs. One of the simplest and most efficient methods of using a cluster is to have a farming harness. In a farming harness, a master CPU controls a number of slave CPUs by distributing jobs to workers and gathering their results together. The farmer, which is the master CPU, waits for a request from any worker or slave and keeps sending jobs until the entire database has been searched. It then sends a signal to the workers to tell them to send back their final totals. Communication is kept short to limit the network overhead, and results are written to a shared file system because that is not a blocking operation, i.e. one in which a processor cannot do any work until it has an answer.
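The farming harness can be sketched in a few lines of Python. This sketch substitutes threads and in-process queues from the standard library for MPI and a cluster, so it shows only the control flow (jobs handed out on request, a stop signal, final totals returned); the names `farm` and `work` are hypothetical.

```python
import queue
import threading

def farm(jobs, work, n_workers=4):
    """Hand out jobs to workers on request and gather their final totals."""
    job_queue = queue.Queue()
    totals = queue.Queue()
    for job in jobs:
        job_queue.put(job)
    for _ in range(n_workers):
        job_queue.put(None)  # stop signal: no more jobs, send back totals

    def worker():
        total = 0
        while True:
            job = job_queue.get()  # worker requests the next job
            if job is None:
                break
            total += work(job)
        totals.put(total)  # one short message per worker at the end

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(totals.get() for _ in range(n_workers))

# Example: count residues across three query sequences with two workers.
residue_total = farm(["AAAA", "CC", "G"], len, n_workers=2)
```

Because each worker pulls a new job only when it is idle, faster workers naturally take more jobs, which is one reason the scheme suits clusters of mixed performance.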

The program could also run efficiently on a cluster made up of machines of different performance levels, or even distributed across multiple clusters such as the Apple and IBM clusters thanks to the endian swapping code discussed earlier.

It is worth considering that parallel execution of the program will very likely run into file access bottlenecks. The primary bottleneck will be the reading of the database, especially when the parallel jobs are of great size with a great number of workers. However, there are key points to avoid the bottlenecks:

• If the database is small, it should reside in RAM

• If the query set or unknown sequence is small, run the search on a single CPU or at most a small number and no more than four.

• If the database is large, distribute a copy to each node when running a parallel job.

• Larger jobs will benefit more from additional CPUs and will be able to scale near linearly.

• Where possible, use complexity and expectation filters, especially with large databases.

Figure 3 is a process flowchart illustrating data processing operations carried out by the main process of the method of the present invention.

The method starts by reading parameters, namely sequence files and maximum and minimum peptide length filters, and by loading the database and the unknown sequence or query from a memory device, such as a disk. The protein database is pre-processed to be suitable for the hash search by producing a hash look-up table. In other words, each signature peptide sequence in the database is broken into blocks of four residues and sorted accordingly by using a hash algorithm. This algorithm is used because a four-residue block can be quickly turned into a unique identification number, which identifies the table position to look up for that block in the database.

Since there are 20 different types of amino acid residue and each sequence is broken into blocks of four residues, there are 160000 (20^4) elements in the produced hash look-up table. This table also contains the list of the location of every occurrence of each block of four residues in the database.

For instance, the four residues AAAA may occur 10000 times and the entry for AAAA will contain the location of every AAAA in the database. This means that if the program wants to locate the position of every AAAA, it looks it up in the hash look-up table, which is a quick operation.
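A minimal Python sketch of such a look-up table follows. It uses a dictionary keyed by the four-residue block itself rather than by a computed identification number, which is an assumption made for brevity; `build_lookup` and the `(sequence index, offset)` location format are hypothetical.

```python
from collections import defaultdict

def build_lookup(database, k=4):
    """Map every k-residue block to the locations where it occurs."""
    table = defaultdict(list)
    for seq_id, seq in enumerate(database):
        # Blocks overlap: one block starts at every residue position.
        for offset in range(len(seq) - k + 1):
            table[seq[offset:offset + k]].append((seq_id, offset))
    return table

database = ["AAAAA", "MKAAAA"]
table = build_lookup(database)
locations_of_aaaa = table.get("AAAA", [])
```

Finding every occurrence of AAAA is then a single dictionary access, however large the database is.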

The tool or method uses a hash function to convert each 32-bit block to an integer because an exact hash uniquely identifies each four-residue combination (also called a starting point).

More specifically, the hash function takes four residues represented as an unsigned integer in four parts. Each part holds one residue of the peptide sequence in 8 bits, so the four parts together make up a 32-bit integer. However, each residue needs only 5 bits, so the values are arranged to lie in a range between 0-25 (26 residues). This is performed by masking out the bits above bit 5 (hence the hexadecimal mask 1F), and the remaining bits are set to zero. The binary number 01 is then subtracted from each 5-bit residue and the four residues are copied. The copies are masked and rotated to eliminate unneeded bits. The results are finally merged together and returned as a 32-bit unsigned integer that is unique to the four residues of the original sequence.

As previously explained, the tool and method scan the database by using the hash look-up table to find every occurrence of identical signature peptides that match between an unknown peptide sequence or query and the database sequences.
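One possible reading of the hash function for four-residue blocks is sketched below in Python. It assumes residues are ASCII letters, masks each character with 0x1F, subtracts one to obtain a value from 0 to 25, and packs the four 5-bit values side by side. The exact copy, mask and rotate sequence of the tool is not reproduced, and `block_to_id` is a hypothetical name.

```python
def block_to_id(block: str) -> int:
    """Pack four residue letters into one integer, 5 bits per residue."""
    if len(block) != 4 or not (block.isascii() and block.isalpha()):
        raise ValueError("expected exactly four ASCII letters")
    value = 0
    for ch in block.upper():
        code = (ord(ch) & 0x1F) - 1   # 'A' -> 0 ... 'Z' -> 25
        value = (value << 5) | code   # append 5 bits to the result
    return value

first = block_to_id("AAAA")
last = block_to_id("ZZZZ")
```

Because each letter maps to a distinct 5-bit value and the four fields do not overlap, two different blocks can never produce the same number.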

The query can be a single sequence or an entire database to be compared against other databases, and is in amino acid (AA) or DNA format. If the query sequence is in DNA format, then the DNA query is translated into protein in all six possible frames so that the first four residues (the first starting point) can be selected, S1 (Figure 3). That is, there are six possible predicted protein sequences, known as reading frames, resulting from such a sequence query. These are the three forward frames and the three reverse sense frames of the intermediate RNA sequence.
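Six-frame translation of a DNA query can be sketched as follows. The sketch assumes the standard genetic code, writes stop codons as `*` and any codon containing an unrecognised base as `X`; the function names are hypothetical and the tool's own handling of these cases is not stated in the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the start of the string."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]

frames = six_frames("ATGGCC")
```

Each of the six resulting protein sequences can then be searched in the same way as an amino acid query.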

However, if the query set is not in DNA format, no translation is needed and the query set goes straight to the selection of starting points and then to the optional complexity filter.

The next sub process is to align or match the unknown sequence against the hash table using a unique number. This hash-align sub process comprises a number of steps and is illustrated in Figure 4.

After each starting point is selected, it is possible to use a complexity filter to determine whether the unknown sequence is sufficiently complex S2.

The complexity filter prevents the appearance of peptide sequences that are made up of only a few different residues. When this filter is used, a sequence must contain at least three different residues to be displayed in the final result. Many sections of the database are very repetitive and this filter limits their impact. Although it is possible to run the program without filters, this may result in very large files, which can be difficult to manage.
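The complexity test reduces to counting distinct residues, as in this minimal sketch. The threshold of three comes from the text; the function name `is_complex` and the `min_distinct` parameter are hypothetical.

```python
def is_complex(peptide, min_distinct=3):
    """Return True if the peptide contains enough different residues."""
    return len(set(peptide)) >= min_distinct

repetitive = is_complex("AAAAGA")   # only A and G are present
varied = is_complex("AGSAGS")       # A, G and S are present
```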

If the query set passes the optional complexity test S3, the four characters, totalling 32 bits, are converted to an integer S4. Since the first starting point, converted into a 32-bit integer, gives a unique number for the four residues in the query set, their positions can now be found in the hashed look-up database. Because the database to be searched against was also converted into a list of blocks of four, it becomes a simple matter to look up a block in the look-up table and know where every occurrence of that block is in the database.

The tool can then jump to each of these locations and scan from there to identify the longest peptide sequences. However, if the query set did not pass the complexity test, it goes directly to step S7 where the hash alignment is ended if the query set is not DNA.

When the first selected starting point is found in the database, the program starts comparing the unknown sequence against all of the occurrences of the first starting point in the database by extending the sequences residue by residue and as far as possible within a predetermined min-max range S5. These steps are repeated for each starting point until the end of the unknown sequence.

The consecutive starting points are selected by ignoring the first residue in the previous selected starting point and then by selecting the next four residues in the unknown sequence.

These starting points can then be used to extend the peptide sequence for as long as the residues match. Each matching peptide sequence is stored so that when the program moves forward to the next starting point it does not find the same peptide sequence. However, when the matched sequence extends beyond the max value, the sequence is stored, but not displayed, so that no future sequences shorter than the max value with the same matched sequence are reported. Moreover, only the matched part of the sequence is stored and not the whole sequence.
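The selection of starting points, extension and storing of matches can be sketched together in Python. This is an illustration under stated assumptions, not the tool's implementation: a dictionary stands in for the hash look-up table, stored matches are tracked per database sequence and alignment diagonal, and the name `shared_peptides` is hypothetical.

```python
from collections import defaultdict

def shared_peptides(query, database, min_len=6, max_len=20, k=4):
    """Report exact shared peptides whose length is within [min_len, max_len]."""
    table = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((seq_id, pos))

    covered = {}   # (seq_id, diagonal) -> query index where stored match ends
    reported = []
    for start in range(len(query) - k + 1):   # slide one residue at a time
        for seq_id, pos in table.get(query[start:start + k], ()):
            key = (seq_id, pos - start)
            if covered.get(key, 0) > start:
                continue   # piece of a match that was already stored
            seq = database[seq_id]
            length = k
            while (start + length < len(query)
                   and pos + length < len(seq)
                   and query[start + length] == seq[pos + length]):
                length += 1   # extend residue by residue while they match
            covered[key] = start + length   # stored even when out of range
            if min_len <= length <= max_len:
                reported.append((query[start:start + length], seq_id, pos))
    return reported

query = "XXMKTAYIAKQRXX"
database = ["GGMKTAYIAKQRGG", "MKTAPP"]
hits = shared_peptides(query, database)
```

With the default range the ten-residue match is reported once. With `max_len=8` it is stored but not reported, and none of its shorter pieces are reported either.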

When the program determines that a new peptide sequence has been found between minimum and maximum values, then it checks whether the new sequence has a unique matched sequence S6. Each matched sequence is then compared against previously found sequences.

Sometimes it may be advantageous to compute a significance for each new found peptide sequence. This is done by using an expectation filter S8 which in general terms works by calculating the expected frequency of a sequence and blocking sequences that are expected to occur by chance such as those that are short and made up of frequently occurring amino acids. This only occurs if the optional expectation filter is selected. In particular, the program records the frequency of each type of amino acid in the database. When a sequence is found the probability is calculated by taking each letter in the sequence, calculating its probability by dividing the frequency of that amino acid by the total number in the database and then multiplying that value by the value for the previous letter in the sequence. Finally, the accumulated number is multiplied by the total number of residues. Thus, longer sequences become less probable and thus more interesting. The expectation is expressed as a percentage. This expectation can then be divided by the total number of times a sequence is found such that a sequence that is unexpected and yet occurs very frequently has a much smaller number associated with it than a sequence that is expected by chance and does not happen very often.
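The expectation calculation can be sketched as below. The sketch follows the steps in the text (per-residue frequency, running product, multiplication by the total number of residues, division by the observed count) but omits the conversion to a percentage, whose exact form is not given; the function names are hypothetical.

```python
from collections import Counter

def expectation(peptide, database):
    """Expected number of chance occurrences of a peptide in the database."""
    counts = Counter("".join(database))
    total = sum(counts.values())
    probability = 1.0
    for residue in peptide:
        probability *= counts[residue] / total   # running product
    return probability * total

def significance(peptide, database, observed):
    """Smaller values mean unexpected yet frequently observed peptides."""
    return expectation(peptide, database) / observed

toy_database = ["AAAC"]                 # A occurs 3 times, C once
expected_ac = expectation("AC", toy_database)
score_ac = significance("AC", toy_database, observed=3)
```

A filter would then block peptides whose value exceeds a chosen threshold; the threshold itself is a user choice not specified here.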

If the new sequence passed the expectation filter, the result is stored in a storage device and a results file is produced. Otherwise, the method can either continue to the next step or repeat steps 1 to 6 if the query set is DNA.

When no new sequence has been found and if the query set is DNA, the same process starts over again for all the six frames starting with the selection of starting points.

The alignment process ends when a predetermined number of sequences is fulfilled and control returns to the main process.

Returning to Figure 3, the next step involves an optional sorting process, which involves reading and sorting alphabetically a list of sequences found but without considering their frequency in order to make it easier to identify all unique sequences.

Figure 5 shows a process flow chart illustrating the Hash_count sub process for counting the unique sequences produced by the sorting process in greater detail.

As for the alignment process, the counting process converts any DNA query set to protein before proceeding with the next step. The following steps S3 to S5 are identical to those performed in the alignment process. However, S5 is easier in the counting process than in the alignment process because the tool is looking for an exact match of known length. Of course, the hash look-up table only provides the first four letters, so the rest of the sequence still needs to be checked for matching.

When the length of the sequence matches the requested sequence, then the result is stored in a storage device. When the length is different, any other starting points in the database are considered and new sequences in the query set where the length matches exactly are also stored in the storage device together with every occurrence of a sequence in the database.

Processing then returns to the main process, the results from the counting process are read, and a final XML output is created and stored in the storage device.

The advantage of using an XML file is that it can easily be parsed and moved into a true SQL database if desired, or it can simply be used as a single file. Either way, the data is easily accessible, and tags have been specified that describe the parameters used for the search and everything needed to identify which peptides occurred and where.
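The text does not list the tags used, so the following Python sketch invents a small schema purely for illustration. The element and attribute names (`search`, `parameters`, `peptide`, `occurrence` and so on) are hypothetical, as are the function names. It shows how search parameters and peptide locations could be written to XML and parsed back with the standard library.

```python
import xml.etree.ElementTree as ET


def results_to_xml(params, hits):
    """Serialise search parameters and peptide locations (hypothetical tags)."""
    root = ET.Element("search")
    params_el = ET.SubElement(root, "parameters")
    for name, value in params.items():
        ET.SubElement(params_el, "parameter", name=name, value=str(value))
    for peptide, locations in hits.items():
        pep_el = ET.SubElement(
            root, "peptide", sequence=peptide, count=str(len(locations))
        )
        for protein_id, offset in locations:
            ET.SubElement(pep_el, "occurrence", protein=protein_id, offset=str(offset))
    return ET.tostring(root, encoding="unicode")


def peptides_from_xml(xml_text):
    """Parse the XML back into {peptide: [(protein id, offset), ...]}."""
    root = ET.fromstring(xml_text)
    return {
        pep.get("sequence"): [
            (occ.get("protein"), int(occ.get("offset")))
            for occ in pep.findall("occurrence")
        ]
        for pep in root.findall("peptide")
    }
```

Because each occurrence is a flat record of peptide, protein and offset, the same structure would map directly onto rows of a relational table if the data were later moved into an SQL database.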

The program may compare an unknown sequence against the XML output of the hash search and identify the longest and most significant hits that map from the database onto the unknown sequence.

Generally, embodiments of the present invention employ various processes involving data stored in or transferred through one or more computers or data processing devices. Embodiments of the present invention also relate to apparatus and systems for performing these operations. The apparatus may be specially constructed for the required purposes, or it may be a general-purpose computer selectively activated or reconfigured by a computer program and/or data structure stored in the computer. The processes presented herein are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required method steps. A particular structure for a variety of these machines will appear from the description given above.

In addition, embodiments of the present invention relate to computer program code, computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media; semiconductor memory devices; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The data and program instructions of this invention may also be embodied on a carrier wave or other transport medium. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.

Although the above has generally described the present invention according to specific processes and apparatus, the present invention has a much broader range of applicability than the specific example given. For instance, the current version of the program uses a simple exact matching method and could be improved by adding a substitution scheme. This scheme could improve the sensitivity of the hash search by converting the four letters from the unknown sequence into all possible positive substitutions and looking up their locations in the hash table.
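A possible form of such a substitution scheme is sketched below in Python. The `POSITIVE` table is a hypothetical, deliberately tiny set of interchangeable residues chosen for illustration; a real implementation would presumably derive it from the positive entries of a scoring matrix. The hash table is again modelled as a dictionary keyed on four-letter words.

```python
from itertools import product

# Hypothetical toy table: residues treated as positive substitutions of each other.
POSITIVE = {
    "I": "ILV", "L": "LIV", "V": "VIL",
    "K": "KR", "R": "RK",
    "D": "DE", "E": "ED",
}


def substituted_words(word):
    """All words reachable by swapping each letter for a positive substitute."""
    choices = [POSITIVE.get(aa, aa) for aa in word]
    return ["".join(combo) for combo in product(*choices)]


def lookup_with_substitutions(word, table):
    """Collect hash-table locations for the word and its substituted variants."""
    locations = []
    for variant in substituted_words(word):
        locations.extend(table.get(variant, []))
    return locations
```

The number of variants is the product of the choices at each position, so a four-letter word drawn entirely from the three-member group above expands to 81 look-ups. This is the price paid for the increase in sensitivity.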

One of ordinary skill in the art would recognize other variants, modifications and alternatives in light of the foregoing discussion.

Once a database of all the signature peptides of 6-20 amino acid residues in length has been constructed for all known proteins, this knowledge can be exploited in a novel context for vaccine development. In this regard, the computer implemented method of the present invention can be used to identify signature peptides for inclusion in subunit vaccines. Subunit vaccines are defined as those containing one or more pure or semi-pure antigens, and may also be known as multiple subunit vaccines or multi-subunit vaccines. In order to develop subunit vaccines, it is critical to identify the individual components out of a myriad of proteins and glycoproteins of the pathogen that are involved in inducing protection. Indeed, some proteins, if included in the vaccine, may be immunosuppressive, whereas in other cases immune responses to some proteins may actually enhance disease. Thus, it is critical to identify those proteins that are important for inducing protection and eliminate the others. In utilising the present invention it is possible to identify specific proteins from most pathogens that are critical in inducing the immune responses. The phrase "immune response" refers to any cellular process that is produced in the animal following stimulation with an antigen and is directed towards the elimination of the antigen from the animal. The immune response generated by subunit vaccines will preferably include both humoral and cell-mediated immune responses.

The potential advantages of using subunits as vaccines are increased safety, less antigenic competition (since only a few components are included in the vaccine), the ability to target the vaccines to the site where immunity is required, and the ability to differentiate vaccinated animals from infected animals (marker vaccines). One of the disadvantages of subunit vaccines is that they generally require strong adjuvants, and these adjuvants often induce tissue reactions. Secondly, the duration of immunity is generally shorter than with live vaccines. In addition to using a whole protein as a vaccine, it is possible to identify individual epitopes within these protective proteins and develop peptide vaccines. The major disadvantages of peptide vaccines are, firstly, that they often need to be linked to carriers to enhance their immunogenicity and, secondly, that a pathogen can more easily escape immune responses to a single epitope than to a multiple epitope vaccine. To overcome some of these disadvantages, chimeric peptides can be made to broaden the immune response to different epitopes. Subunit vaccines are produced from one or more specific protein subunits of a microorganism, rather than the whole protein. Subunit vaccines according to the current invention are prepared using techniques known to a person skilled in the art. The methods for preparation of such a subunit vaccine may comprise, for example: a) introducing into a cell a nucleotide sequence encoding at least a peptide containing a signature peptide, in such a way that translation of the nucleotide sequence is possible when the sequence is within the cell; b) culturing the cell under conditions that allow expression of the peptide; and c) isolating the expressed peptide. The methods required for production of a subunit vaccine are routine to a person skilled in the art.

Traditionally, intervention strategies in vaccine development have been directed towards one gene at a time. These genes are then subsequently regrouped into a subunit vaccine. However, the ability to target genes rich in signature peptides according to the current invention offers a means of simultaneously targeting multiple genes. Where genes share multiple signature peptides, an intervention strategy directed against one such gene is most likely also to have an effect on other genes sharing similar vulnerable regions within their gene products, e.g. drug targets or the site being targeted by the immune defences of a suitably primed host.

Proteins destined as elements in subunit vaccines can be ranked with respect to the number of signature peptides contained in a given protein. Thereafter, proteins destined for inclusion in multiple subunit vaccination strategies can be ranked primarily with respect to occurrence in multiple pathogen pathways. Such multiple subunit vaccination strategies aim to knock out multiple pathogen targets by stimulating a protective immune response directed at a single pathogen protein that contains multiple signature peptides and is thereby capable of simultaneously targeting a multitude of pathogen pathways via the signature peptides it shares with other proteins.
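The first ranking criterion, the number of signature peptides contained in a protein, can be sketched in a few lines of Python. The function name and the input shapes (a mapping of protein identifiers to sequences, and a collection of signature peptides) are assumptions made for this illustration.

```python
def rank_by_signature_count(proteins, signature_peptides):
    """Sort protein IDs by how many distinct signature peptides each contains."""
    counts = {
        pid: sum(1 for pep in signature_peptides if pep in seq)
        for pid, seq in proteins.items()
    }
    # highest count first; ties broken alphabetically by protein ID
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For large inputs the substring test would be replaced by the hash look-up described earlier; the plain `in` test is used here only to keep the example short.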

Additional constraints can be placed upon signature-peptide containing proteins as a means of ranking their overall suitability for inclusion in the subunit vaccines. These procedures constitute a novel invention linked to the current method for identification of signature peptides.

The following is a summary of other parameters that are applied to proteins containing signature peptides to assess their overall suitability for inclusion in multi-subunit vaccines, i.e. a means of ranking a given whole protein's suitability for inclusion in recombinant vaccines:

1) The absolute number of distinct pathway hits for each signature peptide contained in a given protein, expressed as a combined total of all the distinct pathway hits.

2) The absolute number of distinct pathway elements targeted by a given protein (i.e. as distinct from the total number of distinct pathway hits), expressed as a combined total.

3) The relative importance of targeted pathways to an infectious agent's central metabolism. The frequency distribution of signature peptides across all fully sequenced genes and proteins can give a good indication (ranking) of such central metabolism conserved across all studied life forms and/or broken down into portions of the biological kingdom, e.g. plants, animals, prokaryotes, archaea, fungi, etc., or even down to more restricted phylogenetic groupings, e.g. Gram positive or Gram negative bacteria.

a. A desirable feature here is the degree of conservation of the central metabolism of the infectious agent, as opposed to the host organism's central metabolism, e.g. humans.

b. A ratio can also be established for particular pathogen proteins based on the number of pathogen paralogous signature peptides (unique to the pathogen) : number of homologous signature peptides (found elsewhere in all of biology) or just the host proteome.

4) The extent of surface exposure of signature peptides and thus the ease of likely recognition by the host's immune system.

5) Compatibility with MHC-I and MHC-II antigen presentation. It is noteworthy that the immune system presents information concerning foreign pathogens by the Major Histocompatibility Complex and such peptides form the basis of host recognition via humoral and cellular immunity.

6) Distinctness of signature peptides from host 'self', i.e. peptide sequences of a host organism such as humans. It is desirable that signature peptides of noxious pathogens should be distinct from host organism protein sequences and thereby have the greatest chance of engendering a pathogen-specific immune response. The virtue of using recombinant proteins containing many signature peptides enhances the statistical probability of engendering both a Th1 and Th2 response by peptide-mediated means.
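Some of the ranking parameters listed above lend themselves to simple arithmetic, sketched here in Python. The sketch makes several assumptions that are not in the text: the function names are hypothetical, pathway annotations are supplied as a mapping from each signature peptide to the set of pathways it hits, parameter 2 is read as the number of distinct pathways across the whole protein, and the ratio in 3b is computed against the host proteome only.

```python
def pathway_scores(peptide_pathways):
    """Return (total pathway hits, number of distinct pathways) for one protein.

    peptide_pathways maps each signature peptide in the protein to the set of
    pathway names in which that peptide occurs.
    """
    total_hits = sum(len(paths) for paths in peptide_pathways.values())  # parameter 1
    distinct = len(set().union(*peptide_pathways.values()))  # parameter 2 (assumed reading)
    return total_hits, distinct


def unique_to_shared_ratio(peptides, host_proteome):
    """Ratio of peptides absent from the host to peptides also found in the host."""
    shared = sum(
        1 for pep in peptides if any(pep in seq for seq in host_proteome)
    )
    unique = len(peptides) - shared
    # no shared peptides at all is reported as an infinite ratio
    return unique / shared if shared else float("inf")
```

A protein whose peptides are all absent from the host gives an infinite ratio, which also satisfies the distinctness criterion in item 6. Surface exposure and MHC compatibility require structural or immunological prediction and are not modelled here.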

In addition, recombinant molecular biological techniques can be exploited to further increase the copy number of one or more signature peptides occurring in a given whole protein as a means of further increasing the likelihood of presentation of signature peptides to both the humoral and cellular immune system. By such means, the signature-peptide containing whole protein acts as a close-to-native peptide carrier in its unaltered state, but this carrier capacity and the intended peptide-centric immunogenicity are further enhanced by manipulation to include multiple copies of one or more naturally occurring signature peptides and their subsequent presentation to a host immune system by any number of methods traditionally exploited in vaccinology.

Furthermore, a kit is provided which contains a subunit vaccine according to the present invention and means for administering the vaccine to an individual in need thereof. Means for administering the vaccine to an individual in need thereof would include a combination of the subunit vaccine with a pharmaceutically acceptable carrier or diluent to produce a pharmaceutical composition (which may be for human or animal use).

Generally, pharmaceutical compositions and/or vaccine compositions of the present invention will comprise a therapeutically effective amount of peptide containing a signature peptide coupled to an antigenic polypeptide which acts as an adjuvant.

The phrase "therapeutically effective amount" as used herein refers to an amount sufficient to stimulate by at least about 15%, preferably by at least 50%, more preferably by at least 90%, and most preferably completely, an animal's immune system, causing it to generate an immunological memory against the antigenic determinant.

The term "adjuvant" refers to a compound or mixture that enhances the immune response by having at least one antigenic determinant.

Examples of adjuvants include, but are not limited to, aluminium hydroxide, aluminium phosphate, aluminium potassium sulphate (alum), beryllium sulphate, silica, kaolin, carbon, water-in-oil emulsions, oil-in-water emulsions, muramyl dipeptide, bacterial endotoxin, lipid X, Corynebacterium parvum (Propionibacterium acnes), Bordetella pertussis, polyribonucleotides, sodium alginate, lanolin, lysolecithin, vitamin A, saponin, immunostimulating complexes (ISCOMs), liposomes, levamisole, DEAE-dextran, blocked copolymers or other synthetic adjuvants.

The phrase "pharmaceutically acceptable carrier or diluent" refers to molecular entities and compositions that are physiologically tolerable and do not typically produce an allergic or similarly untoward reaction when administered to a human. Preferably, as used herein, the term "pharmaceutically acceptable" means approved by a regulatory agency or other generally recognized pharmacopeia for use in animals, and more particularly in humans. The term "carrier" refers to a diluent, adjuvant, excipient, or vehicle with which the compound is administered. Such pharmaceutical carriers can be sterile liquids, such as water and oils, including those of petroleum, animal, vegetable or synthetic origin, such as peanut oil, soybean oil, mineral oil, sesame oil and the like. Water or soluble saline solutions and aqueous dextrose and glycerol solutions are preferably employed as carriers, particularly for injectable solutions. Diluents of various buffer content (e.g., Tris-HCl, acetate, phosphate), pH and ionic strength may be used, and additives such as detergents and solubilizing agents (e.g., Tween 80, Polysorbate 80), antioxidants (e.g., ascorbic acid, sodium metabisulfite), preservatives (e.g., Thimerosal, benzyl alcohol) and bulking substances (e.g., lactose, mannitol) may be added. Suitable pharmaceutical carriers are known to one of skill in the art.

The compositions may be prepared in liquid form, or may be in dried powder.

Pharmaceutical compositions may be for administration by injection, or prepared for oral, pulmonary, nasal or other forms of administration. The mode of administration of the complexes prepared in accordance with the invention will necessarily depend upon such factors as the stability of the complex under physiological conditions, the intensity of the immune response required, the type of pathogen etc.

Preferably, the complex is administered using standard procedures, for example, intravenously, subcutaneously, intramuscularly, orally or by aerosol administration.

After formulation, the vaccine may be incorporated into a sterile container which is then sealed and stored at a low temperature, for example, 4°C, or it may be freeze-dried. Lyophilisation permits long-term storage in a stabilised form.

Claims

CLAIMS:

1. A computer implemented method for identifying unknown signature peptides in a database, each signature peptide having a number of residues, comprising the steps of: i) producing a look-up table containing identification numbers to locate every occurrence of a known signature peptide from the database; ii) identifying identical signature peptides by matching each residue of an unknown peptide with a corresponding residue of a peptide from the database; iii) storing the identified signature peptides in a storage device; and iv) displaying the identified signature peptides that are within a predetermined range of residues.

2. A computer implemented method according to claim 1, where each residue is matched after a fast word matching is used.

3. A computer implemented method according to claim 2, where the fast word matching involves selecting starting points from the unknown signature peptides to be used in a hash function.

4. A computer implemented method according to claim 3, where the unknown signature peptide passes through a complexity filter after each starting point is selected.

5. A computer implemented method according to any previous claim, wherein the identified signature peptides pass through an expectation filter before storing said signature peptides in the storage device.

6. A computer implemented method according to any previous claim, wherein the predetermined range is set to from 6 to 20 residues.

7. A computer implemented method according to any previous claim, further comprising the step of sorting alphabetically a list of all the identified signature peptides without considering their frequency.

8. A computer implemented method according to any previous claim, which is used in a parallel processing system.

9. A data processing apparatus for identifying unknown signature peptides in a database, each signature peptide having a number of residues, comprising: i) a look-up table containing identification numbers to locate every occurrence of a known signature peptide from the database; ii) identifying means for identifying identical signature peptides by matching each residue of an unknown peptide with a corresponding residue of a peptide from the database; iii) storage device for storing the identified signature peptides; and iv) displaying the identified signature peptides that are within a predetermined range of residues.

10. The data processing apparatus according to claim 9, where each residue is matched after a fast word matching is used.

11. The data processing apparatus according to claim 10, where the fast word matching involves selecting starting points from the unknown signature peptide to be used in a hash function.

12. The data processing apparatus according to claim 11, where the unknown signature peptide passes through a complexity filter after each starting point is selected.

13. The data processing apparatus according to any of claims 9 to 12, wherein the identified signature peptides pass through an expectation filter before storing said signature peptides in the storage device.

14. The data processing apparatus according to any of claims 9 to 13, wherein the predetermined range is set to 6 to 20 residues.

15. The data processing apparatus according to any of claims 9 to 14, further comprising sorting means for alphabetically sorting a list of all the identified signature peptides without considering their frequency.

16. The data processing apparatus according to any of claims 9 to 15, wherein the apparatus is used in a parallel processing system.

17. Computer program code executable by a data processing device to carry out the method of any of claims 1 to 7.

18. A computer program product, comprising a computer readable medium bearing computer program code as claimed in claim 17.

19. A subunit vaccine containing peptides which have been identified as signature peptides using the computer implemented method of any one of claims 1 to 8.

20. A subunit vaccine according to claim 19 wherein the signature peptides are from 6 to 20 amino acids in length.

21. A subunit vaccine according to claim 19 or 20, wherein the selected signature peptides are compatible with MHC-I and MHC-II antigen presentation.

22. A subunit vaccine according to any of claims 19 to 21, wherein the selected signature peptides are distinct from peptide sequences of a host organism.

23. A subunit vaccine according to any of claims 19 to 22, comprising recombinant proteins containing multiple signature peptides.

24. A subunit vaccine according to any of claims 19 to 22 comprising recombinant proteins containing an increased copy number of one or more signature peptides.

25. A kit containing a subunit vaccine according to any of claims 19 to 24 and means for administering the vaccine to an individual in need thereof.

26. A method of producing a subunit vaccine comprising the steps of:- a) selecting one or more polypeptides containing a signature peptide using the computer implemented method according to any one of claims 1 to 8; b) expressing the polypeptide containing the signature peptide in a cell; and c) recovering the polypeptide containing the signature peptide from said cell.

27. A method of producing a subunit vaccine according to claim 26, wherein the selected signature peptides are compatible with MHC-I and MHC-II antigen presentation.

28. A method of producing a subunit vaccine according to claim 26, wherein the selected signature peptides are distinct from the peptide sequences of a host organism.

29. A method of producing a subunit vaccine according to claim 26, comprising recombinant proteins containing multiple signature peptides.

30. A method of producing a subunit vaccine according to claim 26, comprising recombinant proteins containing an increased copy number of one or more signature peptides.