WO2003015001A2

WO2003015001A2 - Method for identification of protein function

Info

Publication number: WO2003015001A2
Application number: PCT/GB2002/003244
Authority: WO
Inventors: Elie Giraud; Jérôme GOMAR; Konstandino Kosmatopoulos; Roger Lahana; Anthony Rees
Original assignee: Synt:Em S.A.
Priority date: 2001-08-03
Filing date: 2002-07-15
Publication date: 2003-02-20
Also published as: WO2003015001A3

Abstract

The present invention relates to a method of determining functions of protein sequences using computational methods. The method involves the identification of regions within a query protein which have a desired function, wherein the identification is independent of sequence homology. The method utilises a residue independent computational model whose inputs are not sequences but physicochemical and topological parameters derived from those sequences.

Description

METHOD FOR IDENTIFICATION OF PROTEIN FUNCTION

The present invention relates to methods of determining functions of protein sequences using computational methods.

It has been estimated that of the many thousands of genes (in excess of 30,000) in the human genome approximately 50% will initially be of unknown function (Fischer & Eisenberg, 1999). One route to understanding the function of proteins is via their 3D structures. The comparison of 3D protein structures has shown that tertiary structure is more conserved than primary structure during evolution (Blundell et al, 1987). A large number of proteins are known that have similar or related functions and the same fold, but often have low homology at the level of the protein sequence (Thornton et al, 1991; Holm and Sander, 1994). It has been estimated that up to one third of known sequences would be homologous to at least one protein of known 3D structure (Chothia, 1992; Orengo et al, 1994).

The problem is that currently only a few thousand 3D protein structures are known, while several tens of thousands of primary sequences are known. Many proteins have similar functions, for example kinase activity, or the ability to direct the protein to a particular sub-cellular compartment. In determining the function of the sequences, one approach is to look for sequence homologies to proteins - or regions thereof - whose function is known. A problem in the art is how to handle sequences whose homology to known protein families is below the level at which current methods are able to assign them to a particular structural or functional family. Of course, if the sequence is entirely new so that no other protein is known with that structure or function, then by definition it is a new family member. Again, for some protein families the functional signatures may not map directly onto structure, so that dissimilar structures may have functional similarity (e.g. bacterial and mammalian serine proteases). In these cases, three dimensional structural information may be insufficient and additional information, such as sequence motifs, common residue clusters or characteristic surface properties, will be required (see Orengo et al, 1999).

Numerous methods have been developed that seek to provide structural/functional relationships between protein sequences where the information is extracted directly from the sequence. Some frequently used examples are listed below, details of which are found on the ExPASy Molecular Biology Server (www.expasy.ch).

Approaches include PROSITE, a method for detecting active sites and patterns in protein sequences (Bairoch, 1991); PFAM/SCOP analysis, a protein domain assignment method (Murzin et al, 1995); COGs (Clusters of Orthologous Groups) analysis, a method of protein function prediction; Superfamily assignments; Functional categories assignment (Bray et al, 2000); Signal peptide presence, for secreted or membrane proteins; Transmembrane region assignment (Wallin & Heijne, 1998); Secondary structure prediction (Rost & Sander, 1995; Garnier et al, 1996; see also the following servers: http://www.nbrf.georgetown.edu/pirwww/, http://dodo.cpmc.columbia.edu/pp/predictprotein.html, or http://www.expasy.ch/); and Homologous 3D structure check (Laskowski, 1993).

An example of a method that can be used to assign a sequence to a particular 'fold' (structure) family is fold assignment by sequence similarity to a protein of known 3D structure. The reported fraction of fold assignments in the various genomes amounts only to about 10-20% of sequences. In outline, fold assignment includes the steps of:

1) Sequence alignment of the target to the template (e.g. BLAST (Altschul et al, 1990); PSI-BLAST (Deleage et al, 1997); FASTA (Pearson & Lipman, 1988), using either pairwise algorithms (Smith and Waterman, 1981) or multiple alignment algorithms (Needleman and Wunsch, 1970, and Thompson et al, 1994 (CLUSTAL W)));

2) Identification of structurally conserved regions (SCRs) from the template(s) and use of these to form the framework of the target model;

3) If needed, a check of the secondary structure prediction in the template / target, after which the alignment is modified by hand if necessary;

4) Building of the remaining structurally conserved variable regions;

5) Building of the side-chains; and

6) Checking the 'goodness' of the postulated model (e.g. using PROCHECK).

The process may be repeated from step 1 one or more times until step 6 is acceptable.

For this approach to work, about 30% sequence identity is usually necessary between the template and the target. It also requires full length sequence alignment where gaps or insertions can be positioned with confidence. Even so, errors in alignment occur - known as "misleading local sequence alignments" (MLSAs), where an apparently unambiguous alignment between two proteins may not reflect the correct "threading" of the sequence onto the structure (see Saqi et al, 1998).

There are also methods currently used to identify new proteins with below-threshold sequence similarity. These include sequence profiling (Gribskov et al, 1990) and sequence motif searching (Bork & Gibson, 1996). Sequence profile methods use evolutionary information from neighbouring sequences in the sequence database to build a profile. An iterative sequence profile method able to detect distant relationships is PSI-BLAST.

A further method is fold assignment, or 'threading'. These methods explicitly incorporate structural information from available 3D protein structures. In many cases these methods can detect distantly diverged proteins as well as unrelated proteins with a similar fold (see PROTEINS: 23(5), 1995; Suppl 1, 1997 and Suppl 3, 1999 for further details of methods and results). However, unless at least one 3D protein structure of the family is known, the method will be unable to assign the new sequence to a structural or functional family.

Fold assignment methods are at least as successful as sequence profiling methods and, in addition, are able to assign another 10-15% of open reading frames (ORFs) from genome sequencing projects. Furthermore, some of the predictions from fold assignment methods are not detectable using sequence based methods. Conversely, sequence based methods sometimes identify distant relationships that fold assignment methods do not detect. This is because the sequence methods incorporate evolutionary information from neighbouring sequences whereas traditional fold methods typically do not.

As of today, the majority of the fold assignments correspond to ORFs with known functions, unevenly covering the various functional classes (Riley, 1993). The lowest number of assigned folds corresponds to proteins in the membrane, ribosomal, transcriptional and "unknown function" categories. 30-40% of ORFs in the various genomes are of unknown function. Of these, half have no matches in the sequence database and are classified as sequence orphan ORFs or "sequence ORFans". Most are likely to be very distant members of existing families; however, fold assignment methods succeed in predicting folds for only a small number of ORFans (Dubchak et al, 1998). Sequence based methods are not able to assign folds to sequence ORFans because by definition ORFans have no sequence neighbours. Thus, in order to assign folds or function to the majority of sequence ORFans, more sensitive methods will be required.

Another problem in relation to protein structure is in providing an accurate and reliable theoretical method to identify peptides that bind with high affinity to HLA molecules. The identification of tumour and virus immunogenic epitopes is of great importance for the design of tumour and virus vaccines. The most common property of all the immunogenic peptides is their high affinity for the HLA molecule (Sette et al, 1994; Oukka et al, 1996; van der Burg et al, 1996; Tourdot et al, 1997). Hence, considerable effort has gone into measuring the affinity of peptides to HLA molecules. A reliable theoretical method to identify peptide sequences within a given antigen that bind strongly to HLA would, therefore, be of great utility for the selection of immunogenic peptides, provided it is both efficient and accurate.

Affinity for HLA principally depends on the allele-specific pattern of conserved residues at particular positions in the peptide, the primary anchor motifs (Rammensee et al, 1995; Engelhard, 1994). Although the large majority of immunogenic epitopes possess the allele-specific primary anchor motifs, the presence of these motifs is not a sufficient condition for a peptide to show strong binding. Secondary anchors and deleterious residues at non-conserved positions also influence the peptide-HLA interaction (Ruppert et al, 1993). Selection of high affinity peptides cannot therefore be achieved exclusively on the basis of the presence of primary anchor motifs - less than 30% of peptides having the allele-specific primary anchor motifs exhibit strong binding (Gulukota et al, 1997). Nor can the selection be carried out on the basis of the extended motif pattern, which takes into account not only primary but also secondary anchors and deleterious motifs. In fact, when this approach is used (Parker et al, 1994) no more than 50% of high affinity peptides are identified.

The main problem with the current motif based selection methods is the implicit assumption that the side chains of individual residues bind to the HLA in an independent manner. However, current knowledge on peptide-HLA interactions provided by recent crystallographic data, and by our own studies, does not fit with this assumption (Tourdot et al, 1997; Madden et al, 1993). For instance, D^b affinity seems to depend on the presence of favourable 'sequences' rather than on the presence of favourable 'residues' at positions interacting with the D^b molecule (Tourdot et al, 1997). This strongly suggests that the efficient selection of high affinity immunogenic peptides requires a residue-independent, sequence-dependent affinity prediction model. Further, given the present state-of-the-art in sequence comparison it is unlikely that methods relying solely on identification of motifs in sequence space will be successful.

Grassy et al (1998) disclose a method for computer-assisted rational design of immunosuppressive compounds. The reference describes the analysis of a set of peptides for immunosuppressive activity. A learning set of inactive and active peptides was analysed by a range of topological descriptors, and a set of topological descriptors for the active set of peptides was defined. The descriptors were used to screen a virtual combinatorial library of peptides which was generated based on a partially randomised 10mer consensus sequence in which positions 1, 5 and 10 were fixed as arginine, arginine, and tyrosine respectively.

Disclosure of the Invention.

We have developed novel methodology for the identification of regions within proteins which have a desired function, wherein the identification is independent of sequence homology.

The method utilises a residue independent computational model (SCIPS, for Sequence Comparison In Property Space) whose inputs are not sequences but physicochemical and topological parameters derived from those sequences. The ability of this approach to identify and classify within the same 'activity' family sequences that show very low homology has applications in any situation where relationships between distant sequence patterns are sought.

In a first aspect, the invention provides a method for determining whether a query protein sequence has a functional property of interest, which method comprises:

(i) providing a dataset of proteins which share the functional property of interest;

(ii) encoding a plurality of physicochemical and/or topological descriptor parameters for at least one frameset of each member of the dataset;

(iii) determining a set of said parameters which describe the at least one frameset; and

(iv) scanning a query protein sequence for the presence of a frameset which matches said parameters.

In a second aspect, the invention provides a method for determining whether a query protein sequence has a functional property of interest, which method comprises:

(i) providing a dataset of proteins which share the functional property of interest;

(ii) determining for each protein of said dataset at least one frameset, the frameset being a region within the protein which imparts to the protein said functional property of interest;

(iii) encoding a plurality of physicochemical and/or topological descriptor parameters for each frameset;

(iv) determining from the encoded descriptor parameters a set of descriptor parameters and their values which describe each frameset and which are indicative of said functional property of interest; and

(v) scanning a query protein sequence for the presence of a frameset which matches said set of descriptor parameters.

A further aspect of the invention provides a computer system which is operatively configured to implement the method of the first or second aspect.

By "a computer system" we mean the hardware means, software means and data storage means used to determine whether a query protein sequence has a functional property of interest according to the present invention. The minimum hardware means of a computer-based system of the present invention typically comprises a central processing unit (CPU) , working memory and data storage, and e.g. input means, output means etc. The data storage may comprise magnetic storage media such as floppy discs, hard disc storage medium and magnetic tape; optical storage media such as optical discs or CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. Examples of such systems are microcomputer workstations available from Silicon Graphics Incorporated and Sun Microsystems running Unix based, Windows NT or IBM OS/2 operating systems.

Further aspects of the invention provide (i) computer software code for implementing the method of the first or second aspect, and (ii) a computer programming product carrying the software code.

These and further aspects of the invention are described herein below.

Description of the Drawings.

Figure 1 shows a flow chart which schematically illustrates the invention.

Figure 2 shows histograms and corresponding normal distributions of the measured relative affinity (RA) values for 120 immunogenic and 72 non immunogenic sequences.

Figure 3A & B shows data relating to validation of the model.

Figure 4 shows the immunogenicity of peptides belonging to the external validation set.

Figures 5a and b respectively show distributions of the BiMass and SFPEITHI scores for the high and low affinity peptides of the external validation set.

Detailed description of the invention.

Definitions.

Query Protein Sequence.

This is any protein sequence available to those of skill in the art, for example a protein sequence available on a public or private database, whether with an ascribed function or unknown function. The protein sequence may be of any species origin, including human, primate, mammalian, vertebrate, insect, yeast, eukaryotic or prokaryotic. The sequence may be a confirmed protein sequence (i.e. based on direct protein sequence or a translation of an mRNA) or a hypothetical protein sequence based upon an open reading frame of genomic DNA, or a de novo designed sequence.

Dataset of Proteins.

The dataset of proteins may be a set of proteins associated with the functional property of interest. The proteins may also be of any species origin, including human, primate, mammalian, vertebrate, insect, yeast, eukaryotic or prokaryotic. Depending upon the property of interest, the dataset may be limited to proteins of a single species or it may comprise sequences from different organisms.

The dataset may further comprise synthetically generated sequences, for example sequences which have been synthesised in a random or semi-random manner and selected to have the property of interest.

The proteins may be full length sequences, partial sequences or peptides, or mixtures thereof. Where the sequences are partial sequences or peptides, these will be of sufficient length to be associated with the functional property. This will usually be at least 5, more generally at least 8, such as at least 9 or 10 amino acids in length.

The size of the dataset will be dependent upon a number of circumstances, including the number of sequences available in the art and the discrimination achieved by the topological parameters. Usually, it is desirable that the dataset comprises at least 5, preferably at least 10, more preferably at least 20 members. The maximum size of the dataset will be dependent only upon the numbers of sequences available and the computational power available for their analysis. However, a dataset of up to 500 members is feasible.

The term "proteins" in relation to the dataset is used to mean any polypeptide sequence, including short sequences (e.g. 5 or more amino acids) which are often referred to in the art as "peptides" or "polypeptides".

Functional Property of Interest.

This is any property associated with a set of sequences and known to a person of skill in the art. In the accompanying example, we have provided a dataset of peptides associated with HLA-binding properties. However, this is not limiting and the principle illustrated may be applied to any other property. Such properties include domains with enzymatic function, e.g. kinase, phosphatase or acetylase activity, domains which bind target molecules such as other proteins or DNA, domains which act as agonists or antagonists of biological function, proteins involved in signal transduction such as GPCRs, intracellular effector proteins and families as defined in the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/).

In one preferred aspect, the dataset is one in which the proteins contain regions of sequence homology associated with said functional property of interest. By "sequence homology", it is meant that those of skill in the art operating available sequence alignment algorithms are able to determine at least one region in the members of the dataset whose sequences contain regions of alignment of statistical significance. For example, the algorithm BLAST, mentioned above, may be used with default parameters to determine regions of sequence homology of at least 30% identity. This level of homology is often associated with a common evolutionary origin and hence related function of the protein. The region of homology may not be across the entire length of the protein, but within a frameset of the protein, as defined herein.

Frameset.

Proteins which share a functional property of interest may do so because of a region of sequence within the protein which imparts such a property to the protein. This is well known as such by those of skill in the art. For example, DNA-binding domains, zinc-finger domains, transmembrane domains, signal sequences and the like are found in subdomains of proteins and often such a subdomain may be transferred ("donated") to other "recipient" protein sequences to transfer the property from donor to recipient.

Sometimes, the property may be discontinuous within the protein. For example, the target binding of an antibody variable domain is primarily determined by the properties of three hypervariable regions in each of a light and heavy chain variable domain which are separated by framework regions.

In the present invention, we refer to the region within the protein associated with the property of interest as a frameset. The frameset may be continuous or discontinuous to cover 2 or 3 different regions. The frameset is the region from which the descriptor parameters are determined.

Generally, the frameset will define sequences of substantially the same length, although some variation in sequence length is permitted. This is because functional domains shared by proteins will often differ in length to some extent. The actual size of the frameset will be determined by factors which will vary in each case, taking account of the functional property of interest and the capacity of the computational model. For example, in the accompanying example the frameset is a 9mer-10mer size. The frameset may be in this range, or greater, for example in the range of from 8 to 50 amino acids, such as from 8 to 40, for example from 10 to 25 amino acids.

In the accompanying example, the frameset is the same size as the peptides of the proteins of the dataset. In other instances, the frameset may be defined as a sequence within a longer protein sequence. For example, a dataset of proteins which share a common property of interest may be aligned so that the region of each protein associated with the property is shown aligned, irrespective of where the region appears in relation to the rest of the protein in which it is located (see Figure 1 box 1). This region is then used to determine a frameset, for which descriptors are calculated.

Physicochemical and Topological Descriptors.

This includes any physical or chemical property of a molecule, including topological parameters of the molecule. It may include either properties which are "static" (at least in a time-averaged sense) such as the dipole moment of a molecule, and/or "dynamic", such as ones characterising the range of conformations through which the molecule may flex over a period of time. In the case of some molecules, the flexing of the molecule over time can be determined with high accuracy using modern molecular modelling techniques. In the present context, a frameset of a protein is also considered to be a molecule for which descriptor parameters can be measured or calculated.

A range of descriptors are defined in commercially available packages, such as TSAR™ software (Oxford Molecular, UK).

Examples of descriptors are found in Grassy et al (1998) and in WO99/33598. Descriptors include molar mass, ellipsoidal volume, molar volume, lipophilicity, dipole moment, total number of N or C or O atoms, number of methyl groups, and various topological, connectivity and shape indices, such as the Wiener or Balaban indices, the Kier Chi indices and the like. These and many other descriptors are defined in the "Tsar Reference Guide", published by Oxford Molecular (2000).
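As an illustration of one topological descriptor of the kind listed above, the Wiener index of a molecular graph is the sum of the shortest-path distances between all pairs of atoms. The following is a minimal sketch only and is not the TSAR implementation; the adjacency-dictionary input format and the two toy hydrogen-suppressed carbon skeletons are assumptions made for the example.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered atom pairs of a connected graph."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    # Every pair was counted twice, once from each end.
    return total // 2


# Hypothetical toy inputs: carbon skeletons with hydrogens omitted.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}        # linear chain of four atoms
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}        # one central atom, three neighbours
```

The branched skeleton gives a smaller value than the linear one, which is why indices of this kind are sensitive to molecular shape as well as size.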

These and other descriptors may be used in the present invention.

Description of the invention.

Figure 1 outlines an embodiment of the invention. In part (1), a set of protein sequences (shown by six parallel lines) each contain a region of homology (thick line) which is associated with a function common to all sequences.

The region of homology defines a frameset. The parameters of the frameset are encoded to provide a plurality of descriptors (step 2 of Figure 1). Usually, from 10 to 100 descriptors, such as from 15 to 60 descriptors, are encoded.

Desirably, a plurality of sequences which are not associated with the property are also encoded ("inactive sequences").

These may be randomly selected sequences. Inactive sequences will be of substantially the same size as the frameset, and similar in number. For example, where the number of protein sequences having activity is a number x, then from x/2 to 2x inactive sequences will likewise be encoded.
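The sizing rule above (between x/2 and 2x inactive sequences, of substantially the same size as the frameset) can be sketched as follows. This is an illustrative assumption about how a random set might be assembled: the function name, the uniform sampling over the 20 standard amino acids and the example peptides are all hypothetical, and a randomly generated sequence is not guaranteed to lack the property, so in practice such sequences would still need to be checked.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def random_inactive_sequences(active, n_inactive, seed=0):
    """Draw random sequences whose lengths mirror those of the active framesets."""
    x = len(active)
    if not (x / 2 <= n_inactive <= 2 * x):
        raise ValueError("number of inactive sequences should lie between x/2 and 2x")
    rng = random.Random(seed)          # seeded for reproducibility
    lengths = [len(seq) for seq in active]
    known = set(active)
    decoys = []
    while len(decoys) < n_inactive:
        length = rng.choice(lengths)   # same size distribution as the framesets
        candidate = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if candidate not in known:     # never reuse an active sequence as a decoy
            decoys.append(candidate)
    return decoys


# Hypothetical active framesets (invented 9mers and one 10mer).
actives = ["ALAKAAAAV", "GLGGGGGGL", "KMKKKKKKI", "SLSSSSSSSV"]
decoys = random_inactive_sequences(actives, 6)
```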

Having encoded the sequences, the descriptors are analysed in order to determine a set of descriptors and their values which describe the frameset (step 3 of Figure 1) . Preferably when inactive sequences have been encoded, the descriptors and their values are not common to the set of inactive sequences. A way of analysing and selecting the descriptors is as follows:

1- Intercorrelated descriptors are first either removed by using standard statistical practices, or decorrelated through algorithms known in the art such as principal component analysis (PCA) or Gram-Schmidt orthogonalisation.

2- From the descriptors obtained in the previous step, a set of descriptors is selected in whose space the set of active sequences is well separated from the set of inactive sequences. The ideal situation is when there is no overlap between the two regions of the space. However, a person of skill in the art has various ways to handle cases where there is a certain overlap and to determine the smallest possible set of descriptors exhibiting the smallest possible overlapping region.

Such descriptor space is not necessarily a linear one and various computational techniques are available in the art to perform such an analysis, including neural networks, genetic algorithms, partial least squares (PLS), fuzzy logic and the like. The precise means of determination may be selected by a person of skill in the art, taking account of the nature and number of sequences to be analysed and personal preference. In the accompanying example we have used a neural network, and this is preferred.

The process will provide a set of descriptor parameters for a frameset which are indicative of the property of interest. The number and nature of the descriptors selected will depend upon the frameset and property selected, and will differ on a case-by-case basis. Usually, about 15-40 descriptors will be selected, though this is not fixed.

Having defined the descriptors, the method then comprises scanning a query protein sequence for the presence of a frameset which has parameters for the selected descriptors which match those which are indicative of the property. In Figure 1 frame 4 the frameset is shown as a box which is moved stepwise (e.g. 1 amino acid at a time) along the protein sequence, wherein the values of the selected descriptors are calculated for the residues within each box. However, those of skill in the art will appreciate that this is schematic and not limiting, and that in the present context "scanning" means inspecting successive regions of the sequence. Thus for example, the scanning may be directed to pre-selected regions, or linearly along the protein sequence in steps of more than one amino acid.
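The stepwise scan described above can be sketched as a sliding window over the query sequence. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the `describe` and `matches` callables stand in for the real descriptor calculation and the trained model, neither of which is reproduced here. The toy descriptor used in the example (fraction of hydrophobic residues) is a placeholder and is not one of the descriptors of the invention.

```python
def scan_frames(sequence, frame_length, step=1):
    """Yield (start, frame) for successive windows of the sequence."""
    if frame_length <= 0 or step <= 0:
        raise ValueError("frame_length and step must be positive")
    for start in range(0, len(sequence) - frame_length + 1, step):
        yield start, sequence[start:start + frame_length]


def find_matching_frames(sequence, frame_length, describe, matches, step=1):
    """Return the windows whose descriptor values satisfy the match rule."""
    hits = []
    for start, frame in scan_frames(sequence, frame_length, step):
        if matches(describe(frame)):
            hits.append((start, frame))
    return hits


def hydrophobic_fraction(frame):
    """Placeholder descriptor: share of residues in a small hydrophobic set."""
    return sum(residue in "AILMFVW" for residue in frame) / len(frame)


hits = find_matching_frames("KKKLLLLKKK", 4, hydrophobic_fraction,
                            lambda value: value >= 0.75)
```

Setting `step` to a value greater than 1 corresponds to scanning in steps of more than one amino acid.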

For example, in the accompanying example, we have defined a set of descriptors for peptides which bind to HLA. The frameset of 9-10mers was defined by reference to a set of 22 descriptors. We also found that the frameset was characterised by strong motifs at position 2 and the C-terminus. Thus each protein query sequence was pre-scanned to select framesets comprising these motifs, and these selected framesets were then analysed by the chosen descriptors.
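The pre-scan for anchor motifs can be sketched as a filter applied before any descriptors are calculated. This is a sketch under assumptions: the function name is hypothetical, and the permitted anchor residues (L, M, I, V, A at position 2 and at the C-terminal position) are taken from the description of the peptide database later in the example rather than from a stated filtering rule.

```python
ANCHOR_RESIDUES = frozenset("LMIVA")  # assumed anchor set, see lead-in


def candidate_frames(sequence, lengths=(9, 10), anchors=ANCHOR_RESIDUES):
    """Return (start, frame) for 9mer/10mer windows carrying both anchor motifs."""
    hits = []
    for length in lengths:
        for start in range(len(sequence) - length + 1):
            frame = sequence[start:start + length]
            # Position 2 is index 1; the C-terminus is the last residue of the frame.
            if frame[1] in anchors and frame[-1] in anchors:
                hits.append((start, frame))
    return hits


# Hypothetical query sequence, invented for the example.
selected = candidate_frames("GLAAAAAAVK")
```

Only the frames returned by such a filter would then be passed on for descriptor calculation, which reduces the number of windows the model has to score.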

The invention thus provides for the provision of novel peptides which have a property of interest, as well as for the identification of sequences which have a property of interest but which do not have sufficient sequence homology for this property to be identified by conventional methods.

Where the invention provides for the provision of novel peptides, a further aspect of the invention is a peptide obtained by the process of the invention, as well as a composition comprising said peptide plus a pharmaceutically acceptable carrier or diluent.

Peptides of particular interest which may be obtained in accordance with the invention include peptides which bind HLA Class I or Class II antigens, receptors and/or their cognate ligands, enzymes and protein-protein interaction inhibitors.

The invention is illustrated by the following example:

Introduction.

This example illustrates the invention in relation to a method for the prediction of HLA-A*0201 affinity, the selection of immunogenic HLA-A*0201 bound peptides, and the experimental test of the predictions. A unique feature of this computational method (SCIPS for Sequence Comparison In Property Space) is that it performs the selection in "property space" rather than sequence space and, as such, is capable of finding 'family' relationships among groups of peptides that are not identifiable using conventional sequence comparison methods. The model operates by the simulation of an artificial neural network whose inputs are not amino acid sequences but physicochemical and topological descriptors derived from those sequences. The model can identify 86.8% of high affinity peptides with a probability of correct prediction of 94.3%. More importantly, it is able to predict 88.6% of immunogenic peptides with an equally high probability (85.3%), leading to the possibility of creating an almost complete immunogenic epitope map of any tumour or virus antigen.

Immunogenic/nonimmunogenic peptide discrimination on the basis of HLA-A*0201 affinity.

The aim of this study was to create a computational model for the identification of peptides exhibiting affinities for HLA-A*0201 sufficiently high to ensure their immunogenicity. It was therefore necessary to define the affinity threshold discriminating immunogenic from nonimmunogenic peptides under the experimental conditions of the HLA-A*0201 affinity measurements. 192 peptides with various HLA-A*0201 affinities were tested for their capacity to elicit a CTL response in HHD mice. Previously, we and other groups have shown that peptides immunogenic in HLA-A*0201 transgenic mice, such as HHD and A2/Kb mice, are also immunogenic in humans and, conversely, that peptides nonimmunogenic in HLA-A*0201 transgenic mice are nonimmunogenic in humans (van der Burg et al, 1996; Bakker et al, 1997). Each peptide was tested in more than twelve HHD mice in several independent experiments. Peptides were considered immunogenic when a) the specific lysis of induced CTL was at least 15% above the nonspecific lysis and b) specific CTL were generated in more than 20% of primed mice. According to these criteria 120 peptides were found immunogenic and 72 nonimmunogenic. The distribution of their relative affinity (RA) values is shown in Figure 2. As expected, immunogenic peptides exhibited a much higher affinity (mean RA=0.01) than nonimmunogenic peptides (mean RA=1.08). The RA distributions of immunogenic and nonimmunogenic peptides exhibited only a small overlap. By using a confidence threshold of 90%, immunogenic peptides rank between RA -0.8 and 0.7 and nonimmunogenic peptides between RA 0.5 and 1.6. 91.1% of peptides with RA<0.7 were immunogenic. This percentage was 25% and 0.2% for peptides with RA=0.7-1 and RA>1 respectively. Since 113 out of 120 immunogenic peptides (94.2%) had an RA<0.7 and 61 out of 72 nonimmunogenic peptides (84.7%) had an RA>0.7, an RA=0.7 was considered as the affinity threshold discriminating immunogenic from nonimmunogenic peptides (p<0.0001).
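The percentages quoted for the RA=0.7 threshold follow directly from the four reported counts, as the short check below shows. It is a sketch only: the function and key names are hypothetical, and it assumes that every nonimmunogenic peptide not counted as RA>0.7 falls below the threshold (that is, that no peptide sits exactly at RA=0.7).

```python
def threshold_summary(imm_below, imm_total, nonimm_above, nonimm_total):
    """Percentages implied by a single RA cut-off, from the four reported counts."""
    # Assumption: nonimmunogenic peptides not above the threshold are below it.
    nonimm_below = nonimm_total - nonimm_above
    return {
        "immunogenic_below_threshold": round(100 * imm_below / imm_total, 1),
        "nonimmunogenic_above_threshold": round(100 * nonimm_above / nonimm_total, 1),
        "below_threshold_immunogenic": round(
            100 * imm_below / (imm_below + nonimm_below), 1),
    }


# Counts taken from the text: 113 of 120 immunogenic, 61 of 72 nonimmunogenic.
summary = threshold_summary(113, 120, 61, 72)
```

Under that assumption, 113 + 11 = 124 peptides fall below the threshold, of which 113 are immunogenic, which reproduces the 91.1% figure.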

Properties of the peptide database.

172 peptides extracted from 18 antigens were included in the database for the training of the neural network. 110 peptides were 9mers and 62 peptides were 10mers. 116 peptides exhibited a high (RA<0.7), 16 peptides an intermediate (RA=0.7-1.0) and 40 a low (RA>1.0) HLA-A*0201 affinity. Their sequence characteristics are illustrated in Table 1. Except at the anchor positions P2 and P9/10, occupied by any of L/M/I/V/A in the large majority of peptides (91.3%), there was a fair representation of all 20 amino acids.

Construction of the neural network model.

The architecture of the back-propagation neural network, the transfer parameters and the convergence RMS necessary to obtain good generalized performance were optimised by trial and error with the help of the internal validation set formed by a random choice of 30% of the database. Numerous combinations of 60 descriptors were tested and an iterative selection procedure was followed by displaying the dependencies of the output variables on each input (descriptor) variable. For each descriptor combination, particular attention was paid to exclude combinations exhibiting a correlation of 0.7 or higher. Moreover, care was taken to keep the network sufficiently small in terms of the number of weights to be computed. In practical terms, the ratio p = Number of input samples / Number of weights to be evaluated was kept within the range 1.8 < p < 2.2 (Tetko et al., 1993).
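The correlation screen mentioned above can be sketched as follows. This is a minimal illustration and not the procedure implemented in the TSAR software: the descriptor names and values are invented, and the use of the absolute Pearson coefficient is an assumption, since the text states only that combinations exhibiting a correlation of 0.7 or higher were excluded.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(descriptors, threshold=0.7):
    """Return descriptor pairs whose absolute correlation is >= threshold.

    `descriptors` maps a descriptor name to its values over all peptides.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(descriptors.items(), 2):
        if abs(pearson(a, b)) >= threshold:
            flagged.append((name_a, name_b))
    return flagged


# Hypothetical descriptor values for five peptides (illustrative only).
example = {
    "ellipsoidal_volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total_dipole": [2.1, 3.9, 6.2, 8.0, 9.9],   # nearly proportional to volume
    "balaban_index": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to the others
}
print(correlated_pairs(example))
```

A descriptor combination containing any flagged pair would be discarded before training under this reading of the rule.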

According to these rules of thumb, it was found that 22 input neurons (descriptors), related to the size (Inertia Moment 1 size, Inertia Moment 2 size, Inertia Moment 1 length, Ellipsoidal volume), the distribution of atomic partial charges (total dipole, dipole moment X, Y, Z component), the lipophilicity (log P octanol/water, total lipole, lipole X, Y, Z component) and the topology (Kier chi3 cluster, Kier chiV3 cluster, Kier chi5 ring, Kier chi6 ring, Balaban index, Flexibility Index, Number of Oxygen atoms, Number of Sulfur atoms) of the peptides studied, were required to describe the molecules. These descriptors were used in two neural network architectures: 22:3:1 and 22:4:1.

A network comprising three neurones in the hidden layer, a noise level equal to 0.03 and a convergence rate of 0.03 was selected as the final model since it gave the best simulation results. 120 peptides were used as the learning set and the remaining 52 peptides (39 9mers and 13 10mers) were used as the internal validation set. Graphical plots of the experimental versus predicted RA values for the compounds constituting the internal validation set are shown in Figure 3A₁. 45 of the 52 peptides (86.5%) were correctly classified in the high and low affinity classes by using the threshold RA=0.7, with a good correlation between predicted and experimental RA (mean of ΔRA=0.4 and r=0.86) (Fig 3A₁). The classification was efficient for both 9mers (84.6%) (Fig 3A₂) and 10mers (92.3%) (Fig 3A₃). To evaluate whether the model could efficiently identify strong binders (RA<0.7) we classified peptides in four different groups: true positive (TP; strong binders well predicted), true negative (TN; intermediate/weak binders well predicted), false positive (FP; predicted strong binders which exhibit in fact an intermediate/low affinity) and false negative (FN; predicted intermediate/weak binders which exhibit in fact a high affinity). We then calculated four different parameters: a) sensitivity (SEN): proportion of high affinity peptides that are correctly identified (TP/(TP+FN)), b) specificity (SPE): proportion of intermediate/low affinity peptides that are correctly identified (TN/(TN+FP)), c) positive predictive value (PPV): probability that a predicted strong binder has an experimental high affinity (TP/(TP+FP)) and d) negative predictive value (NPV): probability that a predicted intermediate/weak binder has an experimental intermediate/low affinity (TN/(TN+FN)).
The ideal theoretical model for the identification of high affinity peptides must combine a high sensitivity (detection of a high percentage of strong binders), a high specificity (good discrimination between strong and intermediate/weak binders), a high PPV (low percentage of false positive peptides) and a high NPV (low percentage of false negative peptides). Figure 3 shows that 86.8% of strong binders were identified with a high probability (PPV=94.3% and NPV=70.2%). This percentage was 83.3% and 100% for the 9mers and 10mers respectively.

Prediction of new binding sequences.

In order to exploit the SCIPS model for finding new binding sequences and to validate it by an external validation set, we computationally screened the sequences of four tumour antigens (hTERT, HER-2/neu, PSMA and NPM/ALK). For hTERT and HER-2/neu these protein sequences were scanned for framesets of both 9mers and 10mers, while for PSMA and NPM/ALK the frameset was limited to 9mers. Interest was focused on frames which defined peptides with the HLA-A*0201 specific primary anchor and strong motifs (L/M/V/I/A at P2 and C-terminal P), referred to hereafter as HLA-A*0201 motifs, because these peptides are likely to be HLA-A*0201 epitopes. From a total of 556 peptides having the HLA-A*0201 motifs, 135 peptides were predicted to have a high affinity (50 hTERT, 45 HER-2/neu, 24 PSMA, 16 NPM/ALK) and 421 peptides were predicted to have an intermediate/low affinity (185 hTERT, 158 HER-2/neu, 41 PSMA, 37 NPM/ALK). 48 of the 556 peptides were randomly selected to be tested for their affinity in blind experiments (Table 2).
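The frame scan described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `scan_frames` and the 1-based numbering of frame starts are choices made here, and the toy input is a single 10mer from Table 2 standing in for a full antigen sequence.

```python
ANCHOR_RESIDUES = set("LMVIA")  # residues named in the text for P2 and the C-terminal position


def scan_frames(sequence, lengths=(9, 10)):
    """Yield (start, peptide) for every frame carrying both primary anchors.

    `start` is the 1-based position of the first residue of the frame.
    """
    for length in lengths:
        for i in range(len(sequence) - length + 1):
            frame = sequence[i:i + length]
            if frame[1] in ANCHOR_RESIDUES and frame[-1] in ANCHOR_RESIDUES:
                yield i + 1, frame


# Toy input: the 10mer ILAKFLHWLM from Table 2, used as a stand-in
# for a complete antigen sequence.
hits = list(scan_frames("ILAKFLHWLM"))
print(hits)
```

Each frame returned by such a scan would then be encoded as descriptor values and passed to the trained network; the scan itself only narrows the candidates to peptides carrying the HLA-A*0201 motifs.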

The results are summarised in Figure 3B, which shows the plots of the experimental and predicted RA values of the 48 peptides (B₁), divided into 37 9mers (B₂) and 11 10mers (B₃).

In each figure the lower left and upper right areas contain peptides correctly predicted according to the threshold RA=0.7. The four parameters, sensitivity (SEN), specificity (SPE), PPV and NPV, define the goodness of the SCIPS model and are calculated as follows: SEN = TP/(TP+FN), SPE = TN/(TN+FP), PPV = TP/(TP+FP), NPV = TN/(TN+FN), where TP (true positive) corresponds to strong binders well predicted, TN (true negative) corresponds to intermediate/weak binders well predicted, FP (false positive) corresponds to predicted strong binders which exhibit, in fact, an intermediate/low affinity and FN (false negative) corresponds to predicted intermediate/weak binders which exhibit in fact a high affinity. 38 of the 48 peptides (79.2%) were well classified with a relatively good correlation between predicted and experimental RA (mean of ΔRA=0.5 and r=0.63) (Fig 3B). Only 6 out of 48 peptides (12.5%) showed a ΔRA>1. For the remaining 42 peptides the mean ΔRA was 0.35 and r=0.83. The classification was efficient for both 9mers (78.4%) (Fig 3B₂) and 10mers (81.8%) (Fig 3B₃). Figure 3B₁ also shows that 82.8% of strong binders were identified with a high probability (PPV=82.8% and NPV=73.7%), confirming results obtained with the internal validation set of peptides. This percentage was 85% and 77.8% for the 9mers and 10mers respectively.
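As a worked check of these four formulas, the short sketch below evaluates them for one set of confusion counts. The counts (TP=24, FN=5, FP=5, TN=14) are not listed in the text; they are inferred here as the whole-number split of the 48 external peptides that reproduces the reported 79.2% correct classification, SEN=82.8%, PPV=82.8% and NPV=73.7%, and should be read as an assumption.

```python
def binder_metrics(tp, fn, fp, tn):
    """Return SEN, SPE, PPV and NPV (as fractions) from confusion counts."""
    return {
        "SEN": tp / (tp + fn),  # strong binders correctly identified
        "SPE": tn / (tn + fp),  # intermediate/weak binders correctly identified
        "PPV": tp / (tp + fp),  # predicted strong binders that are strong
        "NPV": tn / (tn + fn),  # predicted intermediate/weak binders that are so
    }


# Counts inferred from the percentages reported for the external validation
# set; they are an assumption, not values given in the text.
external = binder_metrics(tp=24, fn=5, fp=5, tn=14)
for name, value in external.items():
    print(f"{name} = {100 * value:.1f}%")
```

With these counts the specificity, which the text does not report for the external set, would come out at 73.7%.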

The difference between predicted and experimental RA for the majority of peptides of the external validation set, represented by the mean ΔRA of 0.35, indicates that the SCIPS model does not include some experimentally high affinity peptides as strong binders. Most of these peptides must have a predicted RA ranging from 0.7 to 0.7+0.35 and, therefore, belong to the predicted intermediate affinity group. Using a predicted RA of 1.0 as a new threshold for the identification of strong binders, the sensitivity increased to 96% of the high affinity peptides. It was expected that this higher discrimination threshold would also result in more false positives, but the PPV was nevertheless high at 77%. Hence, the efficiency of the SCIPS model to identify strong binding sequences can be enhanced by raising the predicted RA threshold.

On the other hand, lowering the threshold of predicted RA to 0.3 enabled the elimination of some false positives and the identification of strong binders with a high probability. For a predicted RA=0.3, 91% of predicted strong binders were in fact strong binders and 68% of the high affinity peptides were still correctly classified. The probability of identifying strong binders increased to 100%, as expected, when a threshold of predicted RA=0 was used.
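The trade-off between sensitivity and PPV as the predicted-RA threshold moves can be illustrated with a small sketch. The (experimental, predicted) pairs below are hypothetical values chosen for illustration, not data from Table 2, and the strict inequality used for "strong binder" is an assumption.

```python
def sensitivity_and_ppv(pairs, predicted_cutoff, experimental_cutoff=0.7):
    """Return (sensitivity, PPV) for a given cutoff on the predicted RA.

    `pairs` holds (experimental RA, predicted RA) tuples; a peptide counts
    as a strong binder when its RA lies below the relevant cutoff.
    """
    tp = fp = fn = 0
    for experimental, predicted in pairs:
        is_strong = experimental < experimental_cutoff
        called_strong = predicted < predicted_cutoff
        if is_strong and called_strong:
            tp += 1
        elif called_strong:
            fp += 1
        elif is_strong:
            fn += 1
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical (experimental, predicted) RA pairs, for illustration only.
pairs = [
    (-0.4, -0.2), (0.0, 0.5), (0.3, 0.9), (0.5, 0.2), (0.6, 0.8),
    (1.4, 0.6), (1.7, 1.4), (1.8, 0.9), (0.9, 1.2), (0.2, 0.1),
]
for cutoff in (0.3, 0.7, 1.0):
    sen, ppv = sensitivity_and_ppv(pairs, cutoff)
    print(f"predicted RA < {cutoff}: SEN = {sen:.2f}, PPV = {ppv:.2f}")
```

On this toy set a stricter cutoff raises the PPV at the cost of sensitivity, and a looser cutoff does the opposite, which is the qualitative behaviour reported for the external validation set.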

Correlation between immunogenicity and predicted RA

A crucial aim of the study described here was to create a prediction model able to identify HLA-A*0201 associated immunogenic peptides. Thus, it was important to confirm that predicted strong binders were, in fact, immunogenic while predicted intermediate/weak binders were not. Our interest was particularly focused on false positive and false negative peptides. Eighteen peptides of the external validation set were tested for their immunogenicity in HHD mice. Eight of them were strong (four true positive and four false negative) and ten intermediate/weak (five true negative and five false positive) binders. At least six mice were used for each peptide and results are shown in Figure 4. As expected, seven out of eight high affinity peptides were immunogenic while nine out of ten intermediate/low affinity peptides were not. It is noteworthy that the nonimmunogenic high affinity peptide (hTERT 934) was predicted to be an intermediate/weak binder, while the immunogenic intermediate/weak binder (PSMA 130) was predicted to have a high affinity. To further evaluate the capacity of the SCIPS model to predict immunogenic peptides, the immunogenicities of 111 high affinity (97 true positive and 14 false negative) and 50 intermediate/low affinity (38 true negative and 12 false positive) peptides belonging to the learning set and the external and internal validation sets were assessed (Table 3). None of the peptide sequences were found in the murine equivalent of the human antigen from which they derived. Therefore, the absence of a CTL response could be attributed to their nonimmunogenicity and not to tolerance of the specific CTL repertoire. Each peptide was tested in more than six mice and in at least two independent experiments. As expected, 87.4% of high affinity peptides were immunogenic while 84% of intermediate/low affinity peptides were not.
93 out of 105 immunogenic peptides (88.6%) were predicted to be strong binders and 93 out of 109 peptides predicted to be strong binders (85.3%) were immunogenic. Interestingly, immunogenicity varied significantly between true positive and false negative (p=0.0159) and between true negative and false positive peptides (p=0.0137). 90.7% of true positive but only 64.3% of false negative peptides were immunogenic. Similarly, 7.9% of true negative but 41.6% of false positive peptides were immunogenic.

These results further underscore the capacity of the SCIPS model to predict HLA-A*0201 bound immunogenic peptides. The model allowed the identification of the majority of high affinity immunogenic peptides and also a significant percentage of intermediate/low affinity immunogenic peptides in the antigens studied.

The aim of this work was to create an HLA-A*0201 affinity prediction model capable of selecting strong HLA-A*0201 binders that, according to current immunological dogma, should also be immunogenic. The SCIPS model we describe allows the identification of almost all the high affinity but also a significant percentage of intermediate/low affinity immunogenic peptides. It represents, therefore, a powerful tool for the identification of immunogenic virus and tumour epitopes that could be used for specific vaccination. It is now well documented that virus and tumour antigens contain a large number of immunogenic epitopes (Menendez-Arias et al., 1998; Cibotti et al., 1992). The establishment of the complete immunogenic epitope map of these antigens could be of great immunotherapeutic interest for two reasons. The first derives from the assumption that the efficacy of vaccination depends mainly on the quality of the peptide specific CTL repertoire in terms of CTL frequency and avidity. This repertoire is established during the positive and negative selection in the thymus and its quality is different for one peptide compared with another (Theobald et al., 1997). It seems, therefore, necessary for an efficient vaccination to select, among all the immunogenic peptides, those which correspond to a high quality CTL repertoire. This is particularly important for antitumour vaccination where tumour antigens are non-mutated self-proteins and their specific CTL repertoire is strongly influenced by the mechanisms of negative selection (Disis et al., 1996; Colella et al., 2000; Kast et al., 1994). Second, the identification of a large number of immunogenic epitopes will allow a polyspecific vaccination, which has been demonstrated to be more efficient than a monospecific vaccination (Oukka et al., 1996).

The large majority of high affinity HLA-A*0201 epitopes have the specific anchor and strong residues (L/M/V/I/A) in P2 and C-terminal P (primary anchor motifs). However, only 30% of peptides with primary anchor motifs exhibit a high affinity. This is due to the presence of secondary anchor motifs which are also involved, either favourably or unfavourably, in the peptide-HLA-A*0201 interaction. Extended motifs (primary and secondary anchor motifs) and statistical binding matrices have already been used to perform a search of high affinity immunogenic peptides (Parker et al., 1994; Brusic et al., 1994). Their use is based on the assumption that each amino acid in each position contributes a certain binding energy independent of the neighbouring residues and that the binding of a given peptide is the result of combining the contributions from the different residues. Multiplying the relevant matrix values should then give an indication of the binding affinity of different sequences. Such models have given a number of erroneous binding predictions (Gulukota et al., 1997). The relative failure of these models to predict affinity correctly is due largely to the fact that affinity is not 'individual-residue-dependent' but rather 'individual-residue-independent and sequence-dependent', as suggested by crystallographic data from MHC/peptide complexes and by our previous analysis of peptide/Db interaction.

This dependency on sequence represents a limitation of all alignment-based prediction methods. In fact, the description of a linear sequence in simple amino acid space severely underdetermines the properties that sequence is likely to display in three-dimensional space. When a sequence is reduced to a string of atoms and atom groups for which 2-D descriptors can be calculated, the properties exhibited by the sequence cease to depend a priori on the amino acid as an independent chemical unit, but are integrated along the entire sequence. To further convince the reader that it is the 'descriptor' definition and selection that allows relationships within sequence families to be derived, we compare the behaviour of frequently used algorithms for homology scoring in the context of immunological epitopes (the BiMass score, Parker et al., 1994; and the SYFPEITHI score, Rammensee et al., 1999) with our own description using the SCIPS model. In Figure 5a and b we have respectively plotted the BiMass and SYFPEITHI scores of all the peptides shown in Table 2 for both high and low affinity peptides. As can be clearly seen, there is no correlation between the BiMass and SYFPEITHI scores and the experimental binding affinity; BiMass and SYFPEITHI are unable to discriminate low and high affinity peptides and, as such, would be poor predictors of potential epitopes. By contrast, in Figure 3A we show the same data set of peptides using the descriptor-based neural net described earlier. The discrimination between high and low affinity peptides is clearly visible.

Artificial neural network (ANN) models have been used by others for the prediction of HLA-A*0201-peptide binding affinity (Gulukota et al., 1997; Buus et al., 1999). The unifying characteristic of these previous approaches is that they rely on a residue dependent description of the peptides in which the neural net is trained to discriminate binding from non-binding peptides. As a result, Gulukota et al. (1997) are able to predict less than 40% of strong binders of their internal validation set, with a probability that this is a correct prediction of only 50%. By contrast, the SCIPS model of the present invention described here can identify 86.8% of high affinity peptides with a probability of correct prediction of 94.3% (Fig. 3). More importantly, it is able to predict 88.6% of immunogenic peptides with an equally high probability (85.3%). In this regard it is interesting that some peptides can be identified that have intermediate or low affinity but which are nonetheless immunogenic. Moreover, it is noteworthy that none of these models has proven its efficacy in predicting new binding sequences. The utility of the SCIPS model is not limited to the identification of the majority of high affinity immunogenic HLA-A*0201 associated peptides of an antigen; it may also allow the design of high affinity immunogenic variants of the intermediate/low affinity nonimmunogenic peptides. In previous reports we have demonstrated that intermediate/low affinity peptides can be of great interest in certain cases of tumour and virus immunotherapy (Tourdot et al., 1997). However, their efficient use requires the design of high affinity variants able to generate a strong CTL response. Selecting these peptide variants on the basis of predicted RA=0.3 or even RA=0 would allow the identification of a high affinity subset with an expected success rate close to 100%.

In conclusion, the SCIPS model of the present invention allows for the first time the creation of complete immunogenic epitope maps of tumour and virus antigens (antigen CTL epitope Biomap™). A similar approach is currently being developed for peptides presented by HLA-B*0702 and HLA-A*0301. Along with HLA-A*0201, these three HLA molecules cover 80% of the Caucasian population. The long-term benefits of this strategy would be that a reliable prediction of immunogenicity could be generated from genome data. The sequences from the human genome could be translated to "antigen CTL epitope Biomaps" of potential self-reactivities of autoimmune and antitumour relevance, whereas the sequences from the various microbial and virus genomes could be translated to "antigen CTL epitope Biomaps" of potential interest in vaccine development.

On a more general note, the SCIPS method may also be applied to the analysis of polypeptide sequences. Using a scanning frame of sequences (e.g. 10-15 residues) encoded in property space, any new sequence may be assigned to its correct functional family.

MATERIAL AND METHODS

Peptides. The peptide synthesis was achieved by a classic Fmoc chemistry protocol on an Automated Multiple Peptide Synthesis instrument (AMS 422, ABIMED).

Generation of CTL in HHD mice. HLA-A*0201 transgenic, β2m-/-, Db-/- HHD mice (Pascolo et al., 1997) were injected sc with 100μg of peptide emulsified in incomplete Freund's adjuvant (IFA) in the presence of 140μg of the I-Ab restricted HBVcore 128-140 T-helper epitope. After 11 days, spleen cells (5x10⁷ cells in 10ml) were stimulated in vitro with peptide (10μM). On day 6 of culture, the bulk responder populations were tested for specific cytotoxicity by using uncoated or peptide coated HLA-A*0201 expressing RMAS-HHD murine tumour cells.

Measurement of Peptide Relative Affinity to HLA-A*0201. T2 cells (3x10⁵ cells/ml) were incubated with various concentrations of peptides in serum-free RPMI 1640 medium supplemented with 100ng/ml of human β2m at 37°C for 16 hrs. Cells were then washed twice and stained with the BB7.2 mAb followed by FITC conjugated goat anti-mouse Ig mAb to quantify the expression of HLA-A*0201. For each peptide concentration, the HLA-A*0201 specific staining was calculated as the % of the staining obtained with 100μM of the reference peptide HIVpol 589 (IVGAETFYV). The relative affinity (RA) is determined as: RA = (Concentration of peptide that induces 20% of HLA-A*0201 expression / Concentration of the reference peptide that induces 20% of HLA-A*0201 expression) and is expressed as log₁₀. The lower the RA value, the stronger is the peptide binding to HLA-A*0201. The mean RA value for each peptide was determined from at least three independent experiments. In all experiments, 20% of HLA-A*0201 expression using the reference peptide was obtained at 1-3μM.
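The RA definition above amounts to a log₁₀ ratio of two concentrations, as the following minimal sketch shows. The function name and the example concentrations are hypothetical; only the formula and the 0.7 threshold come from the text.

```python
from math import log10

STRONG_BINDER_CUTOFF = 0.7  # RA threshold used in the text for strong binders


def relative_affinity(peptide_conc, reference_conc):
    """log10 ratio of the concentrations giving 20% HLA-A*0201 expression.

    Both concentrations must be in the same unit (e.g. micromolar).
    """
    if peptide_conc <= 0 or reference_conc <= 0:
        raise ValueError("concentrations must be positive")
    return log10(peptide_conc / reference_conc)


# Hypothetical example: the test peptide needs 4 uM and the reference
# peptide 2 uM to reach 20% of HLA-A*0201 expression.
ra = relative_affinity(peptide_conc=4.0, reference_conc=2.0)
print(f"RA = {ra:.2f}, strong binder: {ra < STRONG_BINDER_CUTOFF}")
```

A peptide that reaches 20% expression at the same concentration as the reference has RA = 0, and each tenfold increase in the required concentration adds 1 to the RA.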

Artificial neural network model (SCIPS model) and structure-property calculations. The approximate 3D structures of the peptides were modelled using known HLA/peptide complexes and superimposed using the SYBYL software (Sybyl 6.6, TRIPOS, St. Louis, Missouri). For each peptide, 60 descriptors available in the software package TSAR® 3.2 (Oxford Molecular plc, Oxford, England) were calculated. A multi-layered feed-forward neural network with error back-propagation training algorithm (TSAR® 3.2, Oxford Molecular Ltd, Oxford, England) was applied to RA binding predictions using the values of the 22 descriptors selected from the initial 60 (see Results section). Multi-layered feed-forward networks are highly non-linear tools for function approximation. A summation of the combined inputs is used to predict the output values via a transfer function. In this study we used a three-layer, fully connected architecture. The parametric model represented by this network can be mathematically formulated as:

$$y_m = \sum_{n=1}^{N} w_{nm}\, f\!\left( \sum_{k=1}^{K} W_{kn}\, x_k - \theta_n \right), \qquad m = 1, \ldots, M$$

where K is the number of input nodes, N the number of hidden nodes and M the number of output nodes. x_k is the output of the input node k, θ_n is the bias of the input of hidden node n, W_kn is the weight connecting input node k to hidden node n, w_nm is the weight connecting hidden node n to output node m, and f is the activation function. The neural network implementation in TSAR® uses an identity activation function. For our set of peptides, the artificial neural network calculated the difference between the predicted RA and the experimental values. This difference is used to adjust the weights in the hidden layers and to minimize the overall error. For testing the predictive ability of the SCIPS model, 30% of the input data were excluded from the learning set and used as an internal validation set.
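A forward pass through such a three-layer network can be sketched as below. The sketch assumes the form y_m = Σ_n w_nm · f(Σ_k W_kn · x_k − θ_n), which is one reading of the symbol definitions given above, and it uses the identity activation as its default; the weights and inputs are invented for illustration and do not come from the trained SCIPS model.

```python
def forward(x, W, theta, w, f=lambda z: z):
    """Evaluate y_m = sum_n w[n][m] * f(sum_k W[k][n] * x[k] - theta[n]).

    x     : K input values (descriptor values)
    W     : K x N input-to-hidden weights, indexed W[k][n]
    theta : N hidden-node biases
    w     : N x M hidden-to-output weights, indexed w[n][m]
    f     : activation function (identity by default)
    """
    n_hidden = len(theta)
    hidden = [
        f(sum(W[k][n] * x[k] for k in range(len(x))) - theta[n])
        for n in range(n_hidden)
    ]
    n_out = len(w[0])
    return [sum(w[n][m] * hidden[n] for n in range(n_hidden)) for m in range(n_out)]


# Hypothetical 2:2:1 network (the text describes a 22:3:1 architecture).
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
theta = [0.5, 0.5]
w = [[1.0], [2.0]]
print(forward(x, W, theta, w))  # [3.5]
```

For the 22:3:1 architecture, `x` would hold the 22 descriptor values of one peptide, `W` would be 22 x 3, `theta` would have three entries, `w` would be 3 x 1, and the single output would be the predicted RA.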

Table 1. Amino acids occurrence at each sequence position in the HLA peptide database.

     P1   P2   P3   P4   P5   P6   P7   P8  P9*  P9**  P10
A    11    2   12   20   11    8   17   12    5     5    3
G     9    0   10   24   13   18   13    3    0     0    0
L    16  133   30    5   18   20   26   28   57     7   18
I     6    8    3    5    8    8    6    1    8     2    5
M     3   17    8    2    0    4    3    4    2     0    0
V     9   10    8    8    9   18    8    9   43     0   22
T     3    3   16    4   13   10   10   11    0     5    6
E     2    0    5   20   17    8    4   10    0     4    0
D     2    0    8   16    9    7    5    1    0     4    0
Q     7    1   19    8   11    5    8    9    0     1    0
K     7    0    3    3    3    1    5    6    0     0    0
Y    57    0    9    0   13    4    2    6    0    10    0
F     7    0   11    4    5    9   20   16    0     9    0
W     1    0    6    4    8    0    3    4    0     2    0
P     3    0   11   13    4    7    3    5    0     0    0
S     9    0    6   19   10    7   17   14    0     3    0
H     2    0    2    3   10    2    3   13    0     0    0
N     4    0    4    3    5   12   12    4    0     1    0
R    14    0    0   11    7   16    7   16    0     4    0
C     2    0    3    2    0   10    2    2    2     0    3

* C-terminal P9 of 9mers
** P9 of 10mers

Table 2. List of peptides selected by the SCIPS model. Each peptide shows a predicted relative affinity (see Methods) and its BiMass score.

Antigen                   Sequence      RA             RA          BiMass
                                        experimental   predicted   score
hTERT [illegible] (10)    ILAKFLHWLM     0.5            0.7          63
hTERT 122                 YLPNTVTDA      0.0            0.5          52
hTERT 122 (10)            YLPNTVTDAL     0.3            0.9          48
hTERT 381                 RLPQRYWQM      0.8            0.9          56
hTERT 407                 VLLKTHCPL      0.3           -0.2         134
hTERT 496 (10)            SLGKHAKLSL     1.7            1.8          21
hTERT 511 (10)            KMSVRGCAWL     1.4            1.4         297
hTERT 540                 ILAKFLHWL     -0.4            1.3        1745
hTERT 544 (10)            FLHWLMSVYV    -0.7           -0.1        1796
hTERT 547 (10)            WLMSVYVVEL    -0.5           -0.2         835
hTERT 548                 LMSVYVVEL     -1.0           -0.1          60
hTERT [illegible]         LLTSRLRFI      0.9            1.0          45
hTERT 772                 YMRQFVAHL      0.5            0.5          47
hTERT 871 (10)            LLVTPHLTHA     0.4           -0.1          19
hTERT 926 (10)            GLFPWCGLLL    -0.3           -1.1          79
hTERT 934                 LLDTRTLEV     -0.4            1.0          47
hTERT 969                 NMRRKLFGV      1.1            0.5          51
hTERT 1005 (10)           LLLQAYRFHA     0.7            0.5         181
hTERT 1006                LLQAYRFHA      0.3           -0.5          49
hTERT 1016 (10)           VLQLPFHQQV     0.2            0.2         449
hTERT 1072                WLCHQAFLL     -0.1           -0.3         569
hTERT 1078 (10)           FLLKLTRHRV     0.3            0.8        1183
HER-2/neu 83              YVLIAHNQV      0.6            0.1         103
HER-2/neu [illegible]     QLFEDNYAL     -0.3            0.2         324
HER-2/neu [illegible]     YMIMVKCWM      1.6            1.5          90
HER-2/neu 1172            TLSPGKNGV      1.7            1.7          69
PSMA 20                   WLCAGALVL      0.7            0.0          40
PSMA 130                  IINEDGNEI      1.7            0.1          10
PSMA 193                  KINCSGKIV      1.8            1.4          14
PSMA 200                  IVIARYGKV      1.8            1.4           2
PSMA 260                  NLNGAGDPL      1.8            0.4          10
PSMA 286                  AVGLPSIPV      0.1            0.2†          6
PSMA 354                  RIYNVIGTL      1.7            1.4           3
PSMA 473                  LVHNLTKEL      1.7            1.5           3
PSMA 582                  MVFELANSI      0.7            0.2          23
PSMA 660                  VLRMMNDQL      1.4            1.4           1
PSMA 711                  ALFDIESKV      0.2†           0.2        1054
PSMA 733                  YVAAFTVQA      0.0            0.0           7
NPM/ALK 4                 SMDMDMSPL      0.7            0.3†         14
NPM/ALK 41                QLSLRTVSL      1.4            0.3          21
NPM/ALK 64                AMNYEGSPI      0.4            0.3†          7
NPM/ALK 74                KVTLATLKM      1.7            1.6           1
NPM/ALK 125               ELQAMQMEL      1.8            1.4           2
NPM/ALK 204               PLQVAVKTL      1.4            1.0           1
NPM/ALK 241               CIGVSLQSL      1.8            0.3           7
NPM/ALK 614               SLLLEPSSL      0.7            0.1          79
NPM/ALK 650               GLPLEAATA      0.5            0.5           5
NPM/ALK 667               TILKSKNSM      1.7            1.7           2

(10) denotes a 10mer; all other peptides are 9mers.
† A character preceding this value is illegible in the source text and may be a minus sign.

Table 3. Relationship between immunogenicity and predicted RA.

Peptides                      Immunogenicity    % of immunogenic
                                +       -
High affinity                  97      14         87.4
  True positive                88       9         90.7      p=0.0159
  False negative                9       5         64.3
Intermediate/low affinity       8      42         16
  True negative                 3      35          7.9      p=0.0137
  False positive                5       7         41.6

The 161 peptides tested for their immunogenicity belong to the training and the validation (internal and external) sets. The chi-square test was used to compare immunogenicity between TP and FN and between TN and FP.

References :

Altschul SF et al J. Mol. Biol. 1990, Vol.215, p403.

Altschul SF et al Nucleic Acids Res. 1997, Vol.25, p3389.

Bakker, A.B.H., et al Int. J. Cancer. 1997. 70, 302-309.

Blundell, T Nature 1987, Vol.326, p347.

Bork P & Gibson TJ Meth. Enzymol. 1996 Vol.266 p162.

Bray JE, et al Protein Eng. 2000, Vol.13, p153.

Brusic, V., et al Prediction of MHC binding peptides using artificial neural network. In Complex Systems: Mechanism of Adaptation. Ed R.J. Stonier and X.H. Yu, IOS Press 1994.

Buus, S. et al Curr. Opinion Immunol. 1999, 11, 209-213.

Chothia, C. Nature 1992, Vol.357, p543.

Cibotti, R., et al Proc. Natl. Acad. Sci. USA. 1992. 89: 416-420.

Colella, T.A., et al J. Exp. Med., 2000, 7, 1221-1231.

delaCruz, Thornton JM Protein Science 1999, Vol.8, p750.

Deleage, G, et al, 1997, Biochimie 79; 681-686.

Disis M.L., et al J. Immunol. 156, 3151-3158, 1996.

Dubchak I et al Microbial. Comp. Genomics 1998 Vol.3 p543.

Fischer D & Eisenberg D Curr. Opin. Struct. Biol. 1999 Vol.9 p208.

Garnier J et al Meth. Enzymol. 1996 Vol.266 p540.

Grassy et al Nature Biotechnology 1998 Vol.16 p748.

Gribskov M et al Meth. Enzymol. 1990 Vol.183 p146.

Gulukota, K., et al J. Mol. Biol., 1997, 267, 1258-1267.

Holm, L & Sander, C. Nucleic Acid Res. 1994. Vol.22. p3600.

Kast, W.M., et al J. Immunol., 1994, 152, 3904-3912.

Laskowski RA et al J. Applied Cryst. 1993 Vol.26 p283.

Madden, D.R., et al Cell, 1993, 75, 693-708.

Menendez-Arias, L., et al Viral Immunol 1998, 11, 167-181.

Murzin AG et al J. Mol. Biol. 1995 247, 536-540 and (http://scop.mrc-lmb.cam.ac.uk/scop/index.html)

Needleman SB & Wunsch CD J. Mol. Biol. 1970 Vol.48 p443.

Orengo et al Curr. Opin. Struct. Biol. 1999 Vol.9 p374.

Orengo et al Nature 1994 Vol.372. p631.

Oukka, M et al. J. Immunol. 1996. 157: 3039-3045.

Parker, K.C., et al J. Immunol., 1994, 152; 163-175.

Pascolo, S., et al J. Exp. Med. 1997. 185: 2043-2051.

Pearson W & Lipman DJ Proc. Natl. Acad. Sci. USA 1988 Vol.85, p2444.

Rammensee, H.G., et al, Curr. Opinion Immunol. 1994. 6: 13-23.

Rammensee, H.G. et al, Immunogenetics, 1999, 50, 213-219.

Riley M Microbiol. Rev. 1993 Vol.3 p862.

Rost B & Sander C J. Mol. Biol. 1995 Vol.23 p295.

Ruppert, J., et al Cell. 1993. 74: 929-937.

Saqi MAS et al Protein Eng. 1998 Vol.11, 627.

Sette, A., et al. J. Immunol. 1994. 153: 5586-5592.

Smith TF & Waterman MS J. Mol. Biol. 1981 Vol.147 p195.

Tetko, I., et al J. Med. Chem. 1993, 36, 811-814.

Theobald, M., et al J. Exp. Med. 1997. 185: 833-841.

Thompson, J et al, Nucl. Acid Res. 1994, 22; 4673-4680.

Thornton J et al Nature 1991 Vol.354 No.6349 pp.105-106.

Tourdot, S., et al J. Immunol. 1997. 159: 2391-2398.

van der Burg, et al. J. Immunol. 1996. 156: 3308-3314.

Wallin E & Heijne GV Protein Science 1998 Vol.7 p1029.

Claims

CLAIMS :

1. A method for determining whether a query protein sequence has a functional property of interest, which method comprises :

(i) providing a dataset of proteins which share the functional property of interest;

(iv) determining from the encoded descriptor parameters a set of descriptor parameters and their values which describe each frameset and which are indicative of said functional property of interest; and

(v) scanning a query protein sequence for the presence of a frameset which matches said set of descriptor parameters.

2. A method according to claim 1, wherein the dataset proteins contain regions of sequence homology associated with said functional property.

3. A method according to claim 1 or 2, wherein said descriptor parameters are determined by a computational model which comprises a neural network.

4. A method according to claim 3, wherein the inputs to the computational model do not include peptide residue sequence data.

5. A method according to any one of claims 1 to 4, wherein the set of descriptor parameters comprises between 10 and 30 such parameters.

6. The method of any one of claims 1 to 5 wherein the frameset is from 8 to 40 amino acids in length.

7. The method of any one of claims 1 to 6 wherein the frameset is discontinuous.

8. A computer system which is operatively configured to implement the method of any one of claims 1 to 7.

9. Computer software code for implementing the method of any one of claims 1 or 7.

10. A computer programming product carrying the software code of claim 9.