WO2003015001A2 - Method for identification of protein function - Google Patents

Method for identification of protein function

Info

Publication number
WO2003015001A2
Authority
WO
WIPO (PCT)
Prior art keywords
peptides
sequence
protein
frameset
sequences
Prior art date
Application number
PCT/GB2002/003244
Other languages
French (fr)
Other versions
WO2003015001A3 (en)
Inventor
Elie Giraud
Jérôme GOMAR
Konstandino Kosmatopoulos
Roger Lahana
Anthony Rees
Original Assignee
Synt:Em S.A.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synt:Em S.A. filed Critical Synt:Em S.A.
Publication of WO2003015001A2 publication Critical patent/WO2003015001A2/en
Publication of WO2003015001A3 publication Critical patent/WO2003015001A3/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention relates to methods of determining functions of protein sequences using computational methods.
  • the functional signatures may not map directly onto structure, so that dissimilar structures may have functional similarity (e.g. bacterial and mammalian serine proteases).
  • three dimensional structural information may be insufficient and additional information, such as sequence motifs, common residue clusters or characteristic surface properties, will be required (see Orengo et al, 1999).
  • PROSITE, a method for detecting active sites and patterns in protein sequences (Bairoch, 1991); PFAM/SCOP analysis, a protein domain assignment method (Murzin et al, 1995); COGs (Clusters of Orthologous Groups) analysis, a method of protein function prediction; Superfamily assignments; Functional categories assignment
  • fold assignment by sequence similarity to protein of known 3D structure.
  • the reported fraction of fold assignments in the various genomes amounts only to about 10-20% of sequences.
  • fold assignment includes the steps of:
  • step 6 Checking the 'goodness' of the postulated model (e.g. using PROCHECK).
  • the process may be repeated from step 1 one or more times until step 6 is acceptable.
  • sequence profiling (Gribskov et al, 1990)
  • sequence motif searching (Bork & Gibson, 1996)
  • Sequence profile methods use evolutionary information from neighbouring sequences in the sequence database to build a profile.
  • An iterative sequence profile method able to detect distant relationships is PSI-BLAST.
  • a further method is fold assignment, or 'threading'. These methods explicitly incorporate structural information from available 3-D protein structures. In many cases these methods can detect distantly diverged proteins as well as unrelated proteins with a similar fold (see PROTEINS: 23(5), 1995; Suppl 1, 1997 and Suppl 3, 1999 for further details of methods and results). However, unless at least one 3D protein structure of the family is known, the method will be unable to assign the new sequence to a structural or functional family.
  • Fold assignment methods are at least as successful as sequence profiling methods and, in addition, are able to assign another 10-15% of open reading frames (ORFs) from genome sequencing projects. Furthermore, some of the predictions from fold assignment methods are not detectable using sequence based methods. Conversely, sequence based methods sometimes identify distant relationships that fold assignment methods do not detect. This is because the sequence methods incorporate evolutionary information from neighbouring sequences whereas traditional fold methods typically do not.
  • ORFs open reading frames
  • Another problem in relation to protein structure is in providing an accurate and reliable theoretical method to identify peptides that bind with high affinity to HLA molecules.
  • the identification of tumour and virus immunogenic epitopes is of great importance for the design of tumour and virus vaccines.
  • the most common property of all the immunogenic peptides is their high affinity for the HLA molecule (Sette et al, 1994; Oukka et al, 1996; van den Burg et al, 1996; Tourdot et al, 1997).
  • a reliable theoretical method to identify peptide sequences within a given antigen that bind strongly to HLA would, therefore, be of great utility for the selection of immunogenic peptides, provided it is both efficient and accurate.
  • Affinity for HLA principally depends on the allele-specific pattern of conserved residues at particular positions in the peptide, the primary anchor motifs (Rammensee et al, 1995; Engelhard, 1994). Although the large majority of immunogenic epitopes possess the allele-specific primary anchor motifs, the presence of these motifs is not a sufficient condition for a peptide to show strong binding. Secondary anchors and deleterious residues at non-conserved positions also influence the peptide-HLA interaction (Ruppert et al, 1993).
  • Grassy et al disclose a method for computer-assisted rational design of immunosuppressive compounds.
  • the reference describes the analysis of a set of peptides for immunosuppressive activity.
  • a learning set of inactive and active peptides were analysed by a range of topological descriptors, and a set of topological descriptors for the active set of peptides was defined.
  • the descriptors were used to screen a virtual combinatorial library of peptides which was generated based on a partially randomised 10mer consensus sequence in which positions 1, 5 and 10 were fixed as arginine, arginine, and tyrosine respectively.
  • the method utilises a residue independent computational model (SCIPS, for Sequence Comparison In Property Space) whose inputs are not sequences but physicochemical and topological parameters derived from those sequences.
  • SCIPS residue independent computational model
  • the invention provides a method for determining whether a query protein sequence has a functional property of interest, which method comprises:
  • the invention provides a method for determining whether a query protein sequence has a functional property of interest, which method comprises: (i) providing a dataset of proteins which share the functional property of interest;
  • a further aspect of the invention provides a computer system which is operatively configured to implement the method of the first or second aspect.
  • a computer system we mean the hardware means, software means and data storage means used to determine whether a query protein sequence has a functional property of interest according to the present invention.
  • the minimum hardware means of a computer-based system of the present invention typically comprises a central processing unit (CPU), working memory and data storage, and e.g. input means, output means etc.
  • the data storage may comprise magnetic storage media such as floppy discs, hard disc storage medium and magnetic tape; optical storage media such as optical discs or CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. Examples of such systems are microcomputer workstations available from Silicon Graphics Incorporated and Sun Microsystems running Unix based, Windows NT or IBM OS/2 operating systems.
  • Figure 1 shows a flow chart which schematically illustrates the invention.
  • Figure 2 shows histograms and corresponding normal distributions of the measured relative affinity (RA) values for 120 immunogenic and 72 non immunogenic sequences.
  • Figure 3A & B shows data relating to validation of the model
  • Figure 4 shows the immunogenicity of peptides belonging to the external validation set.
  • Figures 5a and b respectively show distributions of the BiMass and SFPEITHI scores for the high and low affinity peptides of the external validation set.
  • the protein sequence may be of any species origin, including human, primate, mammalian, vertebrate, insect, yeast, eukaryotic or prokaryotic.
  • the sequence may be a confirmed protein sequence (i.e. based on direct protein sequence or a translation of an mRNA) or a hypothetical protein sequence based upon an open reading frame of genomic DNA, or a de novo designed sequence.
  • Dataset of Proteins
  • the dataset of proteins may be a set of proteins associated with the functional property of interest.
  • the proteins may also be of any species origin, including human, primate, mammalian, vertebrate, insect, yeast, eukaryotic or prokaryotic.
  • the dataset may be limited to proteins of a single species or it may comprise sequences from different organisms.
  • the dataset may further comprise synthetically generated sequences, for example sequences which have been synthesised in a random or semi-random manner and selected to have the property of interest.
  • the proteins may be full length sequences, partial sequences or peptides, or mixtures thereof. Where the sequences are partial sequences or peptides, these will be of sufficient length to be associated with the functional property. This will usually be at least 5, more generally at least 8, such as at least 9 or 10 amino acids in length.
  • the size of the dataset will be dependent upon a number of circumstances, including the number of sequences available in the art and the discrimination achieved by the topological parameters. Usually, it is desirable that the dataset comprises at least 5, preferably at least 10, more preferably at least 20 members. The maximum size of the dataset will be dependent only upon the numbers of sequences available and the computational power available for their analysis. However, a dataset of up to 500 members is feasible.
  • proteins in relation to the dataset is used to mean any polypeptide sequence, including short sequences (e.g. 5 or more amino acids) which are often referred to in the art as “peptides” or “polypeptides”.
  • Functional Property of Interest. This is any property associated with a set of sequences and known to a person of skill in the art.
  • Such properties include domains with enzymatic function, e.g.
  • kinase, phosphatase or acetylase activity
  • domains which bind target molecules such as other proteins or DNA
  • domains which act as agonists or antagonists of biological function; proteins involved in signal transduction such as GPCRs, intracellular effector proteins and families as defined in the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/).
  • the dataset is one which the proteins contain regions of sequence homology associated with said functional property of interest.
  • sequence homology it is meant that those of skill in the art operating available sequence alignment algorithms are able to determine at least one region in the members of the dataset whose sequences contain regions of alignment of statistical significance.
  • the algorithm BLAST, mentioned above, may be used with default parameters to determine regions of sequence homology of at least 30% identity. This level of homology is often associated with a common evolutionary origin and hence related function of the protein.
  • the region of homology may not be across the entire length of the protein, but within a frameset of the protein, as defined herein.
  • Proteins which share a functional property of interest may do so because of a region of sequence within the protein which imparts such a property to the protein. This is well known as such by those of skill in the art. For example, DNA-binding domains, zinc-finger domains, transmembrane domains, signal sequences and the like are found in subdomains of proteins and often such a subdomain may be transferred ("donated") to other "recipient" protein sequences to transfer the property from donor to recipient.
  • the property may be discontinuous within the protein.
  • the target binding of an antibody variable domain is primarily determined by the properties of three hypervariable regions in each of a light and heavy chain variable domain which are separated by framework regions.
  • the frameset may be continuous or discontinuous to cover 2 or 3 different regions.
  • the frameset is the region from which the descriptor parameters are determined.
  • the frameset will define sequences of substantially the same length, although some variation in sequence length is permitted. This is because functional domains shared by proteins will often differ in length to some extent.
  • the actual size of the frameset will be determined by factors which will vary in each case, taking account of the functional property of interest and the capacity of the computational model.
  • the frameset is a 9mer-10mer size.
  • the frameset may be in this range, or greater, for example in the range of from 8 to 50 amino acids, such as from 8 to 40, for example from 10 to 25 amino acids.
  • the frameset is the same size as the peptides of the proteins of the dataset.
  • the frameset may be defined as a sequence within a longer protein sequence.
  • a dataset of proteins which share a common property of interest may be aligned so that the region of each protein associated with the property is shown aligned, irrespective of where the region appears in relation to the rest of the protein in which it is located (see Figure 1 box 1). This region is then used to determine a frameset, for which descriptors are calculated.
  • Physicochemical and Topological Descriptors. This includes any physical or chemical property of a molecule, including topological parameters of the molecule. It may include either properties which are "static" (at least in a time-averaged sense) such as the dipole moment of a molecule, and/or "dynamic", such as ones characterising the range of conformations through which the molecule may flex over a period of time. In the case of some molecules, the flexing of the molecule over time can be determined with high accuracy using modern molecular modelling techniques. In the present context, a frameset of a protein is also considered to be a molecule for which descriptor parameters can be measured or calculated.
  • examples of descriptors include molar mass, ellipsoidal volume, molar volume, lipophilicity, dipole moment, total number of N or C or O atoms, number of methyl groups, and various topological, connectivity and shape indices, such as the Wiener or Balaban indices, the Kier Chi indices and the like. These and many other descriptors are defined in the "Tsar Reference Guide", published by Oxford Molecular (2000).
  • Figure 1 outlines an embodiment of the invention.
  • a set of protein sequences (shown by six parallel lines) each contain a region of homology (thick line) which is associated with a function common to all sequences.
  • the region of homology defines a frameset.
  • the parameters of the frameset are encoded to provide a plurality of descriptors (step 2 of Figure 1).
  • active sequences a plurality of sequences which are not associated with the property are also encoded.
  • Inactive sequences will be of substantially the same size as the frameset, and similar in number. For example, where the number of protein sequences having activity is a number x, then from x/2 to 2x inactive sequences will likewise be encoded.
  • the descriptors are analysed in order to determine a set of descriptors and their values which describe the frameset (step 3 of Figure 1).
  • the descriptors and their values are not common to the set of inactive sequences.
  • a way of analysing and selecting the descriptors is as follows:
  • Intercorrelated descriptors are first either removed by using standard statistical practices, or decorrelated through algorithms known in the art like principal component analysis (PCA) or Gram-Schmidt orthogonalisation.
  • PCA principal component analysis
  • Gram-Schmidt orthogonalisation
  • 2- From the descriptors obtained in the previous step, a set of descriptors is selected in whose space the set of active sequences is well separated from the set of inactive sequences. The ideal situation is when there is no overlap between the two regions of the space.
  • a person of skill in the art has various ways to handle cases where there is a certain overlap and to determine the smallest possible set of descriptors exhibiting the smallest possible overlapping region.
  • Such descriptor space is not necessarily a linear one and various computational techniques are available in the art to perform such an analysis, including neural networks, genetic algorithms, partial least squares (PLS), fuzzy logic and the like.
  • the precise means of determination may be selected by a person of skill in the art, taking account of the nature and number of sequences to be analysed and personal preference. In the accompanying example we have used a neural network, and this is preferred.
  • the process will provide a set of descriptor parameters for a frameset which are indicative of the property of interest.
  • the number and nature of the descriptors selected will depend upon the frameset and property selected, and will differ on a case-by-case basis. Usually, about 15-40 descriptors will be selected, though this is not fixed.
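The first selection step described above (removing intercorrelated descriptors) can be sketched as a greedy correlation filter. This is only an illustration under stated assumptions, not the procedure used in the patent: the function names `pearson` and `drop_intercorrelated` are hypothetical, the descriptor values are made up, and the 0.7 cut-off is borrowed from the correlation limit mentioned in the worked example of this document.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant descriptor is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_intercorrelated(descriptors, threshold=0.7):
    """Keep a descriptor only if |r| with every already kept one is below threshold.

    `descriptors` maps a descriptor name to its values over all sequences.
    """
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor values for four framesets (illustration only).
example = {
    "molar_mass": [1.0, 2.0, 3.0, 4.0],
    "molar_volume": [2.1, 3.9, 6.2, 8.0],    # almost proportional to molar_mass
    "dipole_moment": [1.0, -1.0, -1.0, 1.0],
}
selected = drop_intercorrelated(example)
```

A greedy filter depends on the order in which descriptors are visited; PCA or Gram-Schmidt orthogonalisation, which the text also names, decorrelate without discarding descriptors outright.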
  • the method then comprises scanning a query protein sequence for the presence of a frameset which has parameters for the selected descriptors which match those which are indicative of the property.
  • the frameset is shown as a box which is moved stepwise (e.g. 1 amino acid at a time) along the protein sequence, wherein the values of the selected descriptors are calculated for the residues within each box.
  • stepwise e.g. 1 amino acid at a time
  • the scanning means inspecting successive regions of the sequence.
  • the scanning may be directed to pre-selected regions, or linearly along the protein sequence in steps of more than one amino acid.
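The scanning step can be pictured as a sliding window over the query sequence. The sketch below is an assumption-laden illustration: the function names are hypothetical, the single descriptor (mean Kyte-Doolittle hydropathy) is a stand-in that the patent does not name, and the acceptance range passed to `scan` is arbitrary. The source scores each window with a trained model over many descriptors rather than with one range test.

```python
# Kyte-Doolittle hydropathy scale, used here only as a stand-in descriptor.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def framesets(sequence, size, step=1):
    """Yield (start, window) for each frameset-sized window along the sequence."""
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


def mean_hydropathy(window):
    """Average hydropathy of the residues inside one window."""
    return sum(KYTE_DOOLITTLE[res] for res in window) / len(window)


def scan(sequence, size, descriptor, low, high, step=1):
    """Return the start positions whose descriptor value lies in [low, high]."""
    return [
        start
        for start, window in framesets(sequence, size, step)
        if low <= descriptor(window) <= high
    ]


# Scan a 10-residue query with a 9-residue frameset, one residue at a time.
windows = list(framesets("ILAKFLHWLM", 9))
hits = scan("ILAKFLHWLM", 9, mean_hydropathy, low=-10.0, high=10.0)
```

Setting `step` above 1 corresponds to the coarser scan mentioned in the text; restricting the scan to pre-selected regions would amount to calling `scan` on sub-sequences.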
  • the invention thus provides for the provision of novel peptides which have a property of interest, as well as for the identification of sequences which have a property of interest but which do not have sufficient sequence homology for this property to be identified by conventional methods.
  • a further aspect of the invention is a peptide obtained by the process of the invention, as well as a composition comprising said peptide plus a pharmaceutically acceptable carrier or diluent.
  • Peptides of particular interest which may be obtained in accordance with the invention include peptides which bind HLA Class I or Class II antigens, receptors and/or their cognate ligands, enzymes and protein-protein interaction inhibitors.
  • This example illustrates the invention in relation to a method for the prediction of HLA-A*0201 affinity, the selection of immunogenic HLA-A*0201 bound peptides, and the experimental test of the predictions.
  • a unique feature of this computational method is that it performs the selection in "property space" rather than sequence space and, as such, is capable of finding 'family' relationships among groups of peptides that are not identifiable using conventional sequence comparison methods.
  • the model operates by the simulation of an artificial neural network whose inputs are not amino acid sequences but physicochemical and topological descriptors derived from those sequences.
  • the model can identify 86.8% of high affinity peptides with a probability of correct prediction of 94.3%. More importantly, it is able to predict 88.6% of immunogenic peptides with an equally high probability (85.3%), leading to the possibility of creating an almost complete immunogenic epitope map of any tumour or virus antigen.
  • the aim of this study was to create a computational model for the identification of peptides exhibiting affinities for HLA-A*0201 sufficiently high to ensure their immunogenicity. It was therefore necessary to define the affinity threshold discriminating immunogenic from nonimmunogenic peptides under the experimental conditions of the HLA-A*0201 affinity measurements. 192 peptides with various HLA-A*0201 affinities were tested for their capacity to elicit a CTL response in HHD mice.
  • HLA-A*0201 transgenic mice such as HHD and A2/Kb mice
  • HHD and A2/Kb mice are also immunogenic in humans and, conversely, that peptides nonimmunogenic in HLA-A*0201 transgenic mice are nonimmunogenic in humans
  • Each peptide was tested in more than twelve HHD mice in several independent experiments.
  • Peptides were considered immunogenic when a) the specific lysis of induced CTL was at least 15% above the nonspecific lysis and b) specific CTL were generated in more than 20% of primed mice.
  • 172 peptides extracted from 18 antigens were included in the database for the training of the neural network.
  • 110 peptides were 9mers and 62 peptides were 10mers.
  • Their sequence characteristics are illustrated in Table 1. Except at the anchor positions P2 and P9/10, occupied by any of L/M/I/V/A in the large majority of peptides (91.3%), there was a fair representation of all 20 amino acids.
  • the architecture of the back propagation neural network, the transfer parameters and the convergence RMS, necessary to obtain good generalized performances, were optimised by trial and error with the help of the internal validation set formed by a random choice of 30% of the database. Numerous combinations of 60 descriptors were tested and an iterative selection procedure was followed by displaying the dependencies of the output variables on each input (descriptor) variable. For each descriptor combination, particular attention was paid to exclude combinations exhibiting a correlation of 0.7 or higher. Moreover, care was taken to keep the network sufficiently small in terms of the number of weights to be computed. In practical terms, the ratio $\rho = \text{(number of input samples)} / \text{(number of weights to be evaluated)}$ was kept within the range $1.8 \le \rho \le 2.2$ (Tetko et al., 1993).
  • a network comprising three neurones in the hidden layer, a noise level equal to 0.03 and a convergence rate of 0.03 was selected as the final model since the predictions allowed us to obtain the best simulation results.
  • 120 peptides were used as the learning set and the remaining 52 peptides (39 × 9mers and 13 × 10mers) were used as the internal validation set.
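The samples-to-weights ratio that constrains the network size can be checked with a few lines of arithmetic. The sketch below makes several assumptions that the text does not confirm: the learning set of 120 peptides is taken as the number of input samples, bias terms are not counted as weights unless `include_biases` is set, the bounds 1.8 and 2.2 are treated as inclusive, and the figure of 20 input descriptors with a single output node is purely illustrative. The function names are hypothetical.

```python
def count_weights(n_inputs, n_hidden, n_outputs, include_biases=False):
    """Number of connection weights in a fully connected three-layer network."""
    weights = n_inputs * n_hidden + n_hidden * n_outputs
    if include_biases:
        weights += n_hidden + n_outputs  # one bias per hidden and output node
    return weights


def sample_to_weight_ratio(n_samples, n_inputs, n_hidden, n_outputs,
                           include_biases=False):
    """Ratio of training samples to weights to be evaluated."""
    return n_samples / count_weights(n_inputs, n_hidden, n_outputs, include_biases)


def within_range(rho, low=1.8, high=2.2):
    """True when the ratio falls inside the stated range (bounds assumed inclusive)."""
    return low <= rho <= high


# 120 learning peptides and 3 hidden neurones come from the text;
# 20 inputs and 1 output are assumed for illustration.
rho = sample_to_weight_ratio(n_samples=120, n_inputs=20, n_hidden=3, n_outputs=1)
acceptable = within_range(rho)
```

Under these assumptions the constraint effectively caps the number of descriptors that can be fed to a three-neurone hidden layer for a given learning set size.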
  • SEN sensitivity
  • SPE specificity
  • NPV negative predictive value
  • the ideal theoretical model for the identification of high affinity peptides must combine a high sensitivity (detection of a high percentage of strong binders), a high specificity (good discrimination between strong and intermediate/weak binders), high PPV (low percentage of false positive peptides) and a high NPV (low percentage of false negative peptides).
  • HLA-A*0201 motifs because these peptides are likely to be HLA-A*0201 epitopes. From a total of 556 peptides having the HLA-A*0201 motifs 135 peptides were predicted to have a high affinity (50 hTERT, 45 HER-2/neu, 24 PSMA, 16 NPM/ALK) and 421 peptides were predicted to have an intermediate/low affinity (185 hTERT, 158 HER-2/neu, 41 PSMA, 37 NPM/ALK). 48 of the 556 peptides were randomly selected to be tested for their affinity in blind experiments (Table 2).
  • Figure 3B shows the plots of the experimental and predicted RA values of the 48 9/10mer peptides (B1), divided into 37 9mers (B2) and 11 10mers (B3).
  • SPE = TN/(TN+FP)
  • PPV = TP/(TP+FP)
  • NPV = TN/(TN+FN), where TP (true positive) corresponds to strong binders well predicted
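The four performance measures can be computed directly from the confusion counts. In the sketch below the SPE, PPV and NPV formulas follow the definitions listed above; the sensitivity formula TP/(TP+FN) is the standard definition and is an assumption here, since it does not survive in this text. The function name `binding_metrics` and the example counts are hypothetical.

```python
def binding_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and predictive values from confusion counts."""
    return {
        "SEN": tp / (tp + fn),  # strong binders that are detected (assumed standard form)
        "SPE": tn / (tn + fp),  # intermediate/weak binders correctly rejected
        "PPV": tp / (tp + fp),  # predicted strong binders that really are strong
        "NPV": tn / (tn + fn),  # predicted weak binders that really are weak
    }


# Made-up counts for illustration only.
metrics = binding_metrics(tp=8, fp=2, tn=6, fn=4)
```

Changing the predicted RA threshold trades sensitivity against PPV, which is why the text reports both when a new threshold is tried.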
  • a predicted RA of 1.0 as a new threshold for the identification of strong binders, the sensitivity increased to 96% of the high affinity peptides. It was expected that this higher discrimination threshold would also result in more false positives, but the PPV was nevertheless high at 77%.
  • the efficiency of the SCIPS model to identify strong binding sequences can be enhanced by raising the predicted RA threshold.
  • the aim of this work was to create an HLA-A*0201 affinity prediction model capable of selecting strong HLA-A*0201 binders that, according to current immunological dogma, should also be immunogenic.
  • the SCIPS model we describe allows the identification of almost all the high affinity but also a significant percentage of intermediate/low affinity immunogenic peptides. It represents, therefore, a powerful tool for the identification of immunogenic virus and tumour epitopes that could be used for specific vaccination. It is now well documented that virus and tumor antigens contain a large number of immunogenic epitopes (Menendez-Arias et al., 1998; Cibotti et al., 1992).
  • the establishment of the complete immunogenic epitope map of these antigens could be of great immunotherapeutic interest for two reasons.
  • tumour antigens are non-mutated self-proteins and their specific CTL repertoire is strongly influenced by the mechanisms of negative selection (Disis et al., 1996; Coletta et al., 2000; Kast et al., 1994).
  • Second, the identification of a large number of immunogenic epitopes will allow a polyspecific vaccination that has been demonstrated to be more efficient than a monospecific vaccination (Oukka et al , 1996) .
  • HLA-A*0201 epitopes have the specific anchor and strong residues (L/M/V/I/A) in P2 and C-terminal P (primary anchor motifs).
  • primary anchor motifs only 30% of peptides with primary anchor motifs exhibit a high affinity. This is due to the presence of secondary anchor motifs which are also involved either favorably or unfavorably in the peptide-HLA-A*0201 interaction.
  • Extended motifs (primary and secondary anchor motifs) and statistical binding matrices have already been used to perform a search of high affinity immunogenic peptides (Parker et al . , 1994; Brusic et al . , 1994).
  • ANN Neural networks
  • SCIPS model of the present invention allows for the first time the creation of complete immunogenic epitope maps of tumour and virus antigens (antigen CTL epitope Biomap™).
  • a similar approach is currently being developed for peptides presented by HLA-B*0702 and HLA-A*0301.
  • together with HLA-A*0201, these three HLA molecules cover 80% of the Caucasian population.
  • the long-term benefits of this strategy would be that a reliable prediction of immunogenicity could be generated from genome data.
  • the sequences from the human genome could be translated to "antigen CTL epitope Biomaps" of potential self-reactivities of autoimmune and antitumor relevance whereas the sequences from the various microbial and virus genomes could be translated to "antigen CTL epitope Biomaps" of potential interest in vaccine development.
  • SCIPS method may also be applied to the analysis of polypeptide sequences. Using a scanning frame of sequences (e.g. 10-15 residues) encoded in property space, any new sequence may be assigned to its correct functional family.
  • HLA-A*0201 transgenic, β2m-/-, Db-/- HHD mice (Pascolo et al, 1997) were injected sc with 100 µg of peptide emulsified in incomplete Freund's adjuvant (IFA) in the presence of 140 µg of the I-Ab restricted HBVcore 128-140 T-helper epitope.
  • IFA incomplete Freund's adjuvant
  • spleen cells (5 × 10^7 cells in 10 ml) were stimulated in vitro with peptide (10 µM).
  • the bulk responder populations were tested for specific cytotoxicity by using uncoated or peptide coated HLA-A*0201 expressing RMAS-HHD murine tumour cells.
  • T2 cells (3 × 10^5 cells/ml) were incubated with various concentrations of peptides in serum-free RPMI 1640 medium supplemented with 100 ng/ml of human β2m at 37°C for 16 hrs. Cells were then washed twice and stained with the BB7.2 mAb followed by FITC conjugated goat anti mouse Ig mAb to quantify the expression of HLA-A*0201. For each peptide concentration, the HLA-A*0201 specific staining was calculated as the % of the staining obtained with 100 µM of the reference peptide HIVpol 589 (IVGAETFYV).
  • RA = (Concentration of peptide that induces 20% of HLA-A*0201 expression / Concentration of the reference peptide that induces 20% of HLA-A*0201 expression) and is expressed as log10.
  • the mean RA value for each peptide was determined from at least three independent experiments. In all experiments, 20% of HLA-A*0201 expression using the reference peptide was obtained at 1-3 µM.
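The relative affinity calculation reduces to a base-10 logarithm of a concentration ratio, averaged over repeat experiments. The sketch below is a minimal illustration: the function names, argument names and the choice of µM as the unit are assumptions, and the concentrations in the example are invented rather than measured values.

```python
from math import log10


def relative_affinity(peptide_conc_um, reference_conc_um):
    """log10 of the ratio of the concentrations giving 20% HLA-A*0201 expression."""
    if peptide_conc_um <= 0 or reference_conc_um <= 0:
        raise ValueError("concentrations must be positive")
    return log10(peptide_conc_um / reference_conc_um)


def mean_relative_affinity(measurements):
    """Average RA over independent experiments.

    `measurements` is a list of (peptide_conc_um, reference_conc_um) pairs,
    one pair per experiment.
    """
    values = [relative_affinity(p, r) for p, r in measurements]
    return sum(values) / len(values)


# Invented concentrations from three hypothetical experiments.
ra_single = relative_affinity(10.0, 1.0)
ra_mean = mean_relative_affinity([(2.0, 2.0), (1.0, 1.0), (3.0, 3.0)])
```

With this definition a lower RA means that less peptide is needed to reach the same HLA-A*0201 expression as the reference peptide, and an RA of 0 means the peptide matches the reference.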
  • Multi-layered feed-forward networks are highly non-linear tools for function approximation. A summation of the combined inputs is used to predict the output values via a transfer function. In this study we used a three-layer, fully connected architecture. The parametric model represented by this network can be mathematically formulated as:
  • K is the number of input nodes, N the number of hidden nodes and M the number of output nodes.
  • x_k is the output of the input node k
  • θ_n is the bias of the input of hidden node n
  • W_kn is the weight connecting input node k to hidden node n
  • w_nm is the weight connecting hidden node n to output node m
  • f is the activation function.
  • the neural network implementation in TSAR® uses an identity activation function. For our set of peptides, the artificial neural network calculated the difference between the predicted RA and the experimental values. This difference is used to adjust the weights in the hidden layers and to minimize the overall error. For testing the predictive ability of the SCIPS model, 30% of the input data were excluded from the learning set and used as an internal validation set.
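The network formula itself is not reproduced in this text, so the forward pass below is reconstructed from the symbol definitions alone and should be read as an assumption: each output is taken as f(sum over n of w[n][m] * h[n]) with hidden values h[n] = f(sum over k of W[k][n] * x[k] + bias[n]). The names `forward`, `identity` and `theta` (for the hidden-node bias) are hypothetical, and whether the output layer also carries a bias or an activation is not stated in the source.

```python
def identity(value):
    """Identity activation, as the text reports for the TSAR implementation."""
    return value


def forward(x, W, theta, w, f=identity):
    """Forward pass of a fully connected three-layer network.

    x      : list of K input values
    W      : K x N nested list, W[k][n] links input node k to hidden node n
    theta  : list of N hidden-node biases
    w      : N x M nested list, w[n][m] links hidden node n to output node m
    f      : activation function applied at hidden and output nodes
    """
    n_hidden = len(theta)
    n_outputs = len(w[0])
    hidden = [
        f(sum(W[k][n] * x[k] for k in range(len(x))) + theta[n])
        for n in range(n_hidden)
    ]
    return [
        f(sum(w[n][m] * hidden[n] for n in range(n_hidden)))
        for m in range(n_outputs)
    ]


# Tiny hand-checkable example: K = 2 inputs, N = 2 hidden nodes, M = 1 output.
outputs = forward(
    x=[1.0, 2.0],
    W=[[1.0, 0.0], [0.0, 1.0]],
    theta=[0.5, -0.5],
    w=[[1.0], [1.0]],
)
```

Training would then compare `outputs` with the experimental RA values and adjust `W`, `theta` and `w` to reduce the error, which is the back-propagation step the text describes in words.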
  • Table 2. List of peptides selected by the SCIPS model. Each peptide shows a predicted relative affinity (see Methods) and its BiMass score.
  • Antigen              Sequence      RA (experimental)   RA (predicted)   BiMass score
    hTERT S4 ( : ⁇ o)     ILAKFLHWLM     0.5                 0.7               63
    hTERT 122            YLPNTVTDA      0.0                 0.5               52
    hTERT 122 (10)       YLPNTVTDAL     0.3                 0.9               48
    hTERT 381            RLPQRYWQM      0.8                 0.9               56
    hTERT 407            VLLKTHCPL      0.3                -0.2              134
    hTERT 496 (10)       SLGKHAKLSL     1.7                 1.8               21
    hTERT 511 (10)       KMSVRGCAWL     1.4                 1.4              297
    hTERT 540            ILAKFLHWL     -0.4                 1.3             1745
    hTERT 544 (10)       FLHWLMSVYV    -0.7                -0.1             1796
    hTERT 547 (10)       WLMSVYVVEL    -0.5                -0.2              835
    hTERT 548            LMSVYVVEL     -1.0                -0.1               60
    hTERTsie             LLTSRLRFI      0.9                 1.0               45
    hTERT 7
  • the 161 peptides tested for their immunogenicity belong to the training and the validation (internal and external) sets. The chi-square test was used to compare immunogenicity between TP and FN and between TN and FP.

Abstract

The present invention relates to a method of determining functions of protein sequences using computational methods. The method involves the identification of regions within a query protein which have a desired function, wherein the identification is independent of sequence homology. The method utilises a residue independent computational model whose inputs are not sequences but physicochemical and topological parameters derived from those sequences.

Description

METHOD FOR IDENTIFICATION OF PROTEIN FUNCTION
The present invention relates to methods of determining functions of protein sequences using computational methods.
It has been estimated that of the many thousands of genes (in excess of 30,000) in the human genome approximately 50% will initially be of unknown function (Fischer & Eisenberg 1999) . One route to understanding the function of proteins is via their 3-D structures. The comparison of 3D protein structures has shown that tertiary structure is more conserved than primary structure during evolution (Blundell et al, 1987). A large number of proteins are known that have similar or related functions, have the same fold but often have low homology at the level of the protein sequence (Thornton et al., 1991; Holm and Sander, 1994). It has been estimated that up to one third of known sequences would be homologous to at least one protein of known 3D structure (Chothia, 1992; Orengo et al, 1994) .
The problem is that currently, only a few thousand 3D protein structures are known while several tens of thousands of primary sequences are known. Many proteins have similar functions, for example kinase activity, or the ability to direct the protein to a particular sub-cellular compartment. In determining the function of the sequences, an approach is to look for sequence homologies to proteins - or regions thereof - whose function is known. A problem in the art is currently how to handle sequences whose homology to known protein families is below the level that current methods are able to assign to a particular structural or functional family. Of course, if the sequence is entirely new so that no other protein is known with that structure or function, then by definition, it is a new family member. Again, for some protein families the functional signatures may not map directly onto structure, so that dissimilar structures may have functional similarity (e.g. bacterial and mammalian serine proteases) . In these cases, three dimensional structural information may be insufficient and additional information, such as sequence motifs, common residue clusters or characteristic surface properties, will be required (see Orengo et al , 1999) .
Numerous methods have been developed that seek to provide structural/functional relationships between protein sequences where the information is extracted directly from the sequence. Some frequently used examples are listed below, details of which are found on the ExPASy Molecular Biology Server (www.expasy.ch).
Approaches include PROSITE, a method for detecting active sites and patterns in protein sequences (Bairoch, 1991); PFAM/SCOP analysis, a protein domain assignment method (Murzin et al, 1995); COGs (Clusters of Orthologous Groups) analysis, a method of protein function prediction; Superfamily assignments; Functional categories assignment (Bray et al 2000); Signal peptide presence, for secreted or membrane proteins; Transmembrane region assignment (Wallin & Heijne, 1998); Secondary structure prediction (Rost & Sander 1995; Garnier et al 1996; see also the following servers at http://www.nbrf.georgetown.edu/pirwww/, http://dodo.cpmc.columbia.edu/pp/predictprotein.html, or http://www.expasy.ch/); and Homologous 3D structure check (Laskowski 1993).
An example of a method that can be used to assign a sequence to a particular 'fold' (structure) family is fold assignment by sequence similarity to a protein of known 3D structure. The reported fraction of fold assignments in the various genomes amounts only to about 10-20% of sequences. In outline, fold assignment includes the steps of:
1) Sequence alignment of the target to the template (e.g. BLAST (Altschul et al, 1990); PSI-BLAST (Deleage et al., 1997); FASTA (Pearson & Lipman, 1988), using either pairwise algorithms (Smith and Waterman, 1981) or multiple alignment algorithms (Needleman and Wunsch, 1970, and Thompson et al, 1994 (CLUSTAL W)));
2) Identification of structurally conserved regions (SCRs) from the template(s) and use of these to form the framework of the target model;
3) If needed, a check of the secondary structure prediction in the template / target, after which the alignment is modified by hand if necessary;
4) Building of the remaining structurally conserved variable regions;
5) Building of the side-chains; and
6) Checking the 'goodness' of the postulated model (e.g. using PROCHECK) . The process may be repeated from step 1 one or more times until step 6 is acceptable.
For this approach to work, about 30% sequence identity is usually necessary between the template and the target. It also requires full length sequence alignment where gaps or insertions can be positioned with confidence. Even so, errors in alignment occur, known as "misleading local sequence alignments" (MLSAs), where an apparently unambiguous alignment between two proteins may not reflect the correct "threading" of the sequence onto the structure (see Saqi et al 1998).
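The identity requirement above can be checked once two sequences have been aligned. The following is a minimal sketch, assuming the alignment is already available as two equal-length gapped strings and that identity is counted over columns where neither sequence has a gap (other denominators are also in common use). The function name `percent_identity` and the example strings are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Invented toy alignment: 8 ungapped columns, 7 of them identical.
identity_pct = percent_identity("ILAKFLHWL-", "ILSKF-HWLM")
```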
There are also methods currently used to identify new proteins with below-threshold sequence similarity. These include sequence profiling (Gribskov et al, 1990) and sequence motif searching (Bork & Gibson, 1996). Sequence profile methods use evolutionary information from neighbouring sequences in the sequence database to build a profile. An iterative sequence profile method able to detect distant relationships is PSI-BLAST.
A further method is fold assignment, or 'threading'. These methods explicitly incorporate structural information from available 3-D protein structures. In many cases these methods can detect distantly diverged proteins as well as unrelated proteins with a similar fold (see PROTEINS: 23(5), 1995; Suppl 1, 1997 and Suppl 3, 1999 for further details of methods and results). However, unless at least one 3D protein structure of the family is known, the method will be unable to assign the new sequence to a structural or functional family.
Fold assignment methods are at least as successful as sequence profiling methods and, in addition, are able to assign another 10-15% of open reading frames (ORFs) from genome sequencing projects. Furthermore, some of the predictions from fold assignment methods are not detectable using sequence based methods. Conversely, sequence based methods sometimes identify distant relationships that fold assignment methods do not detect. This is because the sequence methods incorporate evolutionary information from neighbouring sequences whereas traditional fold methods typically do not.
As of today, the majority of the fold assignments correspond to ORFs with known functions, unevenly covering the various functional classes (Riley, 1993). The lowest number of assigned folds corresponds to proteins in the membrane, ribosomal, transcriptional and "unknown function" categories. 30-40% of ORFs in the various genomes are of unknown function. Of these, half have no matches in the sequence database and are classified as sequence orphan ORFs or "sequence ORFans". Most are likely to be very distant members of existing families; however, fold assignment methods succeed in predicting folds for only a small number of ORFans (Dubchak et al 1998). Sequence based methods are not able to assign folds to sequence ORFans because by definition ORFans have no sequence neighbours. Thus, in order to assign folds or function to the majority of sequence ORFans, more sensitive methods will be required.
Another problem in relation to protein structure is in providing an accurate and reliable theoretical method to identify peptides that bind with high affinity to HLA molecules. The identification of tumour and virus immunogenic epitopes is of great importance for the design of tumour and virus vaccines. The most common property of all the immunogenic peptides is their high affinity for the HLA molecule (Sette et al, 1994; Oukka et al , 1996, van den Burg et al, 1996; Tourdot et al, 1997) . Hence, considerable effort has gone into measuring the affinity of peptides to HLA molecules. A reliable theoretical method to identify peptide sequences within a given antigen that bind strongly to HLA would, therefore, be of great utility for the selection of immunogenic peptides, provided it is both efficient and accurate.
Affinity for HLA principally depends on the allele-specific pattern of conserved residues at particular positions in the peptide, the primary anchor motifs (Rammensee et al, 1995; Engelhard, 1994) . Although the large majority of immunogenic epitopes possess the allele-specific primary anchor motifs the presence of these motifs is not a sufficient condition for a peptide to show strong binding. Secondary anchors and deleterious residues at non-conserved positions also influence the peptide-HLA interaction (Ruppert et al, 1993) . Selection of high affinity peptides cannot therefore be achieved exclusively on the basis of the presence of primary anchor motifs - less than 30% of peptides having the allele specific primary anchor motifs exhibit strong binding (Gulukota et al , 1997) . Nor can the selection be carried out on the basis of the extended motif pattern, which takes into account not only primary but also secondary anchors and deleterious motifs. In fact, when this approach is used (Parker et al, 1994) no more than 50% of high affinity peptides are identified.
The main problem with the current motif based selection methods is the implicit assumption that the side chains of individual residues bind to the HLA in an independent manner. However, current knowledge on peptide-HLA interactions provided by recent crystallographic data, and by our own studies, does not fit with this assumption (Tourdot et al, 1997, Madden et al, 1993). For instance, Db affinity seems to depend on the presence of favourable 'sequences' rather than on the presence of favourable 'residues' at positions interacting with the Db molecule (Tourdot et al, 1997). This strongly suggests that the efficient selection of high affinity immunogenic peptides requires a residue-independent, sequence-dependent affinity prediction model. Further, given the present state-of-the-art in sequence comparison it is unlikely that methods relying solely on identification of motifs in sequence space will be successful.
Grassy et al (1998) disclose a method for computer-assisted rational design of immunosuppressive compounds. The reference describes the analysis of a set of peptides for immunosuppressive activity. A learning set of inactive and active peptides was analysed by a range of topological descriptors, and a set of topological descriptors for the active set of peptides was defined. The descriptors were used to screen a virtual combinatorial library of peptides which was generated based on a partially randomised 10mer consensus sequence in which positions 1, 5 and 10 were fixed as arginine, arginine, and tyrosine respectively.
Disclosure of the Invention.
We have developed novel methodology for the identification of regions within proteins which have a desired function, wherein the identification is independent of sequence homology.
The method utilises a residue independent computational model (SCIPS, for Sequence Comparison In Property Space) whose inputs are not sequences but physicochemical and topological parameters derived from those sequences. The ability of this approach to identify and classify within the same 'activity' family sequences that show very low homology has applications in any situation where relationships between distant sequence patterns are sought .
In a first aspect, the invention provides a method for determining whether a query protein sequence has a functional property of interest, which method comprises:
(i) providing a dataset of proteins which share the functional property of interest ;
(ii) encoding a plurality of physicochemical and/or topological descriptor parameters for at least one frameset of each member of the dataset ;
(iii) determining a set of said parameters which describe the at least one frameset; and
(iv) scanning a query protein sequence for the presence of a frameset which matches said parameters.
In a second aspect, the invention provides a method for determining whether a query protein sequence has a functional property of interest, which method comprises:
(i) providing a dataset of proteins which share the functional property of interest;
(ii) determining for each protein of said dataset at least one frameset, the frameset being a region within the protein which imparts to the protein said functional property of interest;
(iii) encoding a plurality of physicochemical and/or topological descriptor parameters for each frameset;
(iv) determining from the encoded descriptor parameters a set of descriptor parameters and their values which describe each frameset and which are indicative of said functional property of interest; and
(v) scanning a query protein sequence for the presence of a frameset which matches said set of descriptor parameters.
A further aspect of the invention provides a computer system which is operatively configured to implement the method of the first or second aspect.
By "a computer system" we mean the hardware means, software means and data storage means used to determine whether a query protein sequence has a functional property of interest according to the present invention. The minimum hardware means of a computer-based system of the present invention typically comprises a central processing unit (CPU) , working memory and data storage, and e.g. input means, output means etc. The data storage may comprise magnetic storage media such as floppy discs, hard disc storage medium and magnetic tape; optical storage media such as optical discs or CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. Examples of such systems are microcomputer workstations available from Silicon Graphics Incorporated and Sun Microsystems running Unix based, Windows NT or IBM OS/2 operating systems.
Further aspects of the invention provide (i) computer software code for implementing the method of the first or second aspect, and (ii) a computer programming product carrying the software code .
These and further aspects of the invention are described herein below.
Description of the Drawings.
Figure 1 shows a flow chart which schematically illustrates the invention.
Figure 2 shows histograms and corresponding normal distributions of the measured relative affinity (RA) values for 120 immunogenic and 72 non immunogenic sequences.
Figures 3A and 3B show data relating to validation of the model.
Figure 4 shows the immunogenicity of peptides belonging to the external validation set.
Figures 5a and 5b respectively show distributions of the BiMass and SYFPEITHI scores for the high and low affinity peptides of the external validation set.
Detailed description of the invention.
Definitions .
Query Protein Sequence .
This is any protein sequence available to those of skill in the art, for example a protein sequence available on a public or private database, whether with an ascribed function or unknown function. The protein sequence may be of any species origin, including human, primate, mammalian, vertebrate, insect, yeast, eukaryotic or prokaryotic. The sequence may be a confirmed protein sequence (i.e. based on direct protein sequence or a translation of an mRNA) or a hypothetical protein sequence based upon an open reading frame of genomic DNA, or a de novo designed sequence.
Dataset of Proteins.
The dataset of proteins may be a set of proteins associated with the functional property of interest. The proteins may also be of any species origin, including human, primate, mammalian, vertebrate, insect, yeast, eukaryotic or prokaryotic. Depending upon the property of interest, the dataset may be limited to proteins of a single species or it may comprise sequences from different organisms.
The dataset may further comprise synthetically generated sequences, for example sequences which have been synthesised in a random or semi-random manner and selected to have the property of interest .
The proteins may be full length sequences, partial sequences or peptides, or mixtures thereof. Where the sequences are partial sequences or peptides, these will be of sufficient length to be associated with the functional property. This will usually be at least 5, more generally at least 8, such as at least 9 or 10 amino acids in length.
The size of the dataset will be dependent upon a number of circumstances, including the number of sequences available in the art and the discrimination achieved by the topological parameters. Usually, it is desirable that the dataset comprises at least 5, preferably at least 10, more preferably at least 20 members. The maximum size of the dataset will be dependent only upon the numbers of sequences available and the computational power available for their analysis. However, a dataset of up to 500 members is feasible.
The term "proteins" in relation to the dataset is used to mean any polypeptide sequence, including short sequences (e.g. 5 or more amino acids) which are often referred to in the art as "peptides" or "polypeptides" .
Functional Property of Interest.
This is any property associated with a set of sequences and known to a person of skill in the art. In the accompanying example, we have provided a dataset of peptides associated with HLA-binding properties. However, this is not limiting and the principle illustrated may be applied to any other property. Such properties include domains with enzymatic function, e.g. kinase, phosphatase or acetylase activity, domains which bind target molecules such as other proteins or DNA, domains which act as agonists or antagonists of biological function, proteins involved in signal transduction such as GPCRs, intracellular effector proteins and families as defined in the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/).
In one preferred aspect, the dataset is one in which the proteins contain regions of sequence homology associated with said functional property of interest. By "sequence homology", it is meant that those of skill in the art operating available sequence alignment algorithms are able to determine at least one region in the members of the dataset whose sequences contain regions of alignment of statistical significance. For example, the algorithm BLAST, mentioned above, may be used with default parameters to determine regions of sequence homology of at least 30% identity. This level of homology is often associated with a common evolutionary origin and hence related function of the protein. The region of homology may not be across the entire length of the protein, but within a frameset of the protein, as defined herein.
Frameset .
Proteins which share a functional property of interest may do so because of a region of sequence within the protein which imparts such a property to the protein. This is well known as such by those of skill in the art. For example, DNA-binding domains, zinc-finger domains, transmembrane domains, signal sequences and the like are found in subdomains of proteins and often such a subdomain may be transferred ("donated") to other "recipient" protein sequences to transfer the property from donor to recipient.
Sometimes, the property may be discontinuous within the protein. For example, the target binding of an antibody variable domain is primarily determined by the properties of three hypervariable regions in each of a light and heavy chain variable domain which are separated by framework regions .
In the present invention, we refer to the region within the protein associated with the property of interest as a frameset. The frameset may be continuous or discontinuous to cover 2 or 3 different regions. The frameset is the region from which the descriptor parameters are determined.
Generally, the frameset will define sequences of substantially the same length, although some variation in sequence length is permitted. This is because functional domains shared by proteins will often differ in length to some extent. The actual size of the frameset will be determined by factors which will vary in each case, taking account of the functional property of interest and the capacity of the computational model. For example, in the accompanying example the frameset is a 9mer-10mer size. The frameset may be in this range, or greater, for example in the range of from 8 to 50 amino acids, such as from 8 to 40, for example from 10 to 25 amino acids.
In the accompanying example, the frameset is the same size as the peptides of the proteins of the dataset. In other instances, the frameset may be defined as a sequence within a longer protein sequence. For example, a dataset of proteins which share a common property of interest may be aligned so that the region of each protein associated with the property is shown aligned, irrespective of where the region appears in relation to the rest of the protein in which it is located (see Figure 1 box 1). This region is then used to determine a frameset, for which descriptors are calculated.
Physicochemical and Topological Descriptors.
This includes any physical or chemical property of a molecule, including topological parameters of the molecule. It may include either properties which are "static" (at least in a time-averaged sense) such as the dipole moment of a molecule, and/or "dynamic", such as ones characterising the range of conformations through which the molecule may flex over a period of time. In the case of some molecules, the flexing of the molecule over time can be determined with high accuracy using modern molecular modelling techniques. In the present context, a frameset of a protein is also considered to be a molecule for which descriptor parameters can be measured or calculated.
A range of descriptors are defined in commercially available packages, such as TSAR™ software (Oxford Molecular, UK) .
Examples of descriptors are found in Grassy et al (1998) and in WO99/33598. Descriptors include molar mass, ellipsoidal volume, molar volume, lipophilicity, dipole moment, total number of N or C or O atoms, number of methyl groups, and various topological, connectivity and shape indices, such as the Wiener or Balaban indices, the Kier Chi indices and the like. These and many other descriptors are defined in the "Tsar Reference Guide", published by Oxford Molecular (2000).
These and other descriptors may be used in the present invention.
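To illustrate what a topological descriptor is, the sketch below computes the Wiener index, which is the sum of shortest-path lengths between all pairs of atoms in the hydrogen-suppressed molecular graph. It is a generic illustration and not the TSAR implementation; the function name `wiener_index` and the adjacency-list input format are assumptions, and the graph is assumed to be connected.

```python
from collections import deque
from typing import Dict, List


def wiener_index(adjacency: Dict[int, List[int]]) -> int:
    """Sum of shortest-path lengths over all unordered pairs of vertices."""
    total = 0
    for source in adjacency:
        # Breadth-first search yields shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end


# Carbon skeletons of two C4 isomers (hydrogens suppressed).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

The two isomers have the same atom counts but different index values, which is the kind of shape information that composition-based descriptors alone do not capture.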
Description of the invention.
Figure 1 outlines an embodiment of the invention. In part (1), a set of protein sequences (shown by six parallel lines) each contain a region of homology (thick line) which is associated with a function common to all sequences.
The region of homology defines a frameset. The parameters of the frameset are encoded to provide a plurality of descriptors (step 2 of Figure 1) . Usually, from 10 to 100 descriptors, such as from 15 to 60 descriptors are encoded.
Desirably, a plurality of sequences which are not associated with the property are also encoded ("inactive sequences").
These may be randomly selected sequences. Inactive sequences will be of substantially the same size as the frameset, and similar in number. For example, where the number of protein sequences having activity is a number x, then from x/2 to 2x inactive sequences will likewise be encoded.
Having encoded the sequences, the descriptors are analysed in order to determine a set of descriptors and their values which describe the frameset (step 3 of Figure 1) . Preferably when inactive sequences have been encoded, the descriptors and their values are not common to the set of inactive sequences. A way of analysing and selecting the descriptors is as follows:
1- Intercorrelated descriptors are first either removed by using standard statistical practices, or decorrelated through algorithms known in the art such as principal component analysis (PCA) or Gram-Schmidt orthogonalisation.
2- From the descriptors obtained in the previous step, a set of descriptors is selected in whose space the set of active sequences is well separated from the set of inactive sequences. The ideal situation is when there is no overlap between the two regions of the space. However, a person of skill in the art has various ways to handle cases where there is a certain overlap and to determine the smallest possible set of descriptors exhibiting the smallest possible overlapping region. Such descriptor space is not necessarily a linear one and various computational techniques are available in the art to perform such an analysis, including neural networks, genetic algorithms, partial least squares (PLS), fuzzy logic and the like. The precise means of determination may be selected by a person of skill in the art, taking account of the nature and number of sequences to be analysed and personal preference. In the accompanying example we have used a neural network, and this is preferred.
The process will provide a set of descriptor parameters for a frameset which are indicative of the property of interest. The number and nature of the descriptors selected will depend upon the frameset and property selected, and will differ on a case-by-case basis. Usually, about 15-40 descriptors will be selected, though this is not fixed.
Having defined the descriptors, the method then comprises scanning a query protein sequence for the presence of a frameset which has parameters for the selected descriptors which match those which are indicative of the property. In Figure 1 frame 4 the frameset is shown as a box which is moved stepwise (e.g. 1 amino acid at a time) along the protein sequence, wherein the values of the selected descriptors are calculated for the residues within each box. However, those of skill in the art will appreciate that this is schematic and not limiting, and that in the present context "scanning" means inspecting successive regions of the sequence. Thus for example, the scanning may be directed to pre-selected regions, or linearly along the protein sequence in steps of more than one amino acid.
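The stepwise scanning described above can be sketched as a sliding window over the query sequence. In this illustration the helper names `windows` and `scan`, the single toy descriptor (number of sulfur atoms, counted as one per Cys or Met residue) and the acceptance rule are all assumptions; a real model would compute the full selected descriptor set and apply the trained classifier instead.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def windows(sequence: str, size: int, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for each frameset-sized window."""
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


def scan(sequence: str,
         size: int,
         describe: Callable[[str], Dict[str, float]],
         accept: Callable[[Dict[str, float]], bool],
         step: int = 1) -> List[Tuple[int, str]]:
    """Return the windows whose descriptor values satisfy `accept`."""
    return [(start, w) for start, w in windows(sequence, size, step)
            if accept(describe(w))]


def toy_descriptors(peptide: str) -> Dict[str, float]:
    # Placeholder descriptor: Cys and Met each contribute one sulfur atom.
    return {"n_sulfur": float(peptide.count("C") + peptide.count("M"))}


# Arbitrary illustrative rule: keep windows containing at least one sulfur atom.
hits = scan("ILAKFLHWLMSVYV", size=9,
            describe=toy_descriptors,
            accept=lambda d: d["n_sulfur"] >= 1)
```

Setting `step` to a value greater than 1 corresponds to scanning in steps of more than one amino acid.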
For example, in the accompanying example, we have defined a set of descriptors for peptides which bind to HLA. The frameset of 9-10mers was defined by reference to a set of 22 descriptors. We also found that the frameset was characterised by strong motifs at positions 2 and the C- terminus. Thus each protein query sequence was pre-scanned to select framesets comprising these motifs, and these selected framesets were then analysed by the chosen descriptors.
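The pre-scan for anchor motifs can be sketched as a simple filter applied before any descriptors are computed. The residue set L/M/I/V/A is the one reported for the anchor positions of the peptide database in this example; treating it as a hard filter, and the names `has_primary_anchors` and `candidate_framesets`, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Residues observed at P2 and at the C-terminus in most database peptides.
ANCHOR_RESIDUES = frozenset("LMIVA")


def has_primary_anchors(peptide: str) -> bool:
    """True if position 2 and the C-terminal residue are both anchor residues."""
    return (len(peptide) >= 2
            and peptide[1] in ANCHOR_RESIDUES
            and peptide[-1] in ANCHOR_RESIDUES)


def candidate_framesets(sequence: str,
                        sizes: Sequence[int] = (9, 10)) -> List[Tuple[int, str]]:
    """Collect 9mer and 10mer windows that carry both anchor motifs."""
    selected = []
    for size in sizes:
        for start in range(len(sequence) - size + 1):
            peptide = sequence[start:start + size]
            if has_primary_anchors(peptide):
                selected.append((start, peptide))
    return selected


# Short illustrative stretch of sequence; start positions are 0-based.
candidates = candidate_framesets("ILAKFLHWLMSVYV")
```

Only the windows returned by this filter would then be passed on to descriptor calculation and the trained model.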
The invention thus provides for the provision of novel peptides which have a property of interest, as well as for the identification of sequences which have a property of interest but which do not have sufficient sequence homology for this property to be identified by conventional methods.
Where the invention provides for the provision of novel peptides, a further aspect of the invention is a peptide obtained by the process of the invention, as well as a composition comprising said peptide plus a pharmaceutically acceptable carrier or diluent.
Peptides of particular interest which may be obtained in accordance with the invention include peptides which bind HLA Class I or Class II antigens, receptors and/or their cognate ligands, enzymes and protein-protein interaction inhibitors.
The invention is illustrated by the following example:
Introduction.
This example illustrates the invention in relation to a method for the prediction of HLA-A*0201 affinity, the selection of immunogenic HLA-A*0201 bound peptides, and the experimental test of the predictions. A unique feature of this computational method (SCIPS for Sequence Comparison In Property Space) is that it performs the selection in "property space" rather than sequence space and, as such, is capable of finding 'family' relationships among groups of peptides that are not identifiable using conventional sequence comparison methods. The model operates by the simulation of an artificial neural network whose inputs are not amino acid sequences but physicochemical and topological descriptors derived from those sequences . The model can identify 86.8% of high affinity peptides with a probability of correct prediction of 94.3%. More importantly, it is able to predict 88.6% of immunogenic peptides with an equally high probability (85.3%), leading to the possibility of creating an almost complete immunogenic epitope map of any tumour or virus antigen.
Immunogenic/nonimmunogenic peptide discrimination on the basis of HLA-A*0201 affinity.
The aim of this study was to create a computational model for the identification of peptides exhibiting affinities for HLA-A*0201 sufficiently high to ensure their immunogenicity. It was therefore necessary to define the affinity threshold discriminating immunogenic from nonimmunogenic peptides under the experimental conditions of the HLA-A*0201 affinity measurements. 192 peptides with various HLA-A*0201 affinities were tested for their capacity to elicit a CTL response in HHD mice. Previously, we and other groups have shown that peptides immunogenic in HLA-A*0201 transgenic mice, such as HHD and A2/Kb mice, are also immunogenic in humans and, conversely, that peptides nonimmunogenic in HLA-A*0201 transgenic mice are nonimmunogenic in humans (van der Burg et al., 1996, Bakker et al., 1997). Each peptide was tested in more than twelve HHD mice in several independent experiments. Peptides were considered immunogenic when a) the specific lysis of induced CTL was at least 15% above the nonspecific lysis and b) specific CTL were generated in more than 20% of primed mice. According to these criteria 120 peptides were found immunogenic and 72 nonimmunogenic. The distribution of their relative affinity (RA) values is shown in Figure 2. As expected, immunogenic peptides exhibited a much higher affinity (mean RA=0.01) than nonimmunogenic peptides (mean RA=1.08). The RA distributions of immunogenic and nonimmunogenic peptides exhibited only a small overlap. By using a confidence threshold of 90%, immunogenic peptides rank between RA -0.8 and 0.7 and nonimmunogenic peptides between RA 0.5 and 1.6. 91.1% of peptides with RA<0.7 were immunogenic. This percentage was 25% and 0.2% for peptides with RA=0.7-1 and RA>1 respectively. Since 113 out of 120 immunogenic peptides (94.2%) had an RA<0.7 and 61 out of 72 nonimmunogenic peptides (84.7%) had an RA>0.7, an RA=0.7 was considered as the affinity threshold discriminating immunogenic from nonimmunogenic peptides (p<0.0001).
Properties of the peptide database.
172 peptides extracted from 18 antigens were included in the database for the training of the neural network. 110 peptides were 9mers and 62 peptides were 10mers. 116 peptides exhibited a high (RA<0.7), 16 peptides an intermediate (RA=0.7-1.0) and 40 a low (RA>1.0) HLA-A*0201 affinity. Their sequence characteristics are illustrated in Table 1. Except at the anchor positions P2 and P9/10, occupied by any of L/M/I/V/A in the large majority of peptides (91.3%), there was a fair representation of all 20 amino acids.
Construction of the neural network model.
The architecture of the back propagation neural network, the transfer parameters and the convergence RMS, necessary to obtain good generalized performances, were optimised by trial and error with the help of the internal validation set formed by a random choice of 30% of the database. Numerous combinations of 60 descriptors were tested and an iterative selection procedure was followed by displaying the dependencies of the output variables on each input (descriptor) variable. For each descriptor combination, particular attention was paid to exclude combinations exhibiting a correlation of 0.7 or higher. Moreover, care was taken to keep the network sufficiently small in terms of the number of weights to be computed. In practical terms, the ratio p = Number of input samples / Number of weights to be evaluated was kept within the range 1.8 < p <2.2 (Tetko et al. , 1993) .
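The exclusion of correlated descriptor combinations can be sketched with a pairwise Pearson correlation check at the 0.7 cut-off mentioned above. The greedy keep-first strategy, the use of the absolute correlation, the function names and the toy values are assumptions for illustration; the text does not state which member of a correlated pair was discarded.

```python
from math import sqrt
from typing import Dict, List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_a * var_b)


def drop_correlated(columns: Dict[str, List[float]],
                    threshold: float = 0.7) -> List[str]:
    """Greedily keep descriptors whose |r| with every kept one is below threshold."""
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Invented descriptor values for four peptides.
toy = {
    "molar_mass": [1.0, 2.0, 3.0, 4.0],
    "molar_volume": [2.0, 4.0, 6.0, 8.1],  # nearly proportional to molar_mass
    "dipole": [1.0, -1.0, -1.0, 1.0],
}
selected = drop_correlated(toy)
```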
According to these rules of thumb, it was found that 22 input neurons (descriptors), related to the size (Inertia Moment 1 size, Inertia Moment 2 size, Inertia Moment 1 length, Ellipsoidal volume), the distribution of atomic partial charges (total dipole, dipole moment X, Y, Z component), the lipophilicity (log P octanol/water, total lipole, lipole X, Y, W component) and the topology (Kier chi3 cluster, Kier chiV3 cluster, Kier chi5 ring, Kier chi6 ring, Balaban index, Flexibility Index, Number of Oxygen atoms, Number of Sulfur atoms) of the peptides studied, were required to describe the molecules. These descriptors were used in two neural network architectures: 22:3:1 and 22:4:1.
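By way of illustration only, the size constraint described above can be checked with a few lines of Python. The sketch counts the adjustable weights of a fully connected network such as 22:3:1 or 22:4:1 and divides a sample count by that number. The function names are hypothetical, and whether bias terms are counted as weights, and which sample count enters the ratio, are assumptions that the text does not settle.

```python
def count_weights(layers, include_biases=False):
    """Count adjustable parameters of a fully connected feed-forward network.

    `layers` lists the node count of each layer, e.g. (22, 3, 1).
    Counting biases is optional because the text does not say whether
    they are included in the ratio (assumption).
    """
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    if include_biases:
        weights += sum(layers[1:])  # one bias per hidden and output node
    return weights


def sample_to_weight_ratio(n_samples, layers, include_biases=False):
    """Ratio p = number of input samples / number of weights."""
    return n_samples / count_weights(layers, include_biases)


for layers in [(22, 3, 1), (22, 4, 1)]:
    print(layers,
          count_weights(layers),                       # 69, then 92
          count_weights(layers, include_biases=True))  # 73, then 97
```

The ratio for a given training set then follows from `sample_to_weight_ratio(n_samples, layers)`, which can be compared against the 1.8 to 2.2 window.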
A network comprising three neurones in the hidden layer, a noise level equal to 0.03 and a convergence rate of 0.03 was selected as the final model since it gave the best simulation results. 120 peptides were used as the learning set and the remaining 52 peptides (39 9mers and 13 10mers) were used as the internal validation set. Graphical plots of the experimental versus predicted RA values for the compounds constituting the internal validation set are shown in Figure 3A1. 45 of the 52 peptides (86.5%) were correctly classified in the high and low affinity classes by using the threshold RA=0.7, with a good correlation between predicted and experimental RA (mean of ΔRA=0.4 and r=0.86) (Fig 3A1). The classification was efficient for both 9mers (84.6%) (Fig 3A2) and 10mers (92.3%) (Fig 3A3). To evaluate whether the model could efficiently identify strong binders (RA<0.7) we classified peptides in four different groups: true positive (TP; strong binders well predicted), true negative (TN; intermediate/weak binders well predicted), false positive (FP; predicted strong binders which in fact exhibit an intermediate/low affinity) and false negative (FN; predicted intermediate/weak binders which in fact exhibit a high affinity). We then calculated four different parameters: a) sensitivity (SEN): proportion of high affinity peptides that are correctly identified, TP/(TP+FN); b) specificity (SPE): proportion of intermediate/low affinity peptides that are correctly identified, TN/(TN+FP); c) positive predictive value (PPV): probability that a predicted strong binder has an experimental high affinity, TP/(TP+FP); and d) negative predictive value (NPV): probability that a predicted intermediate/weak binder has an experimental intermediate/low affinity, TN/(TN+FN).
The ideal theoretical model for the identification of high affinity peptides must combine a high sensitivity (detection of a high percentage of strong binders), a high specificity (good discrimination between strong and intermediate/weak binders), a high PPV (low percentage of false positive peptides) and a high NPV (low percentage of false negative peptides). Figure 3 shows that 86.8% of strong binders were identified with a high probability (PPV=94.3% and NPV=70.2%). This percentage was 83.3% and 100% for the 9mers and 10mers respectively.
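The four parameters defined above can be computed from paired experimental and predicted RA values, as in the following sketch. It is a minimal illustration, not the implementation used in the study: the function names are hypothetical and the six RA pairs are invented toy values, not data from the peptide database.

```python
def confusion_counts(pairs, threshold=0.7, predicted_threshold=None):
    """Count (TP, TN, FP, FN) for (experimental RA, predicted RA) pairs.

    A peptide is a strong binder when its RA lies below the threshold.
    `predicted_threshold` lets the cut-off on predicted RA differ from
    the experimental one; by default both use the same value.
    """
    if predicted_threshold is None:
        predicted_threshold = threshold
    tp = tn = fp = fn = 0
    for experimental, predicted in pairs:
        truly_strong = experimental < threshold
        predicted_strong = predicted < predicted_threshold
        if truly_strong and predicted_strong:
            tp += 1
        elif not truly_strong and not predicted_strong:
            tn += 1
        elif predicted_strong:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """Safe division; returns NaN when a class is empty."""
    return numerator / denominator if denominator else float("nan")


def binding_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, PPV and NPV as defined in the text."""
    return {
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }


# Invented (experimental RA, predicted RA) pairs, for illustration only.
toy_pairs = [(-0.4, -0.1), (0.2, 0.5), (0.3, 0.9),
             (1.4, 1.4), (1.7, 0.1), (0.9, 1.0)]
counts = confusion_counts(toy_pairs)
metrics = binding_metrics(*counts)
print(counts, metrics)
```

Passing a different `predicted_threshold` shows how moving the cut-off on predicted RA trades sensitivity against PPV while the experimental definition of a strong binder stays fixed at 0.7.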
Prediction of new binding sequences. In order to exploit the SCIPS model for finding new binding sequences and to validate it with an external validation set, we computationally screened the sequences of four tumour antigens (hTERT, HER-2/neu, PSMA and NPM/ALK). For hTERT and HER-2/neu these protein sequences were scanned for framesets of both 9mers and 10mers, while for PSMA and NPM/ALK the frameset was limited to 9mers. Interest was focused on frames which defined peptides with the HLA-A*0201 specific primary anchor and strong motifs (L/M/V/I/A at P2 and C-terminal P), referred to hereafter as HLA-A*0201 motifs, because these peptides are likely to be HLA-A*0201 epitopes. From a total of 556 peptides having the HLA-A*0201 motifs, 135 peptides were predicted to have a high affinity (50 hTERT, 45 HER-2/neu, 24 PSMA, 16 NPM/ALK) and 421 peptides were predicted to have an intermediate/low affinity (185 hTERT, 158 HER-2/neu, 41 PSMA, 37 NPM/ALK). 48 of the 556 peptides were randomly selected to be tested for their affinity in blind experiments (Table 2).
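The frame scan described above can be sketched as a sliding window that keeps only 9mers and 10mers carrying L/M/V/I/A at P2 and at the C-terminal position. This is an illustrative sketch only: the function name is hypothetical, the short test fragment is stitched from two overlapping hTERT peptides listed in Table 2 (ILAKFLHWLM and FLHWLMSVYV), and the reported start positions are relative to that fragment rather than to the full protein.

```python
ANCHOR_RESIDUES = set("LMVIA")  # residues accepted at P2 and at the C-terminus


def scan_frames(sequence, lengths=(9, 10)):
    """Return (start, length, peptide) for every frame with the anchor motif.

    `start` is 1-based and relative to `sequence`.
    """
    frames = []
    for length in lengths:
        for offset in range(len(sequence) - length + 1):
            peptide = sequence[offset:offset + length]
            # P2 is the second residue; the C-terminal residue is the last one.
            if peptide[1] in ANCHOR_RESIDUES and peptide[-1] in ANCHOR_RESIDUES:
                frames.append((offset + 1, length, peptide))
    return frames


fragment = "ILAKFLHWLMSVYV"  # toy fragment built from two Table 2 peptides
for start, length, peptide in scan_frames(fragment):
    print(start, length, peptide)
```

Each retained frame would then be encoded by its descriptors and submitted to the trained network to obtain a predicted RA.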
The results are summarised in Figure 3B, which shows the plots of the experimental and predicted RA values of the 48 9mer/10mer peptides (B1), divided into 37 9mers (B2) and 11 10mers (B3).
In each figure the lower left and upper right areas contain peptides correctly predicted according to the threshold RA=0.7. The four parameters, sensitivity (SEN), specificity (SPE), PPV and NPV, define the goodness of the SCIPS model and are calculated as follows: SEN = TP/(TP+FN), SPE = TN/(TN+FP), PPV = TP/(TP+FP), NPV = TN/(TN+FN), where TP (true positive) corresponds to strong binders well predicted, TN (true negative) corresponds to intermediate/weak binders well predicted, FP (false positive) corresponds to predicted strong binders which in fact exhibit an intermediate/low affinity and FN (false negative) corresponds to predicted intermediate/weak binders which in fact exhibit a high affinity. 38 of the 48 peptides (79.2%) were well classified, with a relatively good correlation between predicted and experimental RA (mean of ΔRA=0.5 and r=0.63) (Fig 3B). Only 6 out of 48 peptides (12.5%) showed a ΔRA>1. For the remaining 42 peptides the mean ΔRA was 0.35 and r=0.83. The classification was efficient for both 9mers (78.4%) (Fig 3B2) and 10mers (81.8%) (Fig 3B3). Figure 3B1 also shows that 82.8% of strong binders were identified with a high probability (PPV=82.8% and NPV=73.7%), confirming the results obtained with the internal validation set of peptides. This percentage was 85% and 77.8% for the 9mers and 10mers respectively.
The difference between predicted and experimental RA for the majority of peptides of the external validation set, represented by the mean ΔRA of 0.35, indicates that the SCIPS model does not include some experimentally high affinity peptides as strong binders. Most of these peptides must have a predicted RA ranging from 0.7 to 0.7+0.35 and, therefore, belong to the predicted intermediate affinity group. Using a predicted RA of 1.0 as a new threshold for the identification of strong binders, the sensitivity increased to 96% of the high affinity peptides. It was expected that this higher discrimination threshold would also result in more false positives, but the PPV was nevertheless high at 77%. Hence, the efficiency of the SCIPS model to identify strong binding sequences can be enhanced by raising the predicted RA threshold.
On the other hand, lowering the threshold of predicted RA to 0.3 enabled the elimination of some false positives and the identification of strong binders with a high probability. For a predicted RA=0.3, 91% of predicted strong binders were in fact strong binders and 68% of the high affinity peptides were still correctly classified. The probability of identifying strong binders increased to 100%, as expected, when a threshold of predicted RA=0 was used.
Correlation between immunogenicity and predicted RA
A crucial aim of the study described here was to create a prediction model able to identify HLA-A*0201 associated immunogenic peptides. Thus, it was important to confirm that predicted strong binders were, in fact, immunogenic while predicted intermediate/weak binders were not. Our interest was particularly focused on false positive and false negative peptides. Eighteen peptides of the external validation set were tested for their immunogenicity in HHD mice. Eight of them were strong (four true positive and four false negative) and ten intermediate/weak (five true negative and five false positive) binders. At least six mice were used for each peptide and results are shown in Figure 4. As expected, seven out of eight high affinity peptides were immunogenic while nine out of ten intermediate/low affinity peptides were not. It is noteworthy that the nonimmunogenic high affinity peptide (hTERT 934) was predicted to be an intermediate/weak binder, while the immunogenic intermediate/weak binder (PSMA 130) was predicted to have a high affinity. To further evaluate the capacity of the SCIPS model to predict immunogenic peptides, the immunogenicities of 111 high affinity (97 true positive and 14 false negative) and 50 intermediate/low affinity (38 true negative and 12 false positive) peptides belonging to the learning set and the external and internal validation sets were assessed (Table 3). None of the peptide sequences was found in the murine equivalent of the human antigen from which it derived. Therefore, the absence of a CTL response could be attributed to their nonimmunogenicity and not to tolerance of the specific CTL repertoire. Each peptide was tested in more than six mice and in at least two independent experiments. As expected, 87.4% of high affinity peptides were immunogenic while 84% of intermediate/low affinity peptides were not.
93 out of 105 immunogenic peptides (88.6%) were predicted to be strong binders and 93 out of 109 peptides predicted to be strong binders (85.3%) were immunogenic. Interestingly, immunogenicity varied significantly between true positive and false negative (p=0.0159) and between true negative and false positive peptides (p=0.0137). 90.7% of true positive but only 64.3% of false negative peptides were immunogenic. Similarly, 7.9% of true negative but 41.6% of false positive peptides were immunogenic.
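The percentages quoted above follow directly from the counts in Table 3. The short sketch below reproduces the arithmetic; the dictionary layout and the helper name are illustrative assumptions, and only the counts themselves are taken from the table.

```python
# Counts from Table 3 as (immunogenic, nonimmunogenic) peptides per group.
table3 = {
    "TP": (88, 9),   # high affinity, predicted strong
    "FN": (9, 5),    # high affinity, predicted intermediate/weak
    "TN": (3, 35),   # intermediate/low affinity, predicted intermediate/weak
    "FP": (5, 7),    # intermediate/low affinity, predicted strong
}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


immunogenic = {group: pos for group, (pos, neg) in table3.items()}
total = {group: pos + neg for group, (pos, neg) in table3.items()}

# Immunogenic fraction of experimentally high affinity peptides (97 of 111).
high_affinity_immunogenic = percent(
    immunogenic["TP"] + immunogenic["FN"], total["TP"] + total["FN"])

# Nonimmunogenic fraction of intermediate/low affinity peptides (42 of 50).
low_affinity_nonimmunogenic = percent(
    table3["TN"][1] + table3["FP"][1], total["TN"] + total["FP"])

# Immunogenic peptides that were predicted to be strong binders (93 of 105).
predicted_strong_immunogenic = immunogenic["TP"] + immunogenic["FP"]
share_of_immunogenic = percent(
    predicted_strong_immunogenic, sum(immunogenic.values()))

# Predicted strong binders that were immunogenic (93 of 109).
share_of_predicted_strong = percent(
    predicted_strong_immunogenic, total["TP"] + total["FP"])

print(high_affinity_immunogenic, low_affinity_nonimmunogenic,
      share_of_immunogenic, share_of_predicted_strong)
```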
These results further underscore the capacity of the SCIPS model to predict HLA-A*0201 bound immunogenic peptides. The model allowed the identification of the majority of high affinity immunogenic peptides and also a significant percentage of intermediate/low affinity immunogenic peptides in the antigens studied.
The aim of this work was to create an HLA-A*0201 affinity prediction model capable of selecting strong HLA-A*0201 binders that, according to current immunological dogma, should also be immunogenic. The SCIPS model we describe allows the identification of almost all the high affinity immunogenic peptides and also a significant percentage of the intermediate/low affinity immunogenic peptides. It represents, therefore, a powerful tool for the identification of immunogenic virus and tumour epitopes that could be used for specific vaccination. It is now well documented that virus and tumour antigens contain a large number of immunogenic epitopes (Menendez-Arias et al., 1998; Cibotti et al., 1992). The establishment of the complete immunogenic epitope map of these antigens could be of great immunotherapeutic interest for two reasons. The first derives from the assumption that the efficacy of vaccination depends mainly on the quality of the peptide specific CTL repertoire in terms of CTL frequency and avidity. This repertoire is established during the positive and negative selection in the thymus and its quality differs from one peptide to another (Theobald et al., 1997). It seems, therefore, necessary for an efficient vaccination to select, among all the immunogenic peptides, those which correspond to a high quality CTL repertoire. This is particularly important for antitumour vaccination, where tumour antigens are non-mutated self-proteins and their specific CTL repertoire is strongly influenced by the mechanisms of negative selection (Disis et al., 1996; Colella et al., 2000; Kast et al., 1994). Second, the identification of a large number of immunogenic epitopes will allow a polyspecific vaccination, which has been demonstrated to be more efficient than a monospecific vaccination (Oukka et al., 1996).
The large majority of high affinity HLA-A*0201 epitopes have the specific anchor and strong residues (L/M/V/I/A) in P2 and C-terminal P (primary anchor motifs). However, only 30% of peptides with primary anchor motifs exhibit a high affinity. This is due to the presence of secondary anchor motifs which are also involved, either favorably or unfavorably, in the peptide-HLA-A*0201 interaction. Extended motifs (primary and secondary anchor motifs) and statistical binding matrices have already been used to perform a search of high affinity immunogenic peptides (Parker et al., 1994; Brusic et al., 1994). Their use is based on the assumption that each amino acid in each position contributes a certain binding energy independent of the neighbouring residues and that the binding of a given peptide is the result of combining the contributions from the different residues. Multiplying the relevant matrix values should then give an indication of the binding affinity of different sequences. Such models have given a number of erroneous binding predictions (Gulukota et al., 1997). The relative failure of these models to predict affinity correctly is due largely to the fact that affinity is not 'individual-residue-dependent' but rather 'individual residue-independent and sequence-dependent', as suggested by crystallographic data from MHC/peptide complexes and by our previous analysis of peptide/Db interaction.
This dependency on sequence represents a limitation of all alignment-based prediction methods. In fact, the description of a linear sequence in simple amino acid space severely underdetermines the properties that sequence is likely to display in three-dimensional space. When a sequence is reduced to a string of atoms and atom groups for which 2-D descriptors can be calculated, the properties exhibited by the sequence cease to depend a priori on the amino acid as an independent chemical unit, but are integrated along the entire sequence. To further convince the reader that it is the 'descriptor' definition and selection that allows relationships within sequence families to be derived, we compare the behaviour of frequently used algorithms for homology scoring in the context of immunological epitopes (the BiMass score, Parker et al., 1994; and the SYFPEITHI score, Rammensee et al., 1999) with our own description using the SCIPS model. In Figure 5a and b we have respectively plotted the BiMass and SYFPEITHI scores of all the peptides shown in Table 2 for both high and low affinity peptides. As can be clearly seen, there is no correlation between the BiMass and SYFPEITHI scores and the experimental binding affinity; BiMass and SYFPEITHI are unable to discriminate low and high affinity peptides and, as such, would be poor predictors of potential epitopes. By contrast, in Figure 3A we show the same data set of peptides using the descriptor-based neural net described earlier. The discrimination between high and low affinity peptides is clearly visible.
Neural network (ANN) models have been used by others for the prediction of HLA-A*0201-peptide binding affinity (Gulukota et al., 1997; Buus et al., 1999). The unifying characteristic of these previous approaches is that they rely on a residue-dependent description of the peptides in which the neural net is trained to discriminate binding from non-binding peptides. As a result, Gulukota et al. (1997) are able to predict less than 40% of strong binders of their internal validation set, with a probability that this is a correct prediction of only 50%. By contrast, the SCIPS model of the present invention can identify 86.8% of high affinity peptides with a probability of correct prediction of 94.3% (Fig. 3). More importantly, it is able to predict 88.6% of immunogenic peptides with an equally high probability (85.3%). In this regard it is interesting that some peptides can be identified that have intermediate or low affinity but which are nonetheless immunogenic. Moreover, it is noteworthy that none of these earlier models has proven its efficacy in predicting new binding sequences. The utility of the SCIPS model is not limited to the identification of the majority of high affinity immunogenic HLA-A*0201 associated peptides of an antigen but may also allow the design of high affinity immunogenic variants of the intermediate/low affinity nonimmunogenic peptides. In previous reports we have demonstrated that intermediate/low affinity peptides can be of great interest in certain cases of tumour and virus immunotherapy (Tourdot et al., 1997). However, their efficient use requires the design of high affinity variants able to generate a strong CTL response. Selecting these peptide variants on the basis of predicted RA=0.3 or even RA=0 would allow the identification of a high affinity subset with an expected success rate close to 100%.
In conclusion, the SCIPS model of the present invention allows for the first time the creation of complete immunogenic epitope maps of tumour and virus antigens (antigen CTL epitope Biomap™). A similar approach is currently being developed for peptides presented by HLA-B*0702 and HLA-A*0301. Along with HLA-A*0201, these three HLA molecules cover 80% of the Caucasian population. The long-term benefit of this strategy would be that a reliable prediction of immunogenicity could be generated from genome data. The sequences from the human genome could be translated to "antigen CTL epitope Biomaps" of potential self-reactivities of autoimmune and antitumour relevance, whereas the sequences from the various microbial and virus genomes could be translated to "antigen CTL epitope Biomaps" of potential interest in vaccine development.
On a more general note, the SCIPS method may also be applied to the analysis of polypeptide sequences. Using a scanning frame of sequences (e.g. 10-15 residues) encoded in property space, any new sequence may be assigned to its correct functional family.
MATERIAL AND METHODS
Peptides. The peptide synthesis was achieved by a classic Fmoc chemistry protocol on an Automated Multiple Peptide Synthesis instrument (AMS 422, ABIMED).
Generation of CTL in HHD mice. HLA-A*0201 transgenic, β2m-/-, Db-/- HHD mice (Pascolo et al., 1997) were injected s.c. with 100 μg of peptide emulsified in incomplete Freund's adjuvant (IFA) in the presence of 140 μg of the I-Ab restricted HBVcore 128-140 T-helper epitope. After 11 days, spleen cells (5×10^7 cells in 10 ml) were stimulated in vitro with peptide (10 μM). On day 6 of culture, the bulk responder populations were tested for specific cytotoxicity by using uncoated or peptide-coated HLA-A*0201 expressing RMAS-HHD murine tumour cells.
Measurement of Peptide Relative Affinity to HLA-A*0201. T2 cells (3×10^5 cells/ml) were incubated with various concentrations of peptides in serum-free RPMI 1640 medium supplemented with 100 ng/ml of human β2m at 37°C for 16 hrs. Cells were then washed twice and stained with the BB7.2 mAb followed by FITC conjugated goat anti mouse Ig mAb to quantify the expression of HLA-A*0201. For each peptide concentration, the HLA-A*0201 specific staining was calculated as the % of the staining obtained with 100 μM of the reference peptide HIVpol 589 (IVGAETFYV). The relative affinity (RA) is determined as: RA = (Concentration of peptide that induces 20% of HLA-A*0201 expression / Concentration of the reference peptide that induces 20% of HLA-A*0201 expression) and is expressed as log10. The lower the RA value, the stronger is the peptide binding to HLA-A*0201. The mean RA value for each peptide was determined from at least three independent experiments. In all experiments, 20% of HLA-A*0201 expression using the reference peptide was obtained at 1-3 μM.
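As a worked illustration of the RA definition above, the following sketch computes the log10 ratio of the two concentrations and averages it over repeated experiments. The function names and the example concentrations are hypothetical; the only elements taken from the text are the log10 ratio itself and the requirement of at least three independent experiments.

```python
import math
from statistics import mean


def relative_affinity(peptide_conc, reference_conc):
    """RA = log10(peptide concentration / reference peptide concentration).

    Both concentrations are those giving 20% of HLA-A*0201 expression and
    must be expressed in the same unit (e.g. micromolar).
    """
    if peptide_conc <= 0 or reference_conc <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(peptide_conc / reference_conc)


def mean_relative_affinity(measurements):
    """Average RA over (peptide_conc, reference_conc) pairs.

    At least three independent experiments are required, as in the text.
    """
    if len(measurements) < 3:
        raise ValueError("at least three independent experiments are needed")
    return mean(relative_affinity(p, r) for p, r in measurements)


# Invented example: reference peptide reaches 20% expression at 2 uM.
print(relative_affinity(20.0, 2.0))  # tenfold more peptide needed: RA near 1
print(relative_affinity(2.0, 2.0))   # same concentration as reference: RA of 0
```

On this scale a peptide needing one tenth of the reference concentration would have an RA near -1, which is consistent with lower RA values indicating stronger binding.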
Artificial neural network model (SCIPS model) and structure-property calculations. The approximate 3D structures of the peptides were modelled using known HLA/peptide complexes and superimposed using the SYBYL software (Sybyl 6.6, TRIPOS, St. Louis, Missouri). For each peptide, 60 descriptors available in the software package TSAR® 3.2 (Oxford Molecular plc, Oxford, England) were calculated. A multi-layered feed-forward neural network with error back-propagation training algorithm (TSAR® 3.2, Oxford Molecular Ltd, Oxford, England) was applied to RA binding predictions using the values of the 22 descriptors selected from the initial 60 (see Results section). Multi-layered feed-forward networks are highly non-linear tools for function approximation. A summation of the combined inputs is used to predict the output values via a transfer function. In this study we used a three-layer, fully connected architecture. The parametric model represented by this network can be mathematically formulated as:
$$y_m = \sum_{n=1}^{N} w_{nm}\, f\!\left(\sum_{k=1}^{K} w_{kn}\, x_k - \theta_n\right), \qquad m = 1, \ldots, M$$
where K is the number of input nodes, N the number of hidden nodes and M the number of output nodes. x_k is the output of the input node k, θ_n is the bias of the input of hidden node n, w_kn is the weight connecting input node k to hidden node n, w_nm is the weight connecting hidden node n to output node m, and f is the activation function. The neural network implementation in TSAR® uses an identity activation function. For our set of peptides, the artificial neural network calculated the difference between the predicted RA and the experimental values. This difference is used to adjust the weights in the hidden layers and to minimize the overall error. For testing the predictive ability of the SCIPS model, 30% of the input data were excluded from the learning set and used as an internal validation set.
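The forward pass of such a three-layer network can be sketched in a few lines using the symbols just defined: each hidden node n applies f to the weighted sum of the inputs x_k minus its bias θ_n, and each output y_m is a weighted sum of the hidden-node outputs. This is a minimal sketch, not the TSAR® implementation. The function names and the toy weights are hypothetical, and leaving the output node without its own activation is an assumption, since the text does not state how the output is transformed.

```python
def identity(value):
    """Identity activation, as the text states for the implementation used."""
    return value


def forward(x, w_in, theta, w_out, f=identity):
    """Forward pass of a fully connected K:N:M network.

    x      -- list of K input values (descriptor values)
    w_in   -- K x N nested list, w_in[k][n] links input k to hidden node n
    theta  -- list of N hidden-node biases
    w_out  -- N x M nested list, w_out[n][m] links hidden node n to output m
    f      -- activation applied at the hidden nodes
    """
    n_hidden = len(theta)
    n_out = len(w_out[0])
    hidden = [
        f(sum(w_in[k][n] * x[k] for k in range(len(x))) - theta[n])
        for n in range(n_hidden)
    ]
    return [
        sum(w_out[n][m] * hidden[n] for n in range(n_hidden))
        for m in range(n_out)
    ]


# Toy 2:2:1 network with invented weights, for illustration only.
y = forward(x=[1.0, 2.0],
            w_in=[[1.0, 0.0], [0.0, 1.0]],
            theta=[0.5, 0.5],
            w_out=[[2.0], [3.0]])
print(y)  # hidden outputs are 0.5 and 1.5, so y is [2*0.5 + 3*1.5] = [5.5]
```

For the selected 22:3:1 architecture, `x` would hold the 22 descriptor values of a peptide and the single output would be its predicted RA; any other callable, such as `math.tanh`, can be passed as `f`.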
Table 1. Amino acid occurrence at each sequence position in the HLA peptide database.
       P1    P2    P3    P4    P5    P6    P7    P8   P9*  P9**   P10
A      11     2    12    20    11     8    17    12     5     5     3
G       9     0    10    24    13    18    13     3     0     0     0
L      16   133    30     5    18    20    26    28    57     7    18
I       6     8     3     5     8     8     6     1     8     2     5
M       3    17     8     2     0     4     3     4     2     0     0
V       9    10     8     8     9    18     8     9    43     0    22
T       3     3    16     4    13    10    10    11     0     5     6
E       2     0     5    20    17     8     4    10     0     4     0
D       2     0     8    16     9     7     5     1     0     4     0
Q       7     1    19     8    11     5     8     9     0     1     0
K       7     0     3     3     3     1     5     6     0     0     0
Y      57     0     9     0    13     4     2     6     0    10     0
F       7     0    11     4     5     9    20    16     0     9     0
W       1     0     6     4     8     0     3     4     0     2     0
P       3     0    11    13     4     7     3     5     0     0     0
S       9     0     6    19    10     7    17    14     0     3     0
H       2     0     2     3    10     2     3    13     0     0     0
N       4     0     4     3     5    12    12     4     0     1     0
R      14     0     0    11     7    16     7    16     0     4     0
C       2     0     3     2     0    10     2     2     2     0     3
* C-terminal P9 of 9mers ** P9 of 10mers
Table 2. List of peptides selected by the SCIPS model. Each peptide shows a predicted relative affinity (see Methods) and its BiMass score.
Antigen             Sequence     RA experimental  RA predicted  BiMass score
hTERTS4 (10)        ILAKFLHWLM    0.5              0.7            63
hTERT122            YLPNTVTDA     0.0              0.5            52
hTERT122 (10)       YLPNTVTDAL    0.3              0.9            48
hTERT381            RLPQRYWQM     0.8              0.9            56
hTERT407            VLLKTHCPL     0.3             -0.2            134
hTERT496 (10)       SLGKHAKLSL    1.7              1.8            21
hTERT511 (10)       KMSVRGCAWL    1.4              1.4            297
hTERT540            ILAKFLHWL    -0.4              1.3            1745
hTERT544 (10)       FLHWLMSVYV   -0.7             -0.1            1796
hTERT547 (10)       WLMSVYVVEL   -0.5             -0.2            835
hTERT548            LMSVYVVEL    -1.0             -0.1            60
hTERTsie            LLTSRLRFI     0.9              1.0            45
hTERT772            YMRQFVAHL     0.5              0.5            47
hTERTa7i (10)       LLVTPHLTHA    0.4             -0.1            19
hTERT926 (10)       GLFPWCGLLL   -0.3             -1.1            79
hTERT934            LLDTRTLEV    -0.4              1.0            47
hTERT969            NMRRKLFGV     1.1              0.5            51
hTERToos (10)       LLLQAYRFHA    0.7              0.5            181
hTERTiooβ           LLQAYRFHA     0.3             -0.5            49
hTERToiβ (10)       VLQLPFHQQV    0.2              0.2            449
hTERTo72            WLCHQAFLL    -0.1             -0.3            569
hTERT1078 (10)      FLLKLTRHRV    0.3              0.8            1183
HER-2/neu 83        YVLIAHNQV     0.6              0.1            103
HER-2/neu 10_       QLFEDNYAL    -0.3              0.2            324
HER-2/neu 95_i      YMIMVKCWM     1.6              1.5            90
HER-2/neu 1172      TLSPGKNGV     1.7              1.7            69
PSMA 20             WLCAGALVL     0.7              0.0            40
PSMA 130            IINEDGNEI     1.7              0.1            10
PSMA 193            KINCSGKIV     1.8              1.4            14
PSMA 200            IVIARYGKV     1.8              1.4            2
PSMA 260            NLNGAGDPL     1.8              0.4            10
PSMA 286            AVGLPSIPV     0.1              0.2            6
PSMA 354            RIYNVIGTL     1.7              1.4            3
PSMA 473            LVHNLTKEL     1.7              1.5            3
PSMA 582            MVFELANSI     0.7              0.2            23
PSMA 660            VLRMMNDQL     1.4              1.4            1
PSMA 7u             ALFDIESKV     0.2              0.2            1054
PSMA 733            YVAAFTVQA     0.0              0.0            7
NPM/ALK 4           SMDMDMSPL     0.7              0.3            14
NPM/ALK 41          QLSLRTVSL     1.4              0.3            21
NPM/ALK 64          AMNYEGSPI     0.4              0.3            7
NPM/ALK 74          KVTLATLKM     1.7              1.6            1
NPM/ALK 125         ELQAMQMEL     1.8              1.4            2
NPM/ALK 204         PLQVAVKTL     1.4              1.0            1
NPM/ALK 241         CIGVSLQSL     1.8              0.3            7
NPM/ALK 614         SLLLEPSSL     0.7              0.1            79
NPM/ALK 650         GLPLEAATA     0.5              0.5            5
NPM/ALK 667         TILKSKNSM     1.7              1.7            2
Table 3. Relationship between immunogenicity and predicted RA.
Peptides                     Immunogenicity    % of immunogenic
                             +       -
High affinity                97      14        87.4
  True positive              88      9         90.7    p=0.0159
  False negative             9       5         64.3
Intermediate/low affinity    8       42        16
  True negative              3       35        7.9     p=0.0137
  False positive             5       7         41.6
The 161 peptides tested for their immunogenicity belong to the training and the validation (internal and external) sets. The chi-square test was used to compare immunogenicity between TP and FN and between TN and FP.
References :
Altschul SF et al. J. Mol. Biol. 1990, Vol. 215, p403.
Altschul SF et al. Nucleic Acids Res. 1997, Vol. 25, p3389.
Bakker, A.B.H., et al. Int. J. Cancer 1997, 70, 302-309.
Blundell, T. Nature 1987, Vol. 326, p347.
Bork P & Gibson TJ. Meth. Enzymol. 1996, Vol. 266, p162.
Bray JE et al. Protein Eng. 2000, Vol. 13, p153.
Brusic, V., et al. Prediction of MHC binding peptides using artificial neural network. In Complex Systems: Mechanism of Adaptation. Ed. R.J. Stonier and X.H. You, IOS Press 1994.
Buus, S., et al. Curr. Opinion Immunol. 1999, 11, 209-213.
Chothia, C. Nature Vol. 357, p543.
Cibotti, R., et al. Proc. Natl. Acad. Sci. USA 1992, 89, 416-420.
Colella, T.A., et al. J. Exp. Med. 2000, 7, 1221-1231.
delaCruz, Thornton JM. Protein Science 1999, Vol. 8, p750.
Deleage, G., et al. Biochimie 1997, 79, 681-686.
Disis, M.L., et al. J. Immunol. 1996, 156, 3151-3158.
Dubchak I et al. Microbial Comp. Genomics 1998, Vol. 3, p543.
Fischer D & Eisenberg D. Curr. Opin. Struct. Biol. 1999, Vol. 9, p208.
Garnier J et al. Meth. Enzymol. 1996, Vol. 266, p540.
Grassy et al. Nature Biotechnology 1998, Vol. 16, p748.
Gribskov M et al. Meth. Enzymol. 1990, Vol. 183, p146.
Gulukota, K., et al. J. Mol. Biol. 1997, 267, 1258-1267.
Holm, L. & Sander, C. Nucleic Acids Res. 1994, Vol. 22, p3600.
Kast, W.M., et al. J. Immunol. 1994, 152, 3904-3912.
Laskowski RA et al. J. Applied Cryst. 1993, Vol. 26, p283.
Madden, D.R., et al. Cell 1993, 75, 693-708.
Menendez-Arias, L., et al. Viral Immunol. 1998, 11, 167-181.
Murzin AG et al. J. Mol. Biol. 1995, 247, 536-540 and (http://scop.mrc-lmb.cam.ac.uk/scop/index.html)
Needleman SB & Wunsch CD. J. Mol. Biol. 1970, Vol. 48, p443.
Orengo et al. Curr. Opin. Struct. Biol. 1999, Vol. 9, p374.
Orengo et al. Nature 1994, Vol. 372, p631.
Oukka, M., et al. J. Immunol. 1996, 157, 3039-3045.
Parker, K.C., et al. J. Immunol. 1994, 152, 163-175.
Pascolo, S., et al. J. Exp. Med. 1997, 185, 2043-2051.
Pearson W & Lipman DJ. Proc. Natl. Acad. Sci. USA 1988, Vol. 85, p2444.
Rammensee, H.G., et al. Curr. Opinion Immunol. 1994, 6, 13-23.
Rammensee, H.G., et al. Immunogenetics 1999, 50, 213-219.
Riley M. Microbiol. Rev. 1993, Vol. 3, p862.
Rost B & Sander C. J. Mol. Biol. 1995, Vol. 23, p295.
Ruppert, J., et al. Cell 1993, 74, 929-937.
Saqi MAS et al. Protein Eng. 1998, Vol. 11, p627.
Sette, A., et al. J. Immunol. 1994, 153, 5586-5592.
Smith TF & Waterman MS. J. Mol. Biol. 1981, Vol. 147, p195.
Tetko, I., et al. J. Med. Chem. 1993, 36, 811-814.
Theobald, M., et al. J. Exp. Med. 1997, 185, 833-841.
Thompson, J., et al. Nucl. Acid Res. 1994, 22, 4673-4680.
Thornton J et al. Nature 1991, Vol. 354, No. 6349, pp. 105-106.
Tourdot, S., et al. J. Immunol. 1997, 159, 2391-2398.
van der Burg, et al. J. Immunol. 1996, 156, 3308-3314.
Wallin E & Heijne GV. Protein Science 1998, Vol. 7, p1029.

Claims

CLAIMS :
1. A method for determining whether a query protein sequence has a functional property of interest, which method comprises :
(i) providing a dataset of proteins which share the functional property of interest;
(ii) determining for each protein of said dataset at least one frameset, the frameset being a region within the protein which imparts to the protein said functional property of interest;
(iii) encoding a plurality of physicochemical and/or topological descriptor parameters for each frameset;
(iv) determining from the encoded descriptor parameters a set of descriptor parameters and their values which describe each frameset and which are indicative of said functional property of interest; and
(v) scanning a query protein sequence for the presence of a frameset which matches said set of descriptor parameters.
2. A method according to claim 1, wherein the dataset proteins contain regions of sequence homology associated with said functional property.
3. A method according to claim 1 or 2, wherein said descriptor parameters are determined by a computational model which comprises a neural network.
4. A method according to claim 3, wherein the inputs to the computational model do not include peptide residue sequence data.
5. A method according to any one of claims 1 to 4, wherein the set of descriptor parameters comprises between 10 and 30 such parameters.
6. The method of any one of claims 1 to 5 wherein the frameset is from 8 to 40 amino acids in length.
7. The method of any one of claims 1 to 6 wherein the frameset is discontinuous.
8. A computer system which is operatively configured to implement the method of any one of claims 1 to 7.
9. Computer software code for implementing the method of any one of claims 1 or 7.
10. A computer programming product carrying the software code of claim 9.
PCT/GB2002/003244 2001-08-03 2002-07-15 Method for identification of protein function WO2003015001A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP01402106.7 2001-08-03
EP01402106 2001-08-03

Publications (2)

Publication Number Publication Date
WO2003015001A2 true WO2003015001A2 (en) 2003-02-20
WO2003015001A3 WO2003015001A3 (en) 2004-08-19

Family

ID=8182843

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2002/003244 WO2003015001A2 (en) 2001-08-03 2002-07-15 Method for identification of protein function

Country Status (1)

Country Link
WO (1) WO2003015001A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003054770A1 (en) * 2001-12-21 2003-07-03 Janssen Pharmaceutica N.V. A method of clustering transmembrane proteins
CN103177198A (en) * 2011-12-26 2013-06-26 深圳华大基因科技有限公司 Protein identification method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000079263A2 (en) * 1999-06-18 2000-12-28 Synt:Em S.A. Identifying active molecules using physico-chemical parameters
WO2001031579A2 (en) * 1999-10-27 2001-05-03 Barnhill Technologies, Llc Methods and devices for identifying patterns in biological patterns

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ADAMS H-P ET AL: "Prediction of binding to MHC class I molecules" JOURNAL OF IMMUNOLOGICAL METHODS, ELSEVIER SCIENCE PUBLISHERS B.V.,AMSTERDAM, NL, vol. 185, no. 2, 25 September 1995 (1995-09-25), pages 181-190, XP004021192 ISSN: 0022-1759 *
BRUSIC V ET AL: "PREDICTION OF MHC BINDING PEPTIDES USING ARTIFICIAL NEURAL NETWORKS" COMPLEX SYSTEMS: MECHANISM OF ADAPTATION, IOS PRESS, AMSTERDAM,, NL, 1994, pages 253-260, XP000933707 cited in the application *
BRUSIC V ET AL: "PREDICTION OF MHC CLASS II-BINDING PEPTIDES USING AN EVOLUTIONARY ALGORTIHM AND ARTIFICIAL NEURAL NETWORK" BIOINFORMATICS, OXFORD UNIVERSITY PRESS, SURREY, GB, vol. 14, no. 2, 1998, pages 121-130, XP000929180 ISSN: 1367-4803 *
GRASSY G ET AL: "COMPUTER-ASSISTED RATIONAL DESIGN OF IMMUNOSUPPRESSIVE COMPOUNDS" NATURE BIOTECHNOLOGY, NATURE PUBLISHING, US, vol. 16, August 1998 (1998-08), pages 748-752, XP000981977 ISSN: 1087-0156 cited in the application *

Also Published As

Publication number Publication date
WO2003015001A3 (en) 2004-08-19

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BY BZ CA CH CN CO CR CU CZ DE DM DZ EC EE ES FI GB GD GE GH HR HU ID IL IN IS JP KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NO NZ OM PH PL PT RU SD SE SG SI SK SL TJ TM TN TR TZ UA UG US UZ VN YU ZA ZM

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZM ZW AM AZ BY KG KZ RU TJ TM AT BE BG CH CY CZ DK EE ES FI FR GB GR IE IT LU MC PT SE SK TR BF BJ CF CG CI GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP