WO2008033100A1

WO2008033100A1 - Method of predicting protein allergenicity using a support vector machine

Info

Publication number: WO2008033100A1
Application number: PCT/SG2007/000293
Authority: WO
Inventors: Joo Chuan Tong; Zong Hong Zhang
Original assignee: Agency For Science, Technology And Research
Priority date: 2006-09-11
Filing date: 2007-09-06
Publication date: 2008-03-20
Also published as: WO2008033100A8

Abstract

A method for predicting the allergenicity of a protein using a non-linear prediction model based on an algorithm such as a probabilistic function, artificial neural network, hidden Markov model, multiple regression or Bayesian network, and in particular a support vector machine. The method comprises a learning phase where a data set of allergenic proteins is used to train the prediction model, and a prediction phase where the allergenicity of a protein is determined using the trained prediction model. In one embodiment the support vector machine is trained by creating a hyperplane separating and classifying the training examples according to a kernel function, where a third degree polynomial kernel function was found to be optimal.

Description

Method of predicting protein allergenicity using a support vector machine

Field of the invention

The present invention relates to predicting the structure, function, reactivity and/or binding of polypeptides. In particular, the present invention relates to using a computational system implementing a non-linear prediction model to predict the structure, function, reactivity and/or binding of a polypeptide.

Background of the invention

Polypeptides are polymers consisting of a sequence of amino acid residues, each of which is one of twenty possible amino acid residues ("amino acid residues" are referred to as "amino acids" hereafter). The sequence of amino acids forming the polypeptide is referred to as its primary structure. When not otherwise constrained, the polypeptide folds into a three-dimensional structure determined by its primary structure. The three-dimensional structure has a large effect on the functionality, reactivity and/or binding of the polypeptide molecule.

The use of computational systems to determine the structure, functionality, reactivity and/or binding of polypeptides is a very promising field. This has wide applications in, for example, determining the allergenicity of a polypeptide, IgE interactions, B-cell epitopes or T-cell epitopes, or in protein family classifications.

Following the various genome sequencing projects, including the human genome sequencing project, there is now a vast amount of polypeptide sequence information available. The use of computational systems to predict the structure, function, reactivity and/or binding of polypeptides complements and adds value to classical biological methods for determining the structure and function of polypeptides. In particular, the use of these computer modelling methods in allergy, antibody (IgA, IgD, IgG, IgE, IgM) interactions, B-cell epitope or T-cell epitope research integrates well into a clinical setting, such as in the diagnosis, prevention and treatment of diseases such as allergies, infections, cancer, immune diseases and the like.

Summary of the invention

The present invention aims to provide new and useful methods, computer systems and software for prediction of properties of polypeptides.

In general terms, the invention proposes that a target property of a polypeptide is predicted by defining a plurality of regions of the polypeptide, and for each of the regions obtaining one or more descriptors indicating a property of that region. This data is fed to a non-linear prediction model which has previously been trained on corresponding data from other polypeptides.

Specifically, using a training set of polypeptides, a dataset of training examples is formed comprising, for each of the polypeptides, the values of the target property and the corresponding values of the plurality of descriptors for each of the regions of the polypeptides. During a learning phase, the training set is used to train an adaptive non-linear prediction model. In a prediction phase, the target property of a polypeptide which is not part of the training set, is predicted using the trained non-linear prediction model.

According to a specific expression, the present invention provides a method for predicting at least one target property of a target polypeptide, the method comprising:

(a) a learning phase of:

(i) for each of a plurality of polypeptides not including the target polypeptide, forming a respective training example comprising the target property of the respective polypeptide and, for each of a plurality of pre-defined regions of the polypeptide, the values of one or more descriptors indicating corresponding properties of the region of the peptide; and

(ii) training a non-linear prediction model using the training examples; and

(b) a prediction phase of:

(i) obtaining for the target polypeptide, the values of said descriptors for the pre-defined regions of the target polypeptide; and

(ii) inputting the obtained descriptor values for the target polypeptide into the adaptive model; whereby the adaptive model outputs a value which is a prediction of the target property of the target polypeptide.

The polypeptides in the training set may exhibit the target property in varying degrees. Polypeptides exhibiting the target property strongly, intermediately, weakly or not at all may all be used in training the adaptive system.

The number of regions is at least two, but any higher number is possible, such as at least 5 or at least 10.

Each of the regions is preferably continuous (i.e. consists of a set of amino acids which are next to each other along the polypeptide). The plurality of regions preferably contains at least two regions which overlap. Preferably every amino acid in the polypeptide is part of at least two regions. More preferably, each amino acid is within at least three regions. Optionally, one of the regions may be the entire sequence.

One or more of the properties described by the descriptor values may be physico-chemical properties of the amino acid residues, including but not limited to any one or more of: charge, polarizability, polarity, hydrophobicity, bulkiness, relative mutability, solvent accessibility, and/or normalized van der Waals volume.

In forming the descriptor values for a given amino acid property, it is helpful to categorize amino acids into classes, such that the amino acids of one class all have the property to a higher level than all the amino acids of the other. For example, if the property under consideration is size, we may define two classes (big/small) and partition amino acids into these two classes. The descriptor values may then each be such as to indicate how many amino acids in the region are in one or more of the classes and/or how the amino acids in one or more of the classes are distributed along the region.

More specifically, the descriptors may be selected as ones which fall into one of three types: composition (C), transition (T) and distribution (D).

C-type descriptors represent the composition of the polypeptide from the point of view of a given amino acid property, by measuring the percentage of residues within the regions falling into each class of the property (e.g. what proportion of residues are "small").

T-type descriptors represent the frequency with which a particular amino acid property changes from one class to another along the entire region (e.g. the number of transitions between large to small and vice versa).

D-type descriptors represent the distribution pattern of a particular amino acid property along the entire region by measuring the location of the first, 25, 50, 75 and 100% of residues which fall into a certain class (e.g. the locations of the first small residue, and the first 25%, 50%, 75% and 100% of the small residues).

The non-linear prediction model may be a support vector machine (SVM), an artificial neural network, a hidden Markov model, or another statistical model, such as a multiple regression model or a Bayesian network.

The target property may be the allergenicity of a polypeptide. Alternatively, it may be the degree of interaction with a specific antibody or class of antibodies. For example, it may be the level of binding to the IgE class of antibodies. IgE interactions may lead to a cascade of events which eventually leads to an allergic response in a subject. The target property may also be prediction of B-cell epitopes or T-cell epitopes. Furthermore, it may be used in protein family classification. The methods of the present invention are general and can be extended to the prediction of all types of protein functions and reactivity.

The invention facilitates cyclical refinement of predictive models for improved accuracy by inclusion of new experimental data. Having predicted that a certain polypeptide will have a certain activity, it is possible to test it to check the result, and if the prediction is wrong modify the model, thereby refining it.

Certain embodiments of the invention facilitate high accuracy predictions of protein function for which no experimental data are available.

Certain embodiments of the invention also enable large-scale screening of proteins of a said function and have the advantage that they can generally be used for the prediction of protein functionality for various protein families.

An advantage of some embodiments of the present invention is that they amplify the regional weights of important portions of the polypeptide sequence (e.g. overlapping binding or interaction sites) by combining local information about regions (which may or may not be overlapping) that are scattered throughout the sequence. Utilizing overlapping and/or separate regions of the polypeptide may provide high accuracy prediction when the data from the different regions are combined.

Brief description of the figures

Having now generally described the invention, an embodiment of the invention will now be described for the sake of example only with reference to the accompanying figures in which:

Figure 1 is a flow diagram of steps of a method which is an embodiment of the invention;

Figure 2 is a schematic diagram of the division of a polypeptide sequence into 10 regions (A-J) in a first specific example of the invention;

Figure 3 is a schematic diagram of a polypeptide sequence consisting of glycine (G) and tyrosine (Y) residues divided into 10 regions (A-J) of varying length or composition according to the scheme of Figure 2, and representing multiple overlapping continuous and discontinuous epitopes; and

Figure 4 illustrates how polypeptide descriptor values are fed into an adaptive network in the embodiment.

Detailed description of the embodiment

An embodiment of the invention will now be described. It has the overall sequence of steps illustrated in Figure 1.

In a first step (step 1) a dataset of training examples is produced using a training set of polypeptides. The dataset includes for each polypeptide in the training set: a target property of the polypeptide, and one or more descriptor values for each of a number of regions of the polypeptide. Step 1 may include a number of sub-steps:

1. Deriving descriptor values from each of a set of regions of the polypeptides in the training set.

2. Combining the descriptors for each polypeptide in the training set.

Optionally step 1 may also include removing any amino acids which are identical in all the polypeptides of the training set, and which will be identical in all polypeptides whose properties the embodiment will be expected to predict. Such amino acids are not taken into account when deriving the descriptor values.

Step 1 further includes a sub-step of converting the combined descriptors into a format suitable for the adaptive system which is used in step 2, if that system has particular data format requirements.

In a second step (step 2) the dataset is used to train an adaptive non-linear prediction model.

In a third step (step 3) corresponding descriptor values are derived for corresponding regions of a new polypeptide which was not part of the training set.

In a fourth step (step 4) the adaptive system is used to predict the target property of a new polypeptide.

An explanation of the definition of the descriptor values used in the embodiment is now presented, followed by two experimental implementations of the embodiment.

1. Definition of the descriptor values

In this embodiment, each protein sequence is divided into 10 regions labelled as Regions A to J in Figure 2. These regions are selected to capture both sequential and conformational binding sites. The procedure begins by dividing a candidate sequence into 4 distinct, disjoint regions spanning the entire length of the protein (Regions A to D). Next, pairs of adjacent regions are combined to form the next set of local regions to be investigated (Regions E to G). Following this, Regions E to G are each extended by a quarter of the length of the entire protein sequence to form Regions H to J.
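The region scheme described above can be sketched in Python as follows. This is only an illustrative sketch: Figure 2 is not reproduced here, so the exact extents of Regions H, I and J (taken below as the first, last and central three quarters) and the integer-division rounding of the quarter boundaries are assumptions, and the function name `split_into_regions` is hypothetical.

```python
def split_into_regions(sequence):
    """Return a dict mapping region names A-J to substrings of the sequence."""
    n = len(sequence)
    # Quarter boundaries by integer division, so region lengths differ by at most one.
    q = [(i * n) // 4 for i in range(5)]
    # Eighth boundaries, used only for the assumed central three-quarter region.
    e1, e7 = n // 8, (7 * n) // 8
    spans = {
        # Regions A-D: four disjoint quarters spanning the whole sequence.
        "A": (q[0], q[1]), "B": (q[1], q[2]), "C": (q[2], q[3]), "D": (q[3], q[4]),
        # Regions E-G: pairs of adjacent quarters.
        "E": (q[0], q[2]), "F": (q[1], q[3]), "G": (q[2], q[4]),
        # Regions H-J: three-quarter-length regions (extents are assumptions).
        "H": (q[0], q[3]),  # assumed: first three quarters
        "I": (q[1], q[4]),  # assumed: last three quarters
        "J": (e1, e7),      # assumed: central three quarters
    }
    return {name: sequence[start:end] for name, (start, end) in spans.items()}


regions = split_into_regions("GGYGYYGGGYYYGG")
for name, subsequence in regions.items():
    print(name, subsequence)
```

With this rounding choice, Region E of the 14-residue example sequence used later in this section is the first seven residues, which agrees with the worked example; other rounding choices would shift the quarter boundaries by one residue.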

Descriptors of three types (composition (C), transition (T) and distribution (D)) are used to represent the properties of each region. A descriptor of type C represents the composition of a region from the point of view of a given amino acid property, by measuring the percentage of residues having that particular property along the specified region. A descriptor of type T represents the percentage frequency with which a particular amino-acid property changes along the entire region. A descriptor of type D characterizes the distribution pattern of a particular property along the entire region by measuring the location of the first residue with the property, and the location of the first 25, 50, 75 and 100% of residues with the property (Dubchak et al., 1995; Cai et al., 2003).

For example, consider a hypothetical protein sequence "GGYGYYGGGYYYGG" (SEQ ID: 1) containing 8 glycines and 6 tyrosines. This is divided into regions using the scheme of Figure 2, to give the result shown in Figure 3. Note that the regions A to D need not all be the same length (some are four amino acids long; some are three amino acids long). Each of these regions is approximately one quarter of the length of the chain, but some are rounded up and some down to give integer numbers. Exactly how this is done does not have a great effect on the prediction accuracy of the embodiment. Note that, as shown in Figure 2, the amino acids which are part of epitope 1 happen all to fall within region E, epitope 2 is encapsulated by region I, and epitope 3 is approximately region B.

As shown in Figure 3, the region E is the subsequence "GGYGYYG" (SEQ ID: 2). Let us consider the amino acid property which is the size of the amino acid. Let n1 be the number of small amino acid residues (such as glycine) and n2 be the number of large amino acid residues (such as tyrosine) within a specific region. The values of n1 and n2 in region E are thus n1 = 4 and n2 = 3, and we can define two descriptors of type C as the respective percentages: n1 / (n1 + n2) x 100.00 = 57.14 and n2 / (n1 + n2) x 100.00 = 42.86. The values of the corresponding descriptors for the other regions can be calculated in a similar manner.
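The type C calculation for region E can be reproduced with a minimal Python sketch. The values are expressed as percentages, i.e. the proportions 4/7 and 3/7 multiplied by 100. The mapping `SIZE_CLASS` is a hypothetical two-class grouping that covers only the two residues of the example; a real grouping would assign all twenty amino acids to classes.

```python
# Hypothetical two-class size grouping; covers only the residues in the example.
SIZE_CLASS = {"G": "small", "Y": "large"}


def composition(region, classes=("small", "large")):
    """Type C descriptors: percentage of residues in the region per class."""
    labels = [SIZE_CLASS[residue] for residue in region]
    return [round(100.0 * labels.count(c) / len(labels), 2) for c in classes]


c_values = composition("GGYGYYG")
print(c_values)  # [57.14, 42.86]
```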

Another possible descriptor (of type T) measures the percent frequency with which there is a transition between small and large residues. In region E, there are 4 transitions between small and large residues among the 6 adjacent residue pairs, giving a percent frequency of (4/6) x 100.00 = 66.67. The transitions for all other regions can be calculated in the same way.
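As an illustration, the type T value for region E can be computed by counting class changes between adjacent residues. This is a sketch under the same hypothetical two-class `SIZE_CLASS` grouping used for the example; returning 0.0 for regions shorter than two residues is an assumption, since the text does not discuss that case.

```python
# Hypothetical two-class size grouping; covers only the residues in the example.
SIZE_CLASS = {"G": "small", "Y": "large"}


def transition(region):
    """Type T descriptor: percentage of adjacent residue pairs that change class."""
    labels = [SIZE_CLASS[residue] for residue in region]
    pairs = list(zip(labels, labels[1:]))
    if not pairs:
        return 0.0  # assumed convention for regions with fewer than two residues
    changes = sum(1 for left, right in pairs if left != right)
    return round(100.0 * changes / len(pairs), 2)


t_value = transition("GGYGYYG")
print(t_value)  # 66.67
```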

The first small residue is the first residue, and the first 25%, 50%, 75% and 100% of small residues within region E are located within the first 1, 2, 4 and 7 residues respectively. Five descriptors (of type D) for small residues (glycine) can thus be derived as 1/7 x 100.00 = 14.29, 1/7 x 100.00 = 14.29, 2/7 x 100.00 = 28.57, 4/7 x 100.00 = 57.14, 7/7 x 100.00 = 100.00. The corresponding D descriptors for large residues (tyrosine) can be calculated similarly.
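The five type D values for glycine in region E can be reproduced with the sketch below. The text does not say how fractional residue counts are handled when the number of residues in a class is not a multiple of four, so rounding up (ceiling) is an assumption here, as is returning zeros when a class is absent from the region. `SIZE_CLASS` is again a hypothetical grouping limited to the example residues.

```python
# Hypothetical two-class size grouping; covers only the residues in the example.
SIZE_CLASS = {"G": "small", "Y": "large"}


def distribution(region, target_class):
    """Type D descriptors: relative positions (in percent of region length) of the
    first residue and of the first 25, 50, 75 and 100% of residues in a class."""
    positions = [i for i, residue in enumerate(region, start=1)
                 if SIZE_CLASS[residue] == target_class]
    if not positions:
        return [0.0] * 5  # assumed convention when the class is absent
    count = len(positions)
    # How many class members must be reached for each cut-off; ceiling is assumed.
    ranks = [1] + [-(-count * percent // 100) for percent in (25, 50, 75, 100)]
    return [round(100.0 * positions[rank - 1] / len(region), 2) for rank in ranks]


d_small = distribution("GGYGYYG", "small")
print(d_small)  # [14.29, 14.29, 28.57, 57.14, 100.0]
```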

All three types of descriptors (C, T and D) from all ten regions (A - J) were calculated, combined, and used as a feature vector for SVM training. In summary, 130 descriptor values are used to describe the polypeptide shown in Figure 3: 20 of type C, 10 of type T and 100 of type D.

The above example used a polypeptide having just glycine and tyrosine residues, but note that the procedure can be applied to a polypeptide including any amino acid residues, provided that each amino acid is classified as either "large" or "small".

Furthermore, while the above example assumed that amino acid residues are treated as falling into just two classes (large/small), if instead there are three classes, then the number of descriptors becomes 210: 30 of type C, 30 of type T and 150 of type D.

The same procedure can be carried out for each of the properties under consideration, producing a feature vector which is a representation of all the descriptors (of types C, T and D) for each region. This complete process is shown schematically in Figure 4. In Figure 4, 8 properties are considered (charge, polarizability, polarity, hydrophobicity, bulkiness, relative mutability, solvent accessibility, and normalized van der Waals volume), and the figure shows schematically how the descriptors of the three types (represented by the smallest squares) are obtained, and combined to give a feature vector.

2. First Experimental Implementation

In this example, which we will refer to elsewhere as "AllerPred", the embodiment is used as a prediction system for assessment of potential allergenicity of protein sequences.

2.1 The polypeptide database

The polypeptide database comprised 1906 sequences (669 allergens and 1237 non-allergens).

The polypeptide database was divided into training and testing sets. The training set consists of 631 IUIS approved allergens from the ALLERDB database (Zhang et al., 2006; the disclosure of which is incorporated by reference) and 1219 non-allergens derived from Bjorklund et al., 2005 (of which the disclosure is incorporated by reference). This partition was performed using a de-biasing strategy based on sequence similarity of protein sequences commonly found in consumed food with no records in existing allergen databases (Saha et al. 2006). Allergens represent ~34% of the training dataset (631 of 1850 sequences), while non-allergens represent the remaining 66%.

The testing dataset includes 38 IUIS allergens and 18 experimentally validated non-allergens extracted from the literature (Chakraborty et al., 2000; Laffer et al., 2003; Epton et al., 2002; Rihs et al. 2003; Ortona et al. 2003; Szakos et al. 2004; Dearman et al. 2001; Dearman et al. 2003; Banerjee et al. 2002; Takai et al. 1997; Mine et al. 2003).

2.2 Derivation of the dataset from the training set

For each of the eight properties shown in Figure 4, the amino acids were grouped into three classes so that, as mentioned above, for each property a total of 210 descriptors were used to describe each protein sequence: 30 for C, 30 for T and 150 for D. Thus, there were 1680 values in the feature vector for each polypeptide.
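The descriptor counts quoted in this description (130 for two classes, 210 for three classes, and 1680 for eight three-class properties) can be checked with a short Python sketch. The counting rule for type T, one descriptor per unordered pair of classes per region, is inferred from the quoted totals rather than stated explicitly, and the function name `descriptor_counts` is hypothetical.

```python
def descriptor_counts(n_classes, n_regions=10, n_properties=1):
    """Number of C, T and D descriptor values for the given configuration."""
    # Type C: one percentage per class per region.
    c = n_classes * n_regions
    # Type T: one value per unordered pair of classes per region (inferred rule).
    t = (n_classes * (n_classes - 1) // 2) * n_regions
    # Type D: five positions (first, 25, 50, 75, 100%) per class per region.
    d = 5 * n_classes * n_regions
    per_property = c + t + d
    return {"C": c, "T": t, "D": d,
            "per_property": per_property,
            "total": per_property * n_properties}


two_class = descriptor_counts(2)
three_class = descriptor_counts(3, n_properties=8)
print(two_class)    # C=20, T=10, D=100, per_property=130
print(three_class)  # C=30, T=30, D=150, per_property=210, total=1680
```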

2.3 The non-linear prediction model

The implementation used a support vector machine (SVM) as the non-linear prediction model. A comprehensive coverage of SVMs is provided in the literature (Joachim, 2002; Vapnik, 1998). In brief, SVMs belong to a class of statistical learning methods based on the structural risk minimization principle. It is known for the inputs to the SVM to be binary strings or feature vectors representing encoded representations of amino acid attributes previously reported as significant for characterization of protein families. In the first implementation of the embodiment, the parameters of the SVM were trained by mapping the input vectors into a high dimensional feature space and constructing an optimal separating hyperplane in the new feature space. The optimal separating hyperplane maximizes the margin between the positive and negative datasets and uniquely classifies the data into positive and negative examples. Different kernel functions (linear, polynomial, radial, and sigmoid) were explored to optimize the prediction accuracy of the SVM models.
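To make the role of the kernel function concrete, the following Python sketch shows a third degree polynomial kernel and the standard form of a kernel SVM decision value. It is not the implementation used in the embodiment: the kernel parameters `gamma` and `coef0`, the support vectors, the dual coefficients and the bias are all hypothetical placeholders, since the description does not report them or name the SVM software used.

```python
def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree.
    gamma and coef0 are assumed values, not taken from the description."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def decision_value(x, support_vectors, dual_coefs, bias):
    """Kernel SVM decision value: sum_i coef_i * K(sv_i, x) + bias.
    Each coef_i is the signed dual coefficient (alpha_i times the class label)."""
    return sum(coef * polynomial_kernel(sv, x)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias


# Toy two-dimensional example with made-up numbers, for illustration only.
toy_support_vectors = [[1.0, 0.0], [0.0, 1.0]]
toy_dual_coefs = [0.5, -0.5]
score = decision_value([2.0, 0.0], toy_support_vectors, toy_dual_coefs, bias=0.0)
label = "positive" if score > 0 else "negative"
print(score, label)
```

In a trained model the support vectors and coefficients come from solving the margin-maximization problem on the training feature vectors; only the prediction-time arithmetic is shown here.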

2.4 Model evaluation

For each kernel function, 10-fold internal cross validation was performed to assess the quality of the model (Tong et al., 2006). In k-fold cross-validation, k random, (approximately) equal-sized, disjoint partitions of the sample data are constructed, and a given model is trained on (k-1) partitions and tested on the excluded partition. The results are averaged after k such experiments, and the observed error rate may be taken as an estimate of the error rate expected upon generalization to new data.
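The partitioning step of k-fold cross-validation can be sketched in Python as below. This is a generic illustration of the procedure, not the software used in the embodiment; the function name `k_fold_indices` and the fixed random seed are assumptions, and the sample count of 1850 simply matches the size of the training set reported in section 2.1.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k random, disjoint, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Dealing indices out in turn gives fold sizes that differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(1850, k=10))
for train, test in splits:
    # A model would be trained on `train` and evaluated on `test` here;
    # the k evaluation results are then averaged.
    print(len(train), len(test))
```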

The predictive performance of each model was assessed using sensitivity (SE), specificity (SP) and receiver operating characteristic (ROC) analysis as described previously (Tong et al., 2006). SE=TP/(TP+FN) and SP=TN/(TN+FP) represent percentages of correctly predicted allergens and non-allergens, respectively. TP (true positives) stands for allergens correctly predicted as allergens and TN (true negatives) for non-allergens correctly predicted as non-allergens. FN (false negatives) refers to allergens predicted as non-allergens and FP (false positives) represents non-allergens predicted as allergens. The accuracy of our predictions was assessed by ROC analysis where the ROC curve is generated by plotting SE as a function of (1-SP) for various classification thresholds. The area under the ROC curve (AROC) provides a measure of overall prediction accuracy, AROC<70% for poor, AROC>80% for good and AROC>90% for excellent predictions (Tong et al., 2006).
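The evaluation measures defined above can be sketched in Python as follows. SE and SP follow the formulas given in the text; the area under the ROC curve is computed here through the equivalent rank statistic (the fraction of allergen/non-allergen pairs in which the allergen receives the higher score, counting ties as one half) rather than by plotting the curve. The labels and scores are made-up toy values, not data from the experiments.

```python
def sensitivity_specificity(labels, scores, threshold):
    """SE = TP/(TP+FN) and SP = TN/(TN+FP) at a given score threshold.
    labels: 1 for allergen, 0 for non-allergen."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def area_under_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
se, sp = sensitivity_specificity(toy_labels, toy_scores, threshold=0.5)
aroc = area_under_roc(toy_labels, toy_scores)
print(round(se, 2), round(sp, 2), round(aroc, 2))
```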

2.5 Results

The predictive performances of different kernel functions (linear, polynomial, radial, and sigmoid) were compared. In the example, the best results were obtained using a third degree polynomial kernel function. The AROC value is 0.81. Using amino acid composition as input for training and testing, the system can predict allergenic proteins with SE of 76.00% and SP of 76.00%.

A variety of techniques have previously been described for prediction of allergenicity in proteins using different testing datasets. To benchmark our system, the same testing dataset comprising 38 IUIS allergens and 18 experimentally validated non-allergens was used to evaluate three existing allergenicity prediction techniques: wavelet transform models (Li et al., 2004), SVM models based on global descriptors (Cai et al., 2003) and sequence similarity search based on FAO/WHO Codex alimentarius guidelines (Fiers et al., 2004).

The results indicate that the embodiment, which utilizes local sequence descriptors for training SVM models (AROC=0.81), consistently outperforms SVM models based on global sequence descriptors (AROC=0.71; Cai et al., 2003), wavelet analysis (AROC=0.69; Li et al., 2004) and FAO/WHO sequence similarity search (AROC=0.58; Fiers et al., 2004).

2.6 Discussion

Collectively, the results indicate that the embodiment can make accurate predictions for potential allergenicity of proteins that have been validated using IUIS allergens and experimentally validated non-allergen sequences.

The property encoding scheme explained above allows the embodiment to model multiple overlapping continuous and discontinuous B-cell epitope binding patterns within a protein sequence. The system is trained using official allergens approved by the International Union of Immunological Societies (IUIS) Allergen Nomenclature Sub-Committee plus non-allergens commonly found in consumed food with no records in existing allergen databases, and tested on experimentally validated allergens and non-allergen sequences.

An advantage of the adaptive system herein described is that it takes into account conformational and overlapping B-cell epitope recognition sites. This results in improved prediction accuracy.

3. Second Experimental Implementation

In this second implementation, the training set consisted of 559 IUIS approved allergens that do not belong to the Betulaceae or Birch family (i.e. nut-bearing trees) from the ALLERDB database (Zhang et al., 2006) and 1219 non-allergens randomly extracted using in-house filtering software as described above.

The test dataset included 110 official allergens derived from the Betulaceae or Birch family. The prediction results were of high accuracy (AROC=0.75). Because the allergen has numerous binding sites scattered throughout the entire molecule, the embodiment's selection of local descriptors may not be the optimal representation of the said family and may be further improved. Nonetheless, the high accuracy of prediction indicates that the embodiment is capable of generalizing to new data. The embodiment has the additional advantage that all the predictions were produced using a single predictive model.

References

The disclosure in the following references is hereby incorporated in its entirety:

Bock, J.R. and Gough, D.A. (2002) A new method to estimate ligand-receptor energetics. Mol. Cell. Proteomics, 1, 904-910.

Björklund, A.K., Soeria-Atmadja, D., Zorzet, A., Hammerling, U. and Gustafsson, M.G. (2005) Supervised identification of allergen-representative peptides for in silico detection of potential allergenic proteins. Bioinformatics, 21, 39-50.

Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X. and Chen, Y.Z. (2003) SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res., 31, 3692-3697.

Dubchak, I., Muchnik, I., Holbrook, S.R. and Kim, S.-H. (1995) Prediction of protein class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 92, 8700-8704.

FAO/WHO (2003) Codex Principles and Guidelines on Foods Derived from Biotechnology. Rome, Italy, Joint FAO/WHO Food Standards Programme.

Chakraborty, S., Chakraborty, N. and Datta, A. (2000) Increased nutritive value of transgenic potato by expressing a nonallergenic seed albumin gene from Amaranthus hypochondriacus. Proc. Natl. Acad. Sci. USA, 97, 3724-3729.

Laffer, S., Hamdi, S., Lupinek, C., Sperr, W.R., Valent, P., Verdino, P., Keller, W., Grate, M., Hoffmann-Sommergruber, K., Scheiner, O., Kraft, D., Rideau, M., Valenta, R. (2003) Molecular characterization of recombinant T1, a nonallergenic periwinkle (Catharanthus roseus) protein, with sequence similarity to the Bet v 1 plant allergen family. Biochem J., 373, 261-269.

Li, K.-B., Issac, P. and Krishnan, A. (2004) Predicting allergenic proteins using wavelet transform. Bioinformatics, 20, 2572-2578.

Epton, M.J., Smith, W., Hales, B.J., Hazell, L., Thompson, P.J. and Thomas, W.R. (2002) Non-allergenic antigen in allergic sensitization: responses to the mite ferritin heavy chain antigen by allergic and non-allergic subjects. Clin. Exp. Allergy, 32, 1341-1347.

Rihs, H.P., Dumont, B., Rozynek, P., Lundberg, M., Cremer, R., Bruning, T. and Raulf-Heimsoth, M. (2003) Molecular cloning, purification, and IgE-binding of a recombinant class I chitinase from Hevea brasiliensis leaves (rHev b 11.0102). Allergy, 58, 246-251.

Fiers, M.W., Kleter, G.A., Nijland, H., Peijnenburg, A.A., Nap, J.P. and van Ham, R.C. (2004) Allermatch™, a webtool for the prediction of potential allergenicity according to current FAO/WHO Codex alimentarius guidelines. BMC Bioinformatics, 5, 133.

Ortona, E., Margutti, P., Delunardo, F., Vaccari, S., Rigano, R., Profumo, E., Buttari, B., Teggi, A. and Siracusano, A. (2003) Molecular and immunological characterization of the C-terminal region of a new Echinococcus granulosus Heat Shock Protein 70. Parasite Immunol., 25, 119-126.

Szakos, E., Lakos, G., Aleksza, M., Gyimesi, E., Pall, G., Fodor, B., Hunyadi, J., Solyom, E. and Sipka, S. (2004) Association between the occurrence of the anticardiolipin IgM and mite allergen-specific IgE antibodies in children with extrinsic type of atopic eczema/dermatitis syndrome. Allergy, 59, 164-167.

Dearman, R.J. and Kimber, I. (2001) Determination of protein allergenicity: studies in mice. Toxicol Lett, 120, 181-186.

Dearman, R.J., Stone, S., Caddick, H.T., Basketter, D.A. and Kimber, I. (2003) Evaluation of protein allergenic potential in mice: dose-response analyses. Clin. Exp. Allergy, 33, 1586-1594.

Banerjee, B., Kurup, V.P., Greenberger, P.A., Kelly, K.J. and Fink, J.N. (2002) C-terminal cysteine residues determine the IgE binding of Aspergillus fumigatus allergen Asp f 2. J. Immunol., 169, 5137-5144.

Takai, T., Yuuki, T., Okumura, Y., Mori, A. and Okudaira, H. (1997) Determination of the N- and C-terminal sequences required to bind human IgE of the major house dust mite allergen Der f 2 and epitope mapping for monoclonal antibodies. Mol. Immunol., 34, 255-261.

Mine, Y., Sasaki, E. and Zhang, J.W. (2003) Reduction of antigenicity and allergenicity of genetically modified egg white allergen, ovomucoid third domain. Biochem. Biophys. Res. Comm., 302, 133-137.

Saha, S. & G. P. Raghava (2006), AlgPred: Prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Res, 34, W202-9.

Tong, J. C., G. L. Zhang, T. W. Tan, J. T. August, V. Brusic & S. Ranganathan: Prediction of HLA-DQ3.2beta ligands: evidence of multiple registers in class II binding peptides, Bioinformatics, 22, 1232-8 (2006).

Zhang, Z. H., J. L. Koh, G. L. Zhang, K. H. Choo, M. T. Tammi & J. C. Tong (2007), AllerTool: a web server for predicting allergenicity and allergic cross-reactivity in proteins, Bioinformatics, 23, 504-6 (2007).

Zhang, Z. H., S. C. Tan, J. L. Koh, A. Falus & V. Brusic (2006), ALLERDB database and integrated bioinformatic tools for assessment of allergenicity and allergic cross-reactivity, Cell Immunol, 244, 90-6 (2006).

Claims

1. A method for predicting at least one target property of a target polypeptide, the method comprising:

(a) a learning phase of (i) for each of a plurality of polypeptides not including the target polypeptide, forming a respective training example comprising the target property of the respective polypeptide and, for each of a plurality of pre-defined regions of the polypeptide, one or more descriptor values representing corresponding properties of the region of the peptide; and (ii) training a non-linear prediction model using the training examples; and

(b) a prediction phase of:

(i) obtaining for the target polypeptide, the values of said descriptors for the pre-defined regions of the target polypeptide; and

(ii) inputting the obtained descriptor values for the target polypeptide into the non-linear prediction model; whereby the non-linear prediction model outputs a value which is a prediction of the target property of the target polypeptide.

2. A method according to claim 1, wherein, for at least one of the regions, the properties comprise one or more of the charge, polarity, polarizability, hydrophobicity, size, bulkiness, relative mutability, solvent accessibility or normalized van der Waals volume of at least one amino acid residue in the region.

3. A method according to claim 1, wherein for each property, the amino acids are grouped into classes such that the amino acids of different classes exhibit the property to different degrees, and the descriptor values for that property describe how the amino acid residues of the region fall into the classes.

4. A method according to claim 3, wherein, for at least one property, at least one descriptor value represents the proportion of amino acid residues containing a particular class of the property along the region.

5. A method according to claim 3 or claim 4, wherein, for at least one property, at least one descriptor value represents the frequency with which a particular property changes from one class to another class along the region.

6. A method according to any of claims 3 to 5, wherein, for at least one property, at least one descriptor value represents the location along the region of the first, 25, 50, 75 and 100% of amino acid residues with a specific class of the property.

7. A method according to any one of claims 1 to 6, wherein the number of pre-defined regions is at least five.

8. A method according to any preceding claim, wherein the non-linear prediction system is selected from the group consisting of support vector machine, probabilistic function, artificial neural network, hidden Markov model, multiple regression and Bayesian network.

9. A method according to any of claims 1 to 7 in which the non-linear prediction model is a support vector machine, and the step of training the support vector machine comprises the steps of creating a hyperplane separating and classifying the training examples according to a kernel function, the method including a process of optimising the kernel function.

10. A method according to any preceding claim, wherein the target property is an allergenicity.

11. A method according to any preceding claim, wherein the target property is the binding ability to a specific antibody or class of antibodies.

12. A computer system comprising a processor programmed to perform a method according to any one of claims 1 to 11.

13. A computer program product comprising software executable by a computer system to cause the computer system to perform the method of any one of claims 1 to 11.