WO2008033100A1 - Method of predicting protein allergenicity using a support vector machine - Google Patents


Info

Publication number
WO2008033100A1
Authority
WO
WIPO (PCT)
Prior art keywords
property
polypeptide
target
region
regions
Prior art date
Application number
PCT/SG2007/000293
Other languages
French (fr)
Other versions
WO2008033100A8 (en)
Inventor
Joo Chuan Tong
Zong Hong Zhang
Original Assignee
Agency For Science, Technology And Research
Priority date
Filing date
Publication date
Application filed by Agency For Science, Technology And Research
Publication of WO2008033100A1
Publication of WO2008033100A8

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the region E is the subsequence "GGYGYYG" (SEQ ID:2).
  • Let n1 be the number of small amino acid residues (such as glycine) and n2 the number of large amino acid residues (such as tyrosine) within a specific region.
  • the values of the corresponding descriptors for the other regions can be calculated in a similar manner.
  • Another possible descriptor measures the percent frequency with which there is a transition from small to large residues.
  • the transitions for all other regions can be calculated in the same way.
  • the first small residue is the first residue, and the first 25%, 50%, 75% and 100% of small residues within region E are located within the first 1, 2, 4 and 7 residues respectively.
  • the corresponding D descriptors for large residues (tyrosine) can be calculated similarly.
  • the embodiment is used as a prediction system for assessment of potential allergenicity of protein sequences.
  • the polypeptide database comprised 1906 sequences (669 allergens and 1237 non-allergens).
  • the polypeptide database was divided into training and testing sets.
  • the training set consists of 631 IUIS approved allergens from the ALLERDB database (Zhang et al., 2006; the disclosure of which is incorporated by reference) and 1219 non-allergens derived from Bjorklund et al., 2005 (of which the disclosure is incorporated by reference).
  • This partition was performed using a de-biasing strategy based on sequence similarity of protein sequences commonly found in consumed food with no records in existing allergen databases (Saha et al., 2006). Allergens represent approximately 34% of the testing dataset, while non-allergens represent the remaining 66%.
  • the testing dataset includes 38 IUIS allergens and 18 experimentally validated non-allergens extracted from the literature (Chakraborty et al., 2000; Laffer et al., 2003; Epton et al., 2002; Rihs et al., 2003; Ortona et al., 2003; Szakos et al., 2004; Dearman et al., 2001; Dearman et al., 2003; Banerjee et al., 2002; Takai et al., 1997; Mine et al., 2003).
  • amino acids were grouped into three classes for each property so that, as mentioned above, a total of 210 descriptors per property were used to describe each protein sequence: 30 for C, 30 for T and 150 for D. Thus, over the eight properties (8 × 210), there were 1680 values in the feature vector for each polypeptide.
  • the implementation used a support vector machine (SVM) as the non-linear prediction model.
  • a comprehensive coverage of SVMs is provided in the literature (Joachim, 2002; Vapnik, 1998).
  • SVMs belong to a class of statistical learning methods based on the structural risk minimization principle. It is known for the inputs to the SVM to be binary strings or feature vectors representing encoded representations of amino acid attributes previously reported as significant for characterization of protein families.
  • the parameters of the SVM were trained by mapping the input vectors into a high dimensional feature space and constructing an optimal separating hyperplane in the new feature space. The optimal separating hyperplane maximizes the margin between the positive and negative datasets and uniquely classifies the data into positive and negative examples.
  • Different kernel functions (linear, polynomial, radial, and sigmoid)
  • SE: sensitivity
  • SP: specificity
  • ROC: receiver operating characteristic
  • the accuracy of our predictions was assessed by ROC analysis where the ROC curve is generated by plotting SE as a function of (1-SP) for various classification thresholds.
  • the area under the ROC curve (AROC) provides a measure of overall prediction accuracy: AROC < 70% for poor, AROC > 80% for good and AROC > 90% for excellent predictions (Tong et al., 2006).
  • the predictive performances of different kernel functions were compared. In the example, the best results were obtained using a third degree polynomial kernel function.
  • the AROC value is 0.81.
  • the system can predict allergenic proteins with SE of 76.00% and SP of 76.00%.
  • the property encoding scheme explained above allows the embodiment to model multiple overlapping continuous and discontinuous B-cell epitope binding patterns within a protein sequence.
  • the system is trained using official allergens approved by the International Union of Immunological Societies (IUIS) Allergen Nomenclature Sub-Committee plus non-allergens commonly found in consumed food with no records in existing allergen databases, and tested on experimentally validated allergens and non-allergen sequences.
  • An advantage of the adaptive system herein described is that it takes into account conformational and overlapping B-cell epitope recognition sites. This results in improved prediction accuracy.
  • the training set consisted of 559 IUIS approved allergens that do not belong to the Betulaceae or Birch family (i.e. nut-bearing trees) from the ALLERDB database (Zhang et al., 2006) and 1219 non-allergens randomly extracted using in-house filtering software as described above.
  • the test dataset included 110 official allergens derived from the Betulaceae or Birch family.
  • the embodiment has the additional advantage that all the predictions were produced using a single predictive model.
  • Non-allergenic antigen in allergic sensitization responses to the mite ferritin heavy chain antigen by allergic and non-allergic subjects.
  • AllerTool: a web server for predicting allergenicity and allergic cross-reactivity in proteins, Bioinformatics, 23, 504-6 (2007).
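
The ROC analysis mentioned in the list above (SE plotted against 1-SP over a range of classification thresholds, with the area under the curve as the summary figure) can be illustrated with a short sketch. The function names, the toy labels and the toy scores are hypothetical and are not taken from the patent; the trapezoidal rule is one common way to obtain the area and is assumed here.

```python
from typing import List, Sequence, Tuple


def sensitivity_specificity(labels: Sequence[int], scores: Sequence[float],
                            threshold: float) -> Tuple[float, float]:
    """Return (SE, SP) when scores >= threshold are called positive.

    Assumes labels are 1 (allergen) or 0 (non-allergen) and that both
    classes are present, otherwise a division by zero occurs.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (1 - SP, SE) pairs, one per distinct threshold, starting at (0, 0)."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        se, sp = sensitivity_specificity(labels, scores, threshold)
        points.append((1.0 - sp, se))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): 3 positives, 3 negatives.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = area_under_curve(roc_points(toy_labels, toy_scores))
```

On this toy data eight of the nine allergen/non-allergen pairs are ranked correctly, so the area is 8/9.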

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Peptides Or Proteins (AREA)

Abstract

A method for predicting the allergenicity of a protein using a non-linear prediction model based on an algorithm such as a probabilistic function, artificial neural network, hidden Markov model, multiple regression or Bayesian network, and in particular a support vector machine. The method comprises a learning phase where a data set of allergenic proteins is used to train the prediction model, and a prediction phase where the allergenicity of a protein is determined using the trained prediction model. In one embodiment the support vector machine is trained by creating a hyperplane separating and classifying the training examples according to a kernel function, where a third degree polynomial kernel function was optimal.

Description

Method of predicting protein allergenicity using a support vector machine
Field of the invention
The present invention relates to predicting the structure, function, reactivity and/or binding of polypeptides. In particular, the present invention relates to using a computational system implementing a non-linear prediction model to predict the structure, function, reactivity and/or binding of a polypeptide.
Background of the invention
Polypeptides are polymers consisting of a sequence of amino acid residues, each of which is one of twenty possible amino acid residues ("amino acid residues" are referred to as "amino acids" hereafter). The sequence of amino acids forming the polypeptide is referred to as its primary structure. When not otherwise constrained, the polypeptide folds into a three-dimensional structure determined by its primary structure. The three-dimensional structure has a large effect on the functionality, reactivity and/or binding of the polypeptide molecule.
The use of computational systems to determine the structure, functionality, reactivity and/or binding of polypeptides is a very promising field. This has wide applications in, for example, determining the allergenicity of a polypeptide, IgE interactions, B-cell epitopes or T-cell epitopes, or in protein family classification.
Following the various genome sequencing projects, including the human genome sequencing project, there is now a vast amount of polypeptide sequence information available. The use of computational systems to predict the structure, function, reactivity and/or binding of polypeptides complements and adds value to classical biological methods for determining the structure and function of polypeptides. In particular, the use of these computer modelling methods in allergy, antibody (IgA, IgD, IgG, IgE, IgM) interactions, B-cell epitope or T-cell epitope research integrates well into a clinical setting, such as in the diagnosis, prevention and treatment of diseases such as allergies, infections, cancer, immune diseases and the like.
Summary of the invention
The present invention aims to provide new and useful methods, computer systems and software for prediction of properties of polypeptides.
In general terms, the invention proposes that a target property of a polypeptide is predicted by defining a plurality of regions of the polypeptide, and for each of the regions obtaining one or more descriptors indicating a property of that sub-region. This data is fed to a non-linear prediction model which has previously been trained on corresponding data from other polypeptides.
Specifically, using a training set of polypeptides, a dataset of training examples is formed comprising, for each of the polypeptides, the values of the target property and the corresponding values of the plurality of descriptors for each of the regions of the polypeptides. During a learning phase, the training set is used to train an adaptive non-linear prediction model. In a prediction phase, the target property of a polypeptide which is not part of the training set is predicted using the trained non-linear prediction model.
According to a specific expression, the present invention provides a method for predicting at least one target property of a target polypeptide, the method comprising:
(a) a learning phase of:
(i) for each of a plurality of polypeptides not including the target polypeptide, forming a respective training example comprising the target property of the respective polypeptide and, for each of a plurality of pre-defined regions of the polypeptide, the values of one or more descriptors indicating corresponding properties of the region of the peptide; and
(ii) training a non-linear prediction model using the training examples; and
(b) a prediction phase of:
(i) obtaining for the target polypeptide, the values of said descriptors for the pre-defined regions of the target polypeptide; and
(ii) inputting the obtained descriptor values for the target polypeptide into the adaptive model;
whereby the adaptive model outputs a value which is a prediction of the target property of the target polypeptide.
The polypeptides in the training set may exhibit the target property in varying degrees. Polypeptides exhibiting the target property strongly, intermediately, weakly or not at all may all be used in training the adaptive system.
The number of regions is at least two, but any higher number is possible, such as at least 5 or at least 10.
Each of the regions is preferably continuous (i.e. consists of a set of amino acids which are next to each other along the polypeptide). The plurality of regions preferably contains at least two regions which overlap. Preferably every amino acid in the polypeptide is part of at least two regions. More preferably, each amino acid is within at least three regions. Optionally, one of the regions may be the entire sequence.
One or more of the properties described by the descriptor values may be physico-chemical properties of the amino acid residues, including but not limited to any one or more of: charge, polarizability, polarity, hydrophobicity, bulkiness, relative mutability, solvent accessibility, and/or normalized van der Waals volume.
In forming the descriptor values for a given amino acid property, it is helpful to categorize amino acids into classes, such that the amino acids of one class all have the property to a higher level than all the amino acids of the other. For example, if the property under consideration is size, we may define two classes (big/small) and partition amino acids into these two classes. The descriptor values may then each be such as to indicate how many amino acids in the region are in one or more of the classes and/or how the amino acids in one or more of the classes are distributed along the region.
More specifically, the descriptors may be selected as ones which fall into one of three types: composition (C), transition (T) and distribution (D).
C-type descriptors represent the composition of the polypeptide from the point of view of a given amino acid property, by measuring the percentage of residues within the regions falling into each class of the property (e.g. what proportion of residues are "small").
T-type descriptors represent the frequency with which a particular amino acid property changes from one class to another along the entire region (e.g. the number of transitions from large to small and vice versa).
D-type descriptors represent the distribution pattern of a particular amino acid property along the entire region by measuring the location of the first residue, and of the first 25%, 50%, 75% and 100% of residues, which fall into a certain class (e.g. the locations of the first small residue, and the first 25%, 50%, 75% and 100% of the small residues).
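The three descriptor types can be sketched for a single region and a single class of residues as follows. This is a minimal illustration only: the function name is hypothetical, the normalisation of T by the number of adjacent residue pairs is an assumption, the rounding-up rule used for the 25%, 50% and 75% points is an assumption, and D is returned as 1-based residue positions rather than as a percentage of the region length.

```python
from typing import List, Set, Tuple


def ctd_descriptors(region: str, target_class: Set[str]) -> Tuple[float, float, List[int]]:
    """Return (C, T, D) for one class of residues within one region.

    C: percentage of residues in the region that belong to the class.
    T: percentage of adjacent residue pairs at which class membership changes
       (assumed normalisation: number of changes / (length - 1)).
    D: 1-based positions of the first class member and of the residues that
       complete 25%, 50%, 75% and 100% of the class members (assumed rounding: up).
    """
    if not region:
        raise ValueError("region must contain at least one residue")
    n = len(region)
    in_class = [residue in target_class for residue in region]
    count = sum(in_class)

    composition = 100.0 * count / n

    changes = sum(1 for a, b in zip(in_class, in_class[1:]) if a != b)
    transition = 100.0 * changes / (n - 1) if n > 1 else 0.0

    positions = [i + 1 for i, flag in enumerate(in_class) if flag]
    distribution: List[int] = []
    if positions:
        distribution.append(positions[0])
        for percent in (25, 50, 75, 100):
            # Integer ceiling of count * percent / 100, never below 1.
            k = max(1, (count * percent + 99) // 100)
            distribution.append(positions[k - 1])
    return composition, transition, distribution


# A glycine/tyrosine region with "small" = glycine (G) and "large" = tyrosine (Y).
c_small, t_small, d_small = ctd_descriptors("GGYGYYG", {"G"})
```

For the region "GGYGYYG" with glycine as the small class, four of the seven residues are small, and the D positions come out as 1, 1, 2, 4 and 7.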
The non-linear prediction model may be a support vector machine (SVM), an artificial neural network, a hidden Markov model, or another statistical model, such as a multiple regression model or a Bayesian network.
The target property may be the allergenicity of a polypeptide. Alternatively, it may be degree of interaction with a specific antibody or class of antibodies. For example, it may be the level of binding to the IgE class of antibodies. IgE interactions may lead to a cascade of events which eventually leads to an allergic response in a subject. The target property may also be prediction of B-cell epitopes, or T-cell epitopes. Furthermore, it may be used in protein family classification. The methods of the present invention are general and can be extended to the prediction for all types of protein functions and reactivity.
The invention facilitates cyclical refinement of predictive models for improved accuracy by inclusion of new experimental data. Having predicted that a certain polypeptide will have a certain activity, it is possible to test it to check the result, and if the prediction is wrong modify the model, thereby refining it.
Certain embodiments of the invention facilitate high accuracy predictions of protein function for which no experimental data are available.
Certain embodiments of the invention also enable large-scale screening of proteins for a given function and have the advantage that they can generally be used for the prediction of protein functionality for various protein families.
An advantage of some embodiments of the present invention is that they amplify the regional weights of important portions of the polypeptide sequence (e.g. overlapping binding or interaction sites) by combining local information about regions (which may or may not be overlapping) that are scattered throughout the sequence. Utilizing overlapping and/or separate regions of the polypeptide may provide high accuracy prediction when the data from the different regions are combined.
Brief description of the figures
Having now generally described the invention, an embodiment of the invention will now be described for the sake of example only with reference to the accompanying figures in which:
Figure 1 is a flow diagram of steps of a method which is an embodiment of the invention;
Figure 2 is a schematic diagram of the division of a polypeptide sequence into 10 regions (A-J) in a first specific example of the invention;
Figure 3 is a schematic diagram of a polypeptide sequence consisting of glycine (G) and tyrosine (Y) residues divided into 10 regions (A-J) of varying length or composition according to the scheme of Figure 2, and representing multiple overlapping continuous and discontinuous epitopes; and
Figure 4 illustrates how polypeptide descriptor values are fed into an adaptive network in the embodiment.
Detailed description of the embodiment
An embodiment of the invention will now be described. It has the overall sequence of steps illustrated in Figure 1.
In a first step (step 1) a dataset of training examples is produced using a training set of polypeptides. The dataset includes for each polypeptide in the training set: a target property of the polypeptide, and one or more descriptor values for each of a number of regions of the polypeptide. Step 1 may include a number of sub-steps:
1. Deriving descriptor values from each of a set of regions of the polypeptides in the training set.
2. Combining the descriptors for each polypeptide in the training set.
Optionally, step 1 may also include removing any amino acids which are identical in all the polypeptides of the training set, and which will be identical in all polypeptides whose properties the embodiment will be expected to predict. Such amino acids are not taken into account when deriving the descriptor values.
Step 1 further includes a sub-step of converting the combined descriptors into a format suitable for the adaptive system which is used in step 2, if that system has particular data format requirements.
In a second step (step 2) the dataset is used to train an adaptive non-linear prediction model.
In a third step (step 3) corresponding descriptor values are derived for corresponding regions of a new polypeptide which was not part of the training set.
In a fourth step (step 4) the adaptive system is used to predict the target property of a new polypeptide.
An explanation of the definition of the descriptor values used in the embodiment is now presented, followed by two experimental implementations of the embodiment.
1. Definition of the descriptor values
In this embodiment, each protein sequence is divided into 10 regions labelled as Regions A to J in Figure 2. These regions are selected to capture both sequential and conformational binding sites. The procedure begins by dividing a candidate sequence into 4 distinct, disjoint regions spanning the entire length of the protein (Regions A to D). Next, pairs of adjacent regions are combined to form the next set of local regions to be investigated (Regions E to G). Following this, Regions E to G are each extended by a quarter of the length of the entire protein sequence to form Regions H to J.
Descriptors of three types (composition (C), transition (T) and distribution (D)) are used to represent the properties of each region. A descriptor of type C represents the composition of a region from the point of view of a given amino acid property, by measuring the percentage of residues having that particular property along the specified region. A descriptor of type T represents the percentage frequency with which a particular amino acid property changes along the entire region. A descriptor of type D characterizes the distribution pattern of a particular property along the entire region by measuring the location of the first residue with the property, and the location of the first 25, 50, 75 and 100% of residues with the property (Dubchak et al., 1995; Cai et al., 2003).
For example, consider a hypothetical protein sequence "GGYGYYGGGYYYGG" (SEQ ID: 1) containing 8 glycines and 6 tyrosines. This is divided into regions using the scheme of Figure 2, to give the result shown in Figure 3. Note that the regions A to D need not all be the same length (some are four amino acids long; some are three amino acids long). Each of these regions is approximately one quarter of the length of the chain, but some are rounded up and some down to give integer numbers. Exactly how this is done does not have a great effect on the prediction accuracy of the embodiment. Note that, as shown in Figure 3, the amino acids which are part of epitope 1 happen all to fall within region E, epitope 2 is encapsulated by region I, and epitope 3 is approximately region B.
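The region scheme can be sketched in a few lines of Python. This is a minimal illustration rather than the embodiment's own code: the function name split_regions is hypothetical, the quarter boundaries use integer floor division (one of several acceptable roundings, as noted above), and the exact extents of Regions H to J are an assumption (H = A to C, I = B to D, J = the whole sequence), since Figure 2 is not reproduced here.

```python
def split_regions(seq):
    """Return the ten regions A-J of seq as a dict of substrings."""
    n = len(seq)
    # Quarter boundaries; floor division is an assumed rounding choice.
    b = [i * n // 4 for i in range(5)]
    spans = {
        "A": (0, 1), "B": (1, 2), "C": (2, 3), "D": (3, 4),  # disjoint quarters
        "E": (0, 2), "F": (1, 3), "G": (2, 4),               # adjacent pairs
        # Assumed layout for H-J (pairs extended by one more quarter).
        "H": (0, 3), "I": (1, 4), "J": (0, 4),
    }
    return {name: seq[b[i]:b[j]] for name, (i, j) in spans.items()}


regions = split_regions("GGYGYYGGGYYYGG")
print(regions["A"], regions["B"], regions["C"], regions["D"])
print(regions["E"])
```

With this rounding, Regions A to D have lengths 3, 4, 3 and 4, and Region E is the first seven residues of the sequence.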
As shown in Figure 3, the region E is the subsequence "GGYGYYG" (SEQ ID: 2). Let us consider the amino acid property which is the size of the amino acid. Let n1 be the number of small amino acid residues (such as glycine) and n2 be the number of large amino acid residues (such as tyrosine) within a specific region. The values of n1 and n2 in region E are thus n1 = 4 and n2 = 3, and we can define two descriptors of type C as the respective percentages: n1 / (n1 + n2) x 100.00 = 57.14 and n2 / (n1 + n2) x 100.00 = 42.86. The values of the corresponding descriptors for the other regions can be calculated in a similar manner.
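The type C calculation for region E can be reproduced with a short sketch. The function name composition and the representation of each class as a string of one-letter residue codes are illustrative assumptions; the sketch also assumes every residue of the region falls into one of the classes, so that the region length equals n1 + n2.

```python
SMALL, LARGE = "G", "Y"  # two-class size grouping used in the worked example


def composition(region, classes):
    """Percentage of residues in region that belong to each class."""
    return [100.0 * sum(r in cls for r in region) / len(region) for cls in classes]


c_values = composition("GGYGYYG", [SMALL, LARGE])
print([round(v, 2) for v in c_values])  # [57.14, 42.86]
```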
Another possible descriptor (of type T) measures the percent frequency with which there is a transition between small and large residues (in either direction). In region E, there are 4 such transitions among the 6 pairs of adjacent residues, giving a percent frequency of (4/6) x 100.00 = 66.67. The transitions for all other regions can be calculated in the same way.
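A minimal sketch of the type T calculation follows. The function name transition is hypothetical, and returning 0.0 for a region with fewer than two residues is an assumed convention that the description does not address.

```python
def transition(region, class_a, class_b):
    """Percent of adjacent residue pairs that switch between the two classes."""
    pairs = list(zip(region, region[1:]))
    if not pairs:  # assumed convention for regions shorter than two residues
        return 0.0
    changes = sum(
        (x in class_a and y in class_b) or (x in class_b and y in class_a)
        for x, y in pairs
    )
    return 100.0 * changes / len(pairs)


t_value = transition("GGYGYYG", "G", "Y")
print(round(t_value, 2))  # 66.67
```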
The first small residue is the first residue, and the first 25%, 50%, 75% and 100% of small residues within region E are located within the first 1, 2, 4 and 7 residues respectively. Five descriptors (of type D) for small residues (glycine) can thus be derived as 1/7 x 100.00 = 14.29, 1/7 x 100.00 = 14.29, 2/7 x 100.00 = 28.57, 4/7 x 100.00 = 57.14, 7/7 x 100.00 = 100.00. The corresponding D descriptors for large residues (tyrosine) can be calculated similarly.
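The five type D values for glycine in region E can be checked with the sketch below. The function name distribution is hypothetical. Two conventions are assumptions not fixed by the description: the residue that completes a given fraction of the class is found by rounding up (ceiling), and a class that is absent from the region yields five zeros.

```python
import math


def distribution(region, cls):
    """Relative positions (in percent) of the first, 25%, 50%, 75% and 100%
    of the residues in region that belong to cls."""
    positions = [i for i, r in enumerate(region, start=1) if r in cls]
    if not positions:  # assumed convention for an absent class
        return [0.0] * 5
    n = len(positions)
    # 1-based rank of the class member that completes each fraction
    # (ceiling rounding is an assumption).
    ranks = [1] + [max(1, math.ceil(f * n)) for f in (0.25, 0.50, 0.75, 1.00)]
    return [100.0 * positions[k - 1] / len(region) for k in ranks]


d_small = distribution("GGYGYYG", "G")
print([round(v, 2) for v in d_small])  # [14.29, 14.29, 28.57, 57.14, 100.0]
```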
All three types of descriptors (C, T and D) from all ten regions (A - J) were calculated, combined, and used as a feature vector for SVM training. In summary, 130 descriptor values are used to describe the polypeptide shown in Figure 3: 20 of type C, 10 of type T and 100 of type D.
The above example used a polypeptide having just glycine and tyrosine residues, but note that the procedure can be applied to a polypeptide including any amino acid residues, provided that each amino acid is classified as either "large" or "small".
Furthermore, while the above example assumed that amino acid residues are treated as falling into just two classes (large/small), if instead there are three classes, then the number of descriptor values becomes 210: 30 of type C, 30 of type T and 150 of type D.
The same procedure can be carried out for each of the properties under consideration, producing a feature vector which is a representation of all the descriptors (of types C, D and T) for each region. This complete process is shown schematically in Figure 4. In Figure 4, 8 properties are considered (charge, polarizability, polarity, hydrophobicity, bulkiness, relative mutability, solvent accessibility, and normalized van der Waals volume), and the figure shows schematically how the descriptors of the three types (represented by the smallest squares) are obtained, and combined to give a feature vector.
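The two totals (130 and 210) follow from a simple count, sketched below. The function name descriptor_counts is hypothetical. Counting one type T value per unordered pair of classes is an inference from the two totals quoted in the text (1 pair for two classes, 3 pairs for three classes), not a rule stated explicitly in the description.

```python
from math import comb


def descriptor_counts(n_classes, n_regions=10):
    """Number of C, T and D values for one property, plus their total."""
    c = n_classes * n_regions           # one composition value per class
    t = comb(n_classes, 2) * n_regions  # one transition value per class pair (inferred)
    d = 5 * n_classes * n_regions       # five distribution values per class
    return c, t, d, c + t + d


print(descriptor_counts(2))  # (20, 10, 100, 130)
print(descriptor_counts(3))  # (30, 30, 150, 210)
```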
2. First Experimental Implementation
In this example, which we will refer to elsewhere as "AllerPred", the embodiment is used as a prediction system for assessment of potential allergenicity of protein sequences.
2.1 The polypeptide database
The polypeptide database comprised 1906 sequences (669 allergens and 1237 non-allergens).
The polypeptide database was divided into training and testing sets. The training set consists of 631 IUIS approved allergens from the ALLERDB database (Zhang et al., 2006; the disclosure of which is incorporated by reference) and 1219 non-allergens derived from Bjorklund et al., 2005 (of which the disclosure is incorporated by reference). This partition was performed using a de-biasing strategy based on sequence similarity of protein sequences commonly found in consumed food with no records in existing allergen databases (Saha et al. 2006). Allergens represent ~34% of the training dataset (631 of 1850 sequences), while non-allergens represent the remaining 66%.
The testing dataset includes 38 IUIS allergens and 18 experimentally validated non-allergens extracted from the literature (Chakraborty et al., 2000; Laffer et al., 2003; Epton et al., 2002; Rihs et al. 2003; Ortona et al. 2003; Szakos et al. 2004; Dearman et al. 2001; Dearman et al. 2003; Banerjee et al. 2002; Takai et al. 1997; Mine et al. 2003).
2.2 Derivation of the dataset from the training set
For each of the eight properties shown in Figure 4, the amino acids were grouped into three classes so that, as mentioned above, for each property a total of 210 descriptors were used to describe each protein sequence: 30 for C, 30 for T and 150 for D. Thus, there were 1680 values in the feature vector for each polypeptide.
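The assembly of a feature vector from regions, classes and descriptor types can be sketched as follows. Everything named here is illustrative: the functions split_regions, ctd and feature_vector are hypothetical, the three-class hydrophobicity grouping is an example grouping and is not taken from this document, the layout of Regions H to J is assumed, and the rounding conventions are assumptions. The sketch only demonstrates that three classes over ten regions give 210 values per property, and 1680 values for eight properties.

```python
from itertools import combinations
from math import ceil

# Illustrative three-class grouping (polar, neutral, hydrophobic); an assumption.
HYDROPHOBICITY = ("RKEDQN", "GASTPHY", "CLVIMFW")


def split_regions(seq):
    """Ten regions: quarters, adjacent pairs, and assumed three-quarter spans."""
    n = len(seq)
    b = [i * n // 4 for i in range(5)]
    spans = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 3), (2, 4),
             (0, 3), (1, 4), (0, 4)]
    return [seq[b[i]:b[j]] for i, j in spans]


def ctd(region, classes):
    """C, T and D descriptor values for one region and one property."""
    labels = [next((k for k, c in enumerate(classes) if r in c), None)
              for r in region]
    n = len(labels)
    ids = range(len(classes))
    comp = [100.0 * labels.count(k) / n if n else 0.0 for k in ids]
    pairs = list(zip(labels, labels[1:]))
    trans = [
        100.0 * sum({x, y} == {a, b} for x, y in pairs) / len(pairs) if pairs else 0.0
        for a, b in combinations(ids, 2)
    ]
    dist = []
    for k in ids:
        pos = [i for i, lab in enumerate(labels, start=1) if lab == k]
        for f in (0.0, 0.25, 0.5, 0.75, 1.0):
            rank = max(1, ceil(f * len(pos)))
            dist.append(100.0 * pos[rank - 1] / n if pos else 0.0)
    return comp + trans + dist


def feature_vector(seq, properties):
    """Concatenate the descriptors of every region for every property."""
    return [v for classes in properties
            for region in split_regions(seq)
            for v in ctd(region, classes)]


vec = feature_vector("MKVLAAGIVGLCARWEDQNSTPHYF", [HYDROPHOBICITY])
print(len(vec))  # 210
```

Passing eight class groupings (one per property) instead of one yields a vector of 8 x 210 = 1680 values.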
2.3 The non-linear prediction model
The implementation used a support vector machine (SVM) as the non-linear prediction model. A comprehensive coverage of SVMs is provided in the literature (Joachim, 2002; Vapnik, 1998). In brief, SVMs belong to a class of statistical learning methods based on the structural risk minimization principle. It is known for the inputs to the SVM to be binary strings or feature vectors representing encoded representations of amino acid attributes previously reported as significant for characterization of protein families. In the first implementation of the embodiment, the parameters of the SVM were trained by mapping the input vectors into a high dimensional feature space and constructing an optimal separating hyperplane in the new feature space. The optimal separating hyperplane maximizes the margin between the positive and negative datasets and uniquely classifies the data into positive and negative examples. Different kernel functions (linear, polynomial, radial, and sigmoid) were explored to optimize the prediction accuracy of the SVM models.
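For orientation, the four kernel families mentioned above can be written out as plain functions. This is a sketch under assumptions: the parameter names gamma, coef0 and degree follow a common SVM-library convention, and their default values here are placeholders, since the description does not state the settings that were used.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear(u, v):
    return dot(u, v)


def polynomial(u, v, gamma=1.0, coef0=0.0, degree=3):
    # degree=3 corresponds to a third-degree polynomial kernel
    return (gamma * dot(u, v) + coef0) ** degree


def radial(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear(u, v), polynomial(u, v), radial(u, u))  # 11.0 1331.0 1.0
```

In practice the kernel is evaluated on the descriptor feature vectors, and the kernel type and its parameters are chosen by comparing cross-validated performance.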
2.4 Model evaluation
For each kernel function, 10-fold internal cross-validation was performed to assess the quality of the model (Tong et al., 2006). In k-fold cross-validation, k random, (approximately) equal-sized, disjoint partitions of the sample data are constructed, and a given model is trained on (k-1) partitions and tested on the excluded partition. The results are averaged after k such experiments, and the observed error rate may be taken as an estimate of the error rate expected upon generalization to new data.
The predictive performance of each model was assessed using sensitivity (SE), specificity (SP) and receiver operating characteristic (ROC) analysis as described previously (Tong et al., 2006). SE=TP/(TP+FN) and SP=TN/(TN+FP) represent percentages of correctly predicted allergens and non-allergens, respectively. TP (true positives) stands for allergens correctly predicted as allergens and TN (true negatives) for non-allergens correctly predicted as non-allergens. FN (false negatives) refers to allergens predicted as non-allergens and FP (false positives) represents non-allergens predicted as allergens. The accuracy of our predictions was assessed by ROC analysis where the ROC curve is generated by plotting SE as a function of (1-SP) for various classification thresholds. The area under the ROC curve (AROC) provides a measure of overall prediction accuracy, AROC<70% for poor, AROC>80% for good and AROC>90% for excellent predictions (Tong et al., 2006).
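The partitioning step of k-fold cross-validation can be sketched as below. The generator name k_fold_indices and the fixed random seed are illustrative assumptions; the model training and scoring that would run inside the loop are omitted.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k disjoint, near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)       # random assignment to folds
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(1850, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 1665 185
```

Each sample appears in exactly one test fold, so averaging the k test scores uses every training example once for evaluation.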
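The three measures can be computed as in the following sketch. The function names and the toy labels and scores are illustrative only and are not data from the experiments; labels use 1 for allergen and 0 for non-allergen, and the area is obtained with the trapezoidal rule over the points (1-SP, SE) produced by sweeping the threshold across the observed scores.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (SE, SP) for binary labels, with 1 = allergen."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """Trapezoidal area under the curve of SE against (1 - SP)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for th in sorted(set(scores), reverse=True):
        tp = sum(t == 1 and s >= th for t, s in zip(y_true, scores))
        fp = sum(t == 0 and s >= th for t, s in zip(y_true, scores))
        points.append((fp / n_neg, tp / n_pos))
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0, 0]                # toy labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # toy classifier outputs
y_pred = [int(s >= 0.5) for s in scores]
se, sp = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(se, round(sp, 4), round(auc, 4))  # 1.0 0.6667 0.8889
```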
2.5 Results
The predictive performances of different kernel functions (linear, polynomial, radial, and sigmoid) were compared. In the example, the best results were obtained using a third degree polynomial kernel function. The AROC value is 0.81. Using amino acid composition as input for training and testing, the system can predict allergenic proteins with SE of 76.00% and SP of 76.00%.
A variety of techniques have previously been described for prediction of allergenicity in proteins using different testing datasets. To benchmark our system, the same testing dataset, comprising 38 IUIS allergens and 18 experimentally validated non-allergens, was used to evaluate three existing allergenicity prediction techniques: wavelet transform models (Li et al., 2004), SVM models based on global descriptors (Cai et al., 2003) and sequence similarity search based on FAO/WHO Codex Alimentarius guidelines (Fiers et al., 2004).
The results indicate that the embodiment, which utilizes local sequence descriptors for training SVM models (AROC=0.81), consistently outperforms SVM models based on global sequence descriptors (AROC=0.71; Cai et al., 2003), wavelet analysis (AROC=0.69; Li et al., 2004) and FAO/WHO sequence similarity search (AROC=0.58; Fiers et al., 2004).
2.6 Discussion
Collectively, the results indicate that the embodiment can make accurate predictions for potential allergenicity of proteins that have been validated using IUIS allergens and experimentally validated non-allergen sequences.
The property encoding scheme explained above allows the embodiment to model multiple overlapping continuous and discontinuous B-cell epitope binding patterns within a protein sequence. The system is trained using official allergens approved by the International Union of Immunological Societies (IUIS) Allergen Nomenclature Sub-Committee plus non-allergens commonly found in consumed food with no records in existing allergen databases, and tested on experimentally validated allergens and non-allergen sequences.
An advantage of the adaptive system herein described is that it takes into account conformational and overlapping B-cell epitope recognition sites. This results in improved prediction accuracy.
3. Second Experimental Implementation
In this second implementation, the training set consisted of 559 IUIS approved allergens that do not belong to the Betulaceae or Birch family (i.e. nut-bearing trees) from the ALLERDB database (Zhang et al., 2006) and 1219 non-allergens randomly extracted using in-house filtering software as described above.
The test dataset included 110 official allergens derived from the Betulaceae or Birch family. The prediction results were of high accuracy (AROC=0.75). Because the allergen has numerous binding sites scattered throughout the entire molecule, the embodiment's selection of local descriptors may not be the optimal representation of the said family and may be further improved. Nonetheless, the high accuracy of prediction indicates that the embodiment is capable of generalizing to new data. The embodiment has the additional advantage that all the predictions were produced using a single predictive model.
References
The disclosure in the following references is hereby incorporated in its entirety:
Bock, J.R. and Gough, D.A. (2002) A new method to estimate ligand-receptor energetics. Mol. Cell. Proteomics, 1, 904-910.
Björklund, A.K., Soeria-Atmadja, D., Zorzet, A., Hammerling, U. and Gustafsson, M.G. (2005) Supervised identification of allergen-representative peptides for in silico detection of potentially allergenic proteins. Bioinformatics, 21, 39-50.
Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X. and Chen, Y.Z. (2003) SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res., 31, 3692-3697.
Dubchak, I., Muchnik, I., Holbrook, S.R. and Kim, S.-H. (1995) Prediction of protein class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 92, 8700-8704.
FAO/WHO (2003) Codex Principles and Guidelines on Foods Derived from Biotechnology. Rome, Italy, Joint FAO/WHO Food Standards Programme.
Chakraborty, S., Chakraborty, N. and Datta, A. (2000) Increased nutritive value of transgenic potato by expressing a nonallergenic seed albumin gene from Amaranthus hypochondriacus. Proc. Natl. Acad. Sci. USA, 97, 3724-3729.
Laffer, S., Hamdi, S., Lupinek, C., Sperr, W.R., Valent, P., Verdino, P., Keller, W., Grate, M., Hoffmann-Sommergruber, K., Scheiner, O., Kraft, D., Rideau, M. and Valenta, R. (2003) Molecular characterization of recombinant T1, a nonallergenic periwinkle (Catharanthus roseus) protein, with sequence similarity to the Bet v 1 plant allergen family. Biochem. J., 373, 261-269.
Li, K.-B., Issac, P. and Krishnan, A. (2004) Predicting allergenic proteins using wavelet transform. Bioinformatics, 20, 2572-2578.
Epton, M.J., Smith, W., Hales, B.J., Hazell, L., Thompson, P.J. and Thomas, W.R. (2002) Non-allergenic antigen in allergic sensitization: responses to the mite ferritin heavy chain antigen by allergic and non-allergic subjects. Clin. Exp. Allergy, 32, 1341-1347.
Rihs, H.P., Dumont, B., Rozynek, P., Lundberg, M., Cremer, R., Bruning, T. and Raulf-Heimsoth, M. (2003) Molecular cloning, purification, and IgE-binding of a recombinant class I chitinase from Hevea brasiliensis leaves (rHev b 11.0102). Allergy, 58, 246-251.
Fiers, M.W., Kleter, G.A., Nijland, H., Peijnenburg, A.A., Nap, J.P. and van Ham, R.C. (2004) Allermatch™, a webtool for the prediction of potential allergenicity according to current FAO/WHO Codex alimentarius guidelines. BMC Bioinformatics, 5, 133.
Ortona, E., Margutti, P., Delunardo, F., Vaccari, S., Rigano, R., Profumo, E., Buttari, B., Teggi, A. and Siracusano, A. (2003) Molecular and immunological characterization of the C-terminal region of a new Echinococcus granulosus Heat Shock Protein 70. Parasite Immunol., 25, 119-126.
Szakos, E., Lakos, G., Aleksza, M., Gyimesi, E., Pall, G., Fodor, B., Hunyadi, J., Solyom, E. and Sipka, S. (2004) Association between the occurrence of the anticardiolipin IgM and mite allergen-specific IgE antibodies in children with extrinsic type of atopic eczema/dermatitis syndrome. Allergy, 59, 164-167.
Dearman, R.J. and Kimber, I. (2001) Determination of protein allergenicity: studies in mice. Toxicol. Lett., 120, 181-186.
Dearman, R.J., Stone, S., Caddick, H.T., Basketter, D.A. and Kimber, I. (2003) Evaluation of protein allergenic potential in mice: dose-response analyses. Clin. Exp. Allergy, 33, 1586-1594.
Banerjee, B., Kurup, V.P., Greenberger, P.A., Kelly, K.J. and Fink, J.N. (2002) C-terminal cysteine residues determine the IgE binding of Aspergillus fumigatus allergen Asp f 2. J. Immunol., 169, 5137-5144.
Takai, T., Yuuki, T., Okumura, Y., Mori, A. and Okudaira, H. (1997) Determination of the N- and C-terminal sequences required to bind human IgE of the major house dust mite allergen Der f 2 and epitope mapping for monoclonal antibodies. Mol. Immunol., 34, 255-261.
Mine, Y., Sasaki, E. and Zhang, J.W. (2003) Reduction of antigenicity and allergenicity of genetically modified egg white allergen, ovomucoid third domain. Biochem. Biophys. Res. Comm., 302, 133-137.
Saha, S. and Raghava, G.P. (2006) AlgPred: Prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Res., 34, W202-9.
Tong, J.C., Zhang, G.L., Tan, T.W., August, J.T., Brusic, V. and Ranganathan, S. (2006) Prediction of HLA-DQ3.2beta ligands: evidence of multiple registers in class II binding peptides. Bioinformatics, 22, 1232-8.
Zhang, Z.H., Koh, J.L., Zhang, G.L., Choo, K.H., Tammi, M.T. and Tong, J.C. (2007) AllerTool: a web server for predicting allergenicity and allergic cross-reactivity in proteins. Bioinformatics, 23, 504-6.
Zhang, Z.H., Tan, S.C., Koh, J.L., Falus, A. and Brusic, V. (2006) ALLERDB database and integrated bioinformatic tools for assessment of allergenicity and allergic cross-reactivity. Cell Immunol., 244, 90-6.

Claims
1. A method for predicting at least one target property of a target polypeptide, the method comprising:
(a) a learning phase of:
(i) for each of a plurality of polypeptides not including the target polypeptide, forming a respective training example comprising the target property of the respective polypeptide and, for each of a plurality of pre-defined regions of the polypeptide, one or more descriptor values representing corresponding properties of the region of the peptide; and
(ii) training a non-linear prediction model using the training examples; and
(b) a prediction phase of:
(i) obtaining for the target polypeptide, the values of said descriptors for the pre-defined regions of the target polypeptide; and
(ii) inputting the obtained descriptor values for the target polypeptide into the non-linear prediction model; whereby the non-linear prediction model outputs a value which is a prediction of the target property of the target polypeptide.
2. A method according to claim 1, wherein, for at least one of the regions, the properties comprise one or more of the charge, polarity, polarizability, hydrophobicity, size, bulkiness, relative mutability, solvent accessibility or normalized van der Waals volume of at least one amino acid residue in the region.
3. A method according to claim 1, wherein for each property, the amino acids are grouped into classes such that the amino acids of different classes exhibit the property to different degrees, and the descriptor values for that property describe how the amino acid residues of the region fall into the classes.
4. A method according to claim 3, wherein, for at least one property, at least one descriptor value represents the proportion of amino acid residues containing a particular class of the property along the region.
5. A method according to claim 3 or claim 4, wherein, for at least one property, at least one descriptor value represents the frequency with which a particular property changes from one class to another class along the region.
6. A method according to any of claims 3 to 5, wherein, for at least one property, at least one descriptor value represents the location along the region of the first, 25, 50, 75 and 100% of amino acid residues with a specific class of the property.
7. A method according to any one of claims 1 to 6, wherein the number of pre-defined regions is at least five.
8. A method according to any preceding claim, wherein the non-linear prediction system is selected from the group consisting of support vector machine, probabilistic function, artificial neural network, hidden Markov model, multiple regression and Bayesian network.
9. A method according to any of claims 1 to 7 in which the non-linear prediction model is a support vector machine, and the step of training the support vector machine comprises the steps of creating a hyperplane separating and classifying the training examples according to a kernel function, the method including a process of optimising the kernel function.
10. A method according to any preceding claim, wherein the target property is an allergenicity.
11. A method according to any preceding claim, wherein the target property is the binding ability to a specific antibody or class of antibodies.
12. A computer system comprising a processor programmed to perform a method according to any one of claims 1 to 11.
13. A computer program product comprising software executable by a computer system to cause the computer system to perform the method of any one of claims 1 to 11.
PCT/SG2007/000293 2006-09-11 2007-09-06 Method of predicting protein allergenicity using a support vector machine WO2008033100A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US84350906P 2006-09-11 2006-09-11
US60/843,509 2006-09-11

Publications (2)

Publication Number Publication Date
WO2008033100A1 (en) 2008-03-20
WO2008033100A8 WO2008033100A8 (en) 2009-07-23

Family

ID=39184050

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2007/000293 WO2008033100A1 (en) 2006-09-11 2007-09-06 Method of predicting protein allergenicity using a support vector machine

Country Status (1)

Country Link
WO (1) WO2008033100A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339180B (en) * 2008-08-14 2012-05-23 南京工业大学 Organic compound combustion and explosion characteristic prediction method based on support vector machine
CN104252581A (en) * 2013-06-26 2014-12-31 中国科学院深圳先进技术研究院 Method for predicting transmembrane protein residue function relationship based on SVM (support vector machine)
CN104615910A (en) * 2014-12-30 2015-05-13 中国科学院深圳先进技术研究院 Method for predicating helix interactive relationship of alpha transmembrane protein based on random forest
CN105181933A (en) * 2015-09-11 2015-12-23 北华航天工业学院 Method of predicting soil compressibility coefficient
CN106066910A (en) * 2016-05-30 2016-11-02 中国地质大学(武汉) A kind of pointwise band weight polynomial locus model method for building up based on kernel function
CN106339755A (en) * 2016-08-29 2017-01-18 深圳市计量质量检测研究院 Lithium battery SOH (State of Health) prediction method based on neural network and periodic kernel functions GPR
CN107169532A (en) * 2017-06-14 2017-09-15 北京航空航天大学 A kind of car networking fuel consumption data method for evaluating quality based on wavelet analysis and semi-supervised learning
EP3293240A4 (en) * 2015-05-07 2018-09-26 The School Corporation Kansai University Agent having anti-ice nucleation activity
CN112951341A (en) * 2021-03-15 2021-06-11 江南大学 Polypeptide classification method based on complex network
CN113591399A (en) * 2021-08-23 2021-11-02 贵州大学 Short-term wind power prediction method
CN113936748A (en) * 2021-11-17 2022-01-14 西安电子科技大学 Molecular recognition characteristic function prediction method based on ensemble learning
CN114708931A (en) * 2022-04-22 2022-07-05 中国海洋大学 Method for improving prediction precision of drug-target activity by combining machine learning and conformation calculation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704559B (en) * 2019-09-09 2021-04-16 武汉大学 Multi-scale vector surface data matching method

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
BJÖRKLUND A. ET AL.: "Supervised identification of allergen-representative peptides for in silico detection of potentially allergenic proteins", BIOINFORMATICS, vol. 21, no. 1, 2005, pages 39 - 50 *
BRUSIC V. ET AL.: "Computatinonal methods for prediction of T-cell epitopes-a framework for modelling, testing, and applications", METHODS, vol. 34, 2004, pages 436 - 443 *
CUI J. ET AL.: "Computer prediction of allergen proteins from sequence-derived protein structural and physicochemical properties", MOLECULAR IMMUNOLOGY, vol. 44, 2007, pages 514 - 520, XP005622933, DOI: doi:10.1016/j.molimm.2006.02.010 *
LI K.-B. ET AL.: "Predicting allergenic proteins using wavelet transform", BIOINFORMATICS, vol. 20, no. 16, 1 November 2004 (2004-11-01), pages 2572 - 2578 *
RIAZ T. ET AL.: "WebAllergen: a web server for predicting allergenic proteins", BIOINFORMATICS, vol. 21, no. 10, 15 May 2005 (2005-05-15), pages 2570 - 2571 *
SAHA S. ET AL.: "AlgPred: prediction of allergenic proteins and mapping of IgE epitopes", NUCLEIC ACIDS RESEARCH, vol. 34, 1 July 2006 (2006-07-01) *
SAHA S. ET AL.: "Prediction of Continuous B-Cell Epitopes in an Antigen Using Recurrent Neural Network", PROTEINS: STRUCTURE, FUNCTION, AND BIOINFORMATICS, vol. 65, 2006, pages 40 - 48 *
SOERIA-ATMADJA D. ET AL.: "Statistical Evaluation of Local Alignment Features Predicting Allergenicity Using Supervised Classification Algorithms", INTERNATIONAL ARCHIVES OF ALLERGY AND IMMUNOLOGY, vol. 133, 2004, pages 101 - 112 *
ZHANG Z.H. ET AL.: "AllerTool: a web server for predicting allergenicity and allergic cross-reactivity in proteins", BIOINFORMATICS, vol. 23, no. 4, 2007, pages 504 - 506 *
ZORZET A. ET AL.: "Prediction of Food Protein Allergenicity: A Bio-informatic Learning Systems Approach", IN SILICO BIOLOGY, vol. 2, 2002, pages 525 - 534 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339180B (en) * 2008-08-14 2012-05-23 南京工业大学 Organic compound combustion and explosion characteristic prediction method based on support vector machine
CN104252581A (en) * 2013-06-26 2014-12-31 中国科学院深圳先进技术研究院 Method for predicting transmembrane protein residue function relationship based on SVM (support vector machine)
CN104615910A (en) * 2014-12-30 2015-05-13 中国科学院深圳先进技术研究院 Method for predicating helix interactive relationship of alpha transmembrane protein based on random forest
EP3293240A4 (en) * 2015-05-07 2018-09-26 The School Corporation Kansai University Agent having anti-ice nucleation activity
CN105181933B (en) * 2015-09-11 2017-04-05 北华航天工业学院 The method of prediction soil compression coefficient
CN105181933A (en) * 2015-09-11 2015-12-23 北华航天工业学院 Method of predicting soil compressibility coefficient
CN106066910A (en) * 2016-05-30 2016-11-02 中国地质大学(武汉) A kind of pointwise band weight polynomial locus model method for building up based on kernel function
CN106339755A (en) * 2016-08-29 2017-01-18 深圳市计量质量检测研究院 Lithium battery SOH (State of Health) prediction method based on neural network and periodic kernel functions GPR
CN107169532A (en) * 2017-06-14 2017-09-15 北京航空航天大学 A kind of car networking fuel consumption data method for evaluating quality based on wavelet analysis and semi-supervised learning
CN107169532B (en) * 2017-06-14 2020-07-03 北京航空航天大学 Internet of vehicles fuel consumption data quality evaluation method based on wavelet analysis and semi-supervised learning
CN112951341A (en) * 2021-03-15 2021-06-11 江南大学 Polypeptide classification method based on complex network
CN112951341B (en) * 2021-03-15 2024-04-30 江南大学 Polypeptide classification method based on complex network
CN113591399A (en) * 2021-08-23 2021-11-02 贵州大学 Short-term wind power prediction method
CN113936748A (en) * 2021-11-17 2022-01-14 西安电子科技大学 Molecular recognition characteristic function prediction method based on ensemble learning
CN114708931A (en) * 2022-04-22 2022-07-05 中国海洋大学 Method for improving prediction precision of drug-target activity by combining machine learning and conformation calculation
CN114708931B (en) * 2022-04-22 2023-01-24 中国海洋大学 Method for improving prediction precision of drug-target activity by combining machine learning and conformation calculation

Also Published As

Publication number Publication date
WO2008033100A8 (en) 2009-07-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07808925

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07808925

Country of ref document: EP

Kind code of ref document: A1