METHODS AND SYSTEMS FOR THE PREDICTION OF THE BIOLOGICAL FUNCTION OF BIOPOLYMERS
CROSS-REFERENCE TO RELATED APPLICATIONS This application claims priority to U.S. Provisional Application No. 60/366,778, filed March 25, 2002, the entire disclosure of which is incorporated herein by reference.
NOTICE OF COPYRIGHT PROTECTION A portion of the disclosure of this patent document and its figures contains material subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document, but otherwise reserves all copyrights whatsoever.
FIELD OF THE INVENTION The present invention generally relates to prediction of the function of a protein sequence. The present invention more particularly relates to prediction of the function utilizing quantitative measures of the constituents of a protein sequence.
BACKGROUND DNA is the repository of information about much of life as we know it.
Experiments such as the Human Genome Project have produced large amounts of DNA sequence data that need to be classified, and many research efforts are underway trying to make sense of that information. DNA sequences fully determine the amino acid sequences of their expressed proteins. The information contained in DNA sequences includes redundancies, noise, and special signal sequences (such as terminators). DNA directly expresses RNA sequences, which can function on their own or be used as amino acid templates to create proteins. These amino acid sequences are the primary structures of the proteins. Much research has gone into determining the secondary structure of a protein given its primary structure. An interesting question is whether the secondary structure is necessary to determine important properties of a protein.
Proteins can be classified in many ways: solubility (albumins, histones, etc.), shape (globular or fibrous), 3-dimensional structure (helices, sheets, etc.), and biological
function are examples. A natural and useful classification of proteins is according to biological function. The problem then becomes how to determine biological function from primary sequence structure. Such a classification system (based on the functionality of the proteins) can then be used in assessing the suitability of a given sequence as a target for new drug design as well as in decoding the human DNA. In drug design, a protein function classifier can test the suitability of a large number of proteins for a given set of functionalities and, without any lab experiment, reduce the search space by ranking the most suitable proteins as targets for a given biological problem. In addition, a protein function prediction system can discover the functions of the coding sequences in the human DNA.
Several attributes related to the composition and distribution of amino acids are available within a given sequence. For example, the molecular weight of a protein (the sum of the atomic weights of the atoms contained in the protein) is a very simple classification that is experimentally observable and is directly related to the amino acid composition. Unfortunately, size alone does not correlate well with biological function. For example, although isomerase, protease, and lipase enzymes are very close in size, they vary greatly in function. An example of the distribution of amino acids being useful for biological function classification is that collagen contains a primary structure where every third residue is glycine, preceded by proline or hydroxyproline. One broad classification of biological function is as an enzyme. Enzymes are proteins that facilitate chemical reactions and are named according to the substrate they act upon. Proteins that facilitate breaking down lipids (fats) are called lipases. For example, a lipase enzyme needs a lipophilic part of the protein to interact with the lipids. A lipase enzyme also needs an active site, which may be a non-contiguous sequence of amino acids (for example, Ser126, His176, and Asp206 form the active site of a lipase but are far apart in sequence number). This information is buried within the primary structure and is not easily recognized. A statistical model might be useful in extracting this type of information.
Classifying protein sequences according to function is not a new problem. Conventional methods for classifying protein functions include three-dimensional and other imaging techniques and multiple alignment. Conventional three-dimensional imaging techniques include freezing a protein,
crystallizing it, and using NMR or X-ray to create an image of the crystalline structure. While three-dimensional imaging can be accurate for determining the structure of a protein, it is also very time consuming. In the year or so that it takes to perform the imaging, the protein may have mutated in many new and unexpected ways. These mutations sometimes change the function of the protein sequence. In addition, the images that are created may be of lower resolution than is optimal, forcing the scientist to guess at the actual shape of the protein.
Conventional multiple alignment methods assume that if two proteins are similar in structure, they are also similar in function (See, e.g., U.S. Patent Nos. 6,023,659 and 5,845,049). Sequence databases have been used to train neural networks as classifiers.
The features used for classification have included complete primary structures and N-gram encodings of primary structures. Complete primary structure encodings involve translating each amino acid to a unique identifier. The identifier can be, for example, a set of bit flags (21 bits: one for each amino acid and one for no amino acid). In many conventional multiple alignment methods, only a portion of the sequence is analyzed, and the analysis is combined with heuristics to try to determine the function.
Other encodings include a relative hydrophobicity scale and accepted point mutation values. The set of identifiers can then be presented to a neural network as either one long vector or, as in the N-gram method, as a set of fixed length vectors. These algorithms can be extremely time consuming and tend to be inaccurate predictors of the function of proteins. For example, a single mutation may cause a protein to function in a much different way even though the structure is almost identical to the original sequence.
Other statistical methods have also been applied to the classification of protein structures (See, e.g., U.S. Patent No. 6,304,868). These techniques are "nearest neighbor" type methods where the nearness is a measure of distances as changes in primary structures. These studies do not address the question of whether or not there are features other than the entire primary structure that can be used to classify proteins. This is an important point since for a given protein classification there is a large range of amino acid sequence sizes. There exists a need to create more accurate and faster models that predict the function of proteins while filtering out uninformative repetitions and patterns.
SUMMARY Embodiments of the present invention include methods and systems for predicting a biological function of a biopolymer, such as a protein or DNA. One exemplary embodiment of the present invention encodes and uses the functions of amino acids in a protein to predict the overall functions of the protein. One method according to the present invention includes determining the biochemical properties of the amino acids, creating a representation of the protein by combining the biochemical properties, and predicting the function of the protein based at least in part on the representation. Many biochemical properties may be relevant to the function of a protein. For example, in one embodiment, hydrophobicity, solubility, and a quantum value are determined for each amino acid.
Another method according to the present invention includes determining signal- processing measures for the protein based at least in part on the biochemical properties and classifying the protein based at least in part on the signal-processing measures. These measures may include, for example, fractal dimension, entropy, mobility, and complexity. The biochemical properties may be measured or retrieved from a database.
Embodiments of the present invention provide numerous advantages over conventional methods for determining the function of a protein. An embodiment of the present invention provides a significantly more accurate and reliable predictor of protein function. Unlike in conventional methods, the encoding process is not arbitrary and is directly based on biochemical properties of amino acids. Also, in an embodiment of the present invention utilizing measures of a protein, one classifier is able to classify a protein having an arbitrary length. The resulting model may be processed much more quickly than a model including the entire structure of the protein and requires no specialized computing facilities. Also, an embodiment of the present invention provides accurate predictions for proteins with very similar structure but highly differentiated function. Further details and advantages of the present invention are set forth below.
BRIEF DESCRIPTION OF THE FIGURES These and other features, aspects, and advantages of the present invention are better understood when the following Detailed Description is read with reference to the accompanying drawings, wherein:
Figure 1 is a flowchart illustrating a method according to the present invention for predicting the function of a protein based on length-dependent measures;
Figure 2 is a flowchart illustrating a method according to the present invention for predicting the function of a protein using length-independent measures; Figure 3 is a plot showing one standard deviation of Higuchi fractal dimension measures for the solubility encoding of the three protein classes in one embodiment of the present invention;
Figure 4 is a plot showing one standard deviation of Higuchi fractal dimension measures for the hydrophobicity encoding of three protein classes studied in one embodiment of the present invention;
Figure 5 is a plot showing one standard deviation of complexity measures for the solubility encoding of three protein classes studied in one embodiment of the present invention;
Figure 6 is a plot showing one standard deviation of complexity measures for the hydrophobicity encoding of three protein classes studied in one embodiment of the present invention;
Figure 7 is a plot showing one standard deviation of mobility measures for the solubility encoding of three protein classes studied in one embodiment of the present invention; Figure 8 is a plot showing one standard deviation of mobility measures for the hydrophobicity encoding of three protein classes studied in one embodiment of the present invention; and
Figure 9 is a block diagram illustrating an implementation of an embodiment of the present invention in a drug discovery process.
DETAILED DESCRIPTION Embodiments of the present invention include methods and systems for predicting a biological function of a biopolymer. In one embodiment, a biochemical property of each of the plurality of monomers in the biopolymer is determined, a representation of the biopolymer is created by combining the biochemical properties, and a prediction of biological function is made based on the representation. In another embodiment,
measures are determined for the representation and the prediction is based at least in part on the measures.
The biopolymer may include a polymeric group and a side chain. In one embodiment, the biopolymer is a protein, wherein the polymeric group is an amino acid and the side chain is an aliphatic, an organic acid, an amide, a sulfur, an alcohol, an organic base, an aromatic, or an imine type. In another embodiment, the biopolymer is DNA, wherein the polymeric group is a phosphodeoxyribose or phosphoribose unit, and the side chain is either a purine (adenine or guanine) or a pyrimidine (thymine or cytosine). In one embodiment, the features used for prediction are length-dependent measures based on the biochemical properties of an amino acid. In another embodiment, the features are length-independent measures. In such embodiments, a classifier, such as a neural classifier, evaluates the features based on features of proteins having known function to predict the function of other proteins. Referring now to the drawings in which like numerals indicate like elements throughout the several figures, Figure 1 illustrates a method according to the present invention for predicting the function of a protein based on length-dependent measures. The method shown may comprise program code stored in a computer-readable medium and executed by a processor or may include a combination of both manual and computerized processes. In an embodiment implemented as computer program code, a processor first identifies the sequence of amino acids that constitute a protein of interest 102.
Processors can include, for example, digital logic processors capable of processing input, executing algorithms, and generating output. Such processors may include a microprocessor, an Application Specific Integrated Circuit (ASIC), and state machines. Such processors include, or may be in communication with, media, for example computer-readable media, which stores instructions that, when executed by the processor, cause the processor to perform the steps described herein as carried out, or assisted, by a processor. Embodiments of computer-readable media include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the
processor in a web server, with computer-readable instructions. Other examples of suitable media include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, ASIC, configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read. Also, various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel.
In one embodiment, the processor executes a neural network to evaluate at least one length-dependent measure of the amino acids in the protein sequence. Neural networks are valuable tools, closely related to statistical tools, used for classifying data. The multilayer nature of neural networks allows them to discover non-linear higher order correlations among the data. Neural networks consist of a set of interconnected "neurons", decision units that are activated based on their inputs. The inputs are weighted, the weights being determined during fitting of the input data. To assess the suitability of a given neural network, the input data is divided into a training set for creating the weights and a testing set for evaluating the neural network. Multiple layers can be used where the outputs from one layer are the inputs to the next. There are many variables associated with developing a neural network to model a problem. There are choices in the number of neurons and number of layers, the learning algorithm, and the activation functions of the neurons.
Neural networks, once properly trained, are fast classifiers. For example, three protein families - lipases, proteases, and isomerases - may be considered along with the further restriction of evaluating primary structures with 100 to 200 amino acids. This limitation is arbitrary, and embodiments of the present invention are not limited to evaluating structures with between 100 and 200 amino acids.
In determining the structure of the protein 102, the processor may search a database. For example, in one embodiment, the protein primary structures are obtained from the Swiss Protein Sequence database, which is freely available over the World Wide Web (http://www.ebi.ac.uk/swissprot/). A keyword search of the database for the proteins used in the examples described herein results in 295 lipases, 415 proteases, and 426 isomerases having 100 to 200 amino acid units in length.
For each amino acid in the protein sequence, the processor next determines a biochemical property of the amino acid 104. Determining the biochemical property may include searching a database, calculating the property from basic measures of the amino acid, or performing some other task. The biochemical property may include, for example, the hydrophobicity, solubility, or quantum value of the amino acid.
The sequence data in the Swiss Protein Sequence database use the single-letter amino acid notation. This notation allows for some altered and unnamed amino acids, which are not part of the usual 20 amino acids expressed by DNA. The primary structures are preprocessed such that each amino acid is assigned a number from 1 to 20 and blanks (no amino acid present) are assigned the number 0. The sequences are padded to the upper bound of their length range. For example, any sequence with fewer than 200 amino acids is padded to 200 with zeros to form the input vector for that particular sequence.
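The integer encoding and zero padding described above can be sketched as follows. This is a minimal illustration: the helper names are our own, and the 20-letter alphabet uses the standard one-letter amino acid codes.

```python
# Sketch of the integer encoding and zero padding described above.
# Amino acids map to tokens 1..20; 0 denotes a blank (no amino acid).

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard one-letter codes
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode_sequence(seq, max_len=200):
    """Map each residue to its integer token and pad with zeros to max_len."""
    tokens = [CODE.get(aa, 0) for aa in seq.upper()]
    return tokens + [0] * (max_len - len(tokens))

vec = encode_sequence("MKVL", max_len=8)
print(vec)  # [11, 9, 18, 10, 0, 0, 0, 0]
```

A full-length input vector for the networks described here would use `max_len=200`.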
The processor may retrieve or calculate multiple biochemical properties for each amino acid 106. Also, if additional amino acids exist in the sequence 108, the processor repeats the process of determining each biochemical property for each of the amino acids in the sequence 104-106.
Once the processor determines the biochemical properties of each of the amino acids in the sequence, the processor combines the biochemical properties to create a representation of the protein sequence 110. Once the biochemical properties are combined, the processor utilizes a classifier to predict the function of the protein sequence
116.
Various types of classifiers may be used in various embodiments, including a neural network, a maximum likelihood classifier, a Bayesian classifier, a fuzzy classifier, a rough set classifier, a nearest neighbor classifier, a Mahalanobis classifier, and a regression tree classifier. In one embodiment utilizing a neural network classifier, the neural network is trained using a backpropagation learning algorithm, such as the Bayesian Regularization Backpropagation algorithm. Training is the process of fitting the neural network weights and biases to the data (the training data in this case). Backpropagation refers to updating the weights using a gradient descent method, proceeding backward (from the output toward the input) during a training cycle. The adjustments occur after a complete presentation of the input data set, called an epoch in neural network terminology. As there are many variations for
function minimization there are many variations of the training algorithm for neural networks.
Regularization adds another parameter to the optimization process. This parameter controls the requirements for "smoothness" such that the optimization can be constrained to keep a smooth function at the expense of error or to minimize the error only. The Bayesian framework considers the neural network weights as random variables that can be updated using Bayes rule. Bayesian Regularization allows choosing much smaller neural networks with reasonable training times and good generalization.
The data relationships are expected to be non-linear, so at least one layer needs a non-linear activation function. One such non-linear activation function is the tangent-sigmoidal function, and many other functions are available. The number of neurons and layers in the neural network may vary. For example, the number of neurons and layers may be based on previous studies of the proteins of interest.
In one embodiment, the first neural network to be trained is based on the entire primary sequence of the proteins. This provides a baseline set of results for comparison to the feature extraction methods. For each class of protein, the input sequences are randomly separated into a training set and testing set. The distribution of sequences among the data files is summarized in Table 1. The reason for having a testing set of data for each protein family is the need to verify the generalization properties of the classifiers and to discover any possible overfitting problem.
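The random separation of each protein family into training and testing sets can be sketched as follows. The 80/20 split ratio and all names are illustrative assumptions; the actual distribution used in the examples is the one summarized in Table 1.

```python
# Minimal sketch of randomly separating one protein family's sequences
# into a training set and a testing set. The 80/20 ratio is an
# illustrative assumption, not the split reported in Table 1.
import random

def split_family(sequences, train_fraction=0.8, seed=0):
    """Shuffle one family's sequences and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = sequences[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

family = [f"seq{i}" for i in range(295)]  # e.g. the 295 lipases
train, test = split_family(family)
print(len(train), len(test))  # 236 59
```

Keeping a held-out testing set per family is what allows the generalization and overfitting checks discussed below.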
Table 1. Data Distribution for Sequences of 100 to 200 Amino Acids
The neural networks may be created using an application, such as MATLAB 6.0 using the Neural Network Toolbox (http://www.mathworks.com). The type of neural network used in the example described is a feedforward-backpropagation network with 16 hidden nodes and 3 output nodes. The hidden nodes use the tansig activation function and the output nodes use the purelin activation function. The outputs are 1 for lipase, 2 for protease, and 3 for isomerase. Training is performed for three error cutoff values: 1E-2, 1E-3, and 1E-4, using a scaled conjugate gradient method.
Translating each amino acid into its relative hydrophobicity creates another set of input values; translating each amino acid into its solubility value creates yet another. Hydrophobicity is a measure of a free amino acid's aversion to an aqueous environment. While solubility is conceptually related to hydrophobicity, the available experimental values are not correlated.
In one embodiment, the first network is trained using the complete sequence of amino acids in the proteins (using token values 0 - 20 to represent the amino acids). The effect of overfitting can be explored by changing the convergence criteria. The trained neural network's performance is summarized in Table 2. A comparison of convergence criteria clearly indicates the problem of data overfitting, as the difference between the testing and training results is around 30% in all cases. The smallest convergence criterion of 1E-4 performs better in classifying the test isomerase data than the 1E-2 converged network, but the opposite is true of the test data for protease (the 1E-2 network outperforms the 1E-4 network).
A separate neural network is trained to recognize sequences listed as the hydrophobic values of the amino acids. The range of hydrophobic values is 1.05 to 4.11. For elements of the input vector with no amino acid, a value of 0.0 is used. The resulting network, trained to target convergences of 1E-2 and 1E-4, is shown in Table 3. Comparing these values to the appropriate columns of Table 2 shows that the hydrophobic values are slightly more classifiable than integer markers for individual amino acids. It is important to note that hydrophobic values will in some cases map differing amino acids to much closer values (arginine has a hydrophobic value of 2.23 and threonine has a hydrophobic value of 2.25) than integers (where the difference between amino acids can be no less than 1). Overall, the number of misclassifications of test data using hydrophobicity is 91 for the convergence value of 1E-4 compared to 105 for the integer encoding. However, because the training and testing accuracies of the classifier with 1E-2 convergence are close in all cases, this classifier is preferred over that of 1E-4 convergence in order to avoid overfitting. Even with the 1E-2 classifier, there is some degree of overfitting, as the difference between the testing and training accuracies amounts to about 30%.
Table 3. Results of Neural Network for Complete Amino Acid Sequence where Amino Acids are Represented by their Hydrophobicity Values

The third set of data uses solubility values. These values have a broader range than the hydrophobicity values: they range from 0.778 for aspartic acid to 162.3 for proline. Three amino acids have values of "very" soluble. In the embodiment shown, an arbitrary high number, 500, is used for these amino acids. Also, sequences are padded with the 0 value. The results in Table 4 show an overall better fit than hydrophobicity (the total wrongly classified is 68 for solubility versus 91 for hydrophobicity at 1E-4 convergence).
Even though there are still some overfitting concerns for lipase and protease sequences, there is no clear pattern of better performance for testing at the lower convergence
criteria. In addition, it is interesting that with this set of values the isomerases are correctly classified for test and training data at both levels of convergence criteria. This measure seems to make the isomerase stand out against the other two enzyme types.
Table 4. Results of Neural Network for Complete Amino Acid Sequence where Amino Acids are Represented by their Solubility Values

The embodiments described above are clearly length-dependent, i.e., the developed models work on proteins of up to a given fixed length. Moreover, in order to encode shorter proteins, zero padding is needed. Such a zero-padding process is not optimal either from the signal processing point of view or from the standpoint of biological or chemical properties. This also limits the use of such classifiers for functional prediction of longer protein sequences, as for each given length a new classifier has to be trained and used.
In another embodiment, length-independent measures (features) are used to predict the function of a protein. The description below compares their classification capabilities with the above-described length-dependent measures. Figure 2 is a flowchart illustrating a method according to the present invention for predicting the function of a protein using length-independent measures.
Steps 202-210 of the process illustrated by Figure 2 are identical to steps 102-110 of the embodiment illustrated by Figure 1. A processor determines the protein sequence and determines one or more biochemical properties of the amino acids included in the protein sequence 202-208. The processor then combines the properties to create a representation of the protein sequence 210.
However, in the embodiment shown in Figure 2, rather than classifying the protein after determining the biochemical properties, the processor next determines a measure or feature of the representation 212. The measure may include any measure that is relevant to the functioning of the protein, such as fractal dimension, entropy, mobility, and complexity. The processor may determine multiple measures for each representation 214. And once the measures are calculated, the processor utilizes a classifier to predict the function of the protein sequence 216.
For example, in one embodiment complexity features are extracted for each input set and used to develop new neural networks. The measures include the complexity of the sequence, the mobility of the sequence, and the fractal dimension of the sequence. All these features have been previously used in signal processing literature and are known to extract the complexity of the patterns hidden in a sequence.
In one such embodiment, the input vectors form a set of three measures developed from their corresponding sequence. Each measure is calculated for a given protein sequence (with arbitrary length) and fed to a neural network. These measures are calculated for sequences formed by the integer encoding, hydrophobicity encoding, and solubility encoding of the protein sequence. These measures are calculated only on the actual signal portion of the sequence, i.e., the padded zeros are ignored, as zero padding is no longer needed for length-independent classifiers. The measures used are signal complexity measures, and hence all in some sense measure the relative changes in values along a sequence (based on whether the sequence has smooth regular changes or sharp seemingly random changes). The first two measures are Complexity and Mobility. These measures have been used in the biomedical signal processing literature for measuring variations in EEG and EMG signals, and can be defined as follows.
Assume that the characteristics of a given protein sequence are expressed as a vector x, where x_i represents the code assigned to the amino acid in location i of the sequence. Also, let d = Δx represent the vector of variations in x, i.e. d_i = x_i − x_{i−1}. Moreover, define g as the vector of variations in d, i.e. g = Δd. Then use the above definitions of x, d, and g to define the following fundamental statistical parameters:
Equation 1: S0 = √((Σ x_i²) / n)
Equation 2: S1 = √((Σ d_i²) / (n − 1))
Equation 3: S2 = √((Σ g_i²) / (n − 2))
where n is the number of amino acids in the sequence. Now, define the Complexity and Mobility factors as:
Equation 4: Complexity = √((S2²/S1²) − (S1²/S0²))
Equation 5: Mobility = S1/S0
As can be seen, Mobility addresses the first order variations of the signal, while Complexity deals with the first and the second order variations. Only the complexity measure is used as a feature here.
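The Complexity and Mobility factors can be sketched as follows. This is a minimal illustration assuming the S0, S1, and S2 terms are root-mean-square values of the sequence and of its first and second differences (the usual Hjorth form); the function name is our own.

```python
# Sketch of the Complexity and Mobility factors (Equations 1-5),
# assuming S0, S1, S2 are RMS values of the sequence and of its first
# and second differences. Assumes a non-constant sequence (S0, S1 > 0).
import math

def hjorth_measures(x):
    """Return (mobility, complexity) for a numeric sequence x."""
    n = len(x)
    d = [x[i] - x[i - 1] for i in range(1, n)]        # first differences
    g = [d[i] - d[i - 1] for i in range(1, len(d))]   # second differences
    s0 = math.sqrt(sum(v * v for v in x) / n)
    s1 = math.sqrt(sum(v * v for v in d) / (n - 1))
    s2 = math.sqrt(sum(v * v for v in g) / (n - 2))
    mobility = s1 / s0
    complexity = math.sqrt(s2 ** 2 / s1 ** 2 - s1 ** 2 / s0 ** 2)
    return mobility, complexity
```

For example, an alternating sequence such as [1, 2, 1, 2, 1, 2] yields a high complexity value, reflecting its sharp residue-to-residue changes.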
Another length-independent measure is the average entropy of a sequence. This uses the entropy measure:

Equation 6: Entropy = −Σ p_i log₂(p_i)

where the probabilities p_i for each amino acid are taken to be the relative frequencies within the Swiss Protein Databank.
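The entropy feature can be sketched as follows, assuming the standard Shannon form. For illustration, the probabilities are computed from the sequence itself rather than from the Swiss Protein Databank frequencies; that substitution and the function name are assumptions.

```python
# Sketch of a Shannon entropy feature for an amino acid sequence.
# Probabilities here are relative frequencies within the sequence
# itself (an illustrative substitute for database-wide frequencies).
import math
from collections import Counter

def average_entropy(seq):
    """Shannon entropy, in bits per residue, of the residue distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(average_entropy("ACAC"))  # 1.0
```

A sequence dominated by a single residue type scores near zero, while a sequence using many residue types uniformly scores higher.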
Another measure that may be used is the fractal dimension. Fractal dimension describes the complexity of a fundamental pattern that can generate the entire sequence by scaling and shifting of itself. Several methods of estimating the fractal dimension of a sequence have been reported in the signal processing literature. Among all these techniques, the Higuchi algorithm is known to be the most accurate and efficient method of estimating the fractal dimension, and as a result, is used in this research. See e.g., T. Higuchi, "Approach to an Irregular Time Series on the Basis of the Fractal Theory,"
Physica D, Vol. 31, pp. 277-283, 1988. From a time series with N points, a set of k sub-series is obtained, each with the same interval size k but a different initial point m (where m = 1, 2, 3, ..., k):
Equation 7: X_k^m : X(m), X(m + k), X(m + 2k), ..., X(m + ⌊(N − m)/k⌋·k)
The length of the curve X_k^m is:

Equation 8: L_m(k) = (Σ_i |X(m + i·k) − X(m + (i − 1)·k)|) · (N − 1) / (⌊(N − m)/k⌋ · k²)

where the sum runs over i = 1 to ⌊(N − m)/k⌋.
Plotting the logarithm of the average curve length versus log₂(k) gives an estimate of the fractal dimension as the negative slope of this plot. This is only an estimate, and our sequences are much shorter than the sequences used to justify Higuchi's algorithm. With this caveat in mind, these numbers are used only for comparison among sequences, and the usefulness of the measure is tested empirically to see whether the errors are biased. The Higuchi algorithm is known to those skilled in the art. Accordingly, a more detailed description of the Higuchi algorithm is not provided herein.
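With the same caveat, the Higuchi estimate outlined above can be sketched as follows. The normalization and the least-squares fit follow the standard formulation of the algorithm; the choice of kmax and the function name are illustrative assumptions.

```python
# Sketch of the Higuchi fractal-dimension estimate. For each interval
# size k, the mean normalized curve length L(k) is computed over the k
# sub-series; the dimension is the negative slope of log2 L(k) versus
# log2 k, fitted here by least squares.
import math

def higuchi_fd(x, kmax=8):
    n = len(x)
    log_k, log_l = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(1, k + 1):
            # sub-series X(m), X(m+k), ..., X(m + floor((N-m)/k)*k), 1-based
            num = (n - m) // k
            if num < 1:
                continue
            dist = sum(abs(x[m - 1 + i * k] - x[m - 1 + (i - 1) * k])
                       for i in range(1, num + 1))
            # normalization for sub-series of unequal effective length
            lengths.append(dist * (n - 1) / (num * k) / k)
        if lengths:
            log_k.append(math.log2(k))
            log_l.append(math.log2(sum(lengths) / len(lengths)))
    # least-squares slope of log L(k) vs log k; the dimension is -slope
    mk = sum(log_k) / len(log_k)
    ml = sum(log_l) / len(log_l)
    slope = (sum((a - mk) * (b - ml) for a, b in zip(log_k, log_l))
             / sum((a - mk) ** 2 for a in log_k))
    return -slope
```

As a sanity check, a straight-line sequence yields a dimension of approximately 1, as expected for a smooth signal.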
In an embodiment of the present invention, the optimal number of hidden neurons may vary. In the embodiment described below, only results from 3-8-3 and 3-10-3 networks are shown.
Table 5 shows results for the complexity measures of the integer encoding of the sequences. The performance of the neural network improves with the 3-8-3 architecture compared to the 3-10-3 architecture. The improvement comes at the expense of isomerase classification (isomerase gets worse but protease gets better with the increase in the number of hidden neurons). An important observation about the length-independent classifier used here is that in all cases the classifier avoids overfitting to a very high degree, i.e., the testing and training accuracies are fairly close to each other. This means that using these complexity measures results in classifiers that are much more reliable and exhibit significantly better generalization performance compared to the length-dependent ones discussed above. A comparison of the corresponding values in Tables 2 and 5 indicates that the testing classification errors (which are the true reliable measures of the quality of the classifiers) are very close to each other. This shows that the length-independent method, while avoiding overfitting, gives classification accuracies that are close to those of the length-dependent classifiers.
The final table (Table 6) shows the values for hydrophobic and solubility sequences as measured with the complexity measures. These results used the 3-8-3 neural network. A comparison of Tables 3 and 4 with Table 6 indicates that overfitting is much less visible in Table 6 (i.e., the length-independent algorithm). The errors in the testing set of Table 6 are moderately higher than the errors in Tables 3 and 4 for all cases except the solubility-encoded length-independent lipases. The full-length isomerase classification has no error. This is the opposite of the length-independent cases, where lipase and protease classification is better distinguished than isomerase classification.
The length-independent features in the embodiments described above can result in classifiers that provide classification accuracies very close to those of classification algorithms that use the entire protein sequence as the input to the classifier. Moreover, such length-independent classifiers are much less susceptible to the overfitting problem and give testing and training accuracies (or errors) that are very close to each other. This can be attributed to the fact that the smaller number of input neurons results in a simpler structure and a smaller number of parameters to be trained. This in turn avoids overfitting of the training data by the algorithm. The superiority of the length-independent algorithms from the standpoint of overfitting indicates that they are much more reliable (if they are trained to a desirable level of accuracy).
Also, classifiers using length-independent features can classify protein sequences of any length. This is a clear advantage over the length-dependent classifiers, which work only on sequences of a certain given length. Length-dependent classifiers must be trained for a given length, and as a result, in order to classify a number of proteins of different lengths, such classifiers must be separately trained.
The length-dependent classifier developed for solubility provided an accuracy level of up to 100%. Such a classifier can be used for a number of practical procedures in drug design and analysis of the human DNA. A functional classifier can reduce the search space for a protein that provides a certain set of functions and save a significant amount of lab experimentation by ranking the proteins that are most suitable for the desired functions. Also, such a classifier can predict the function of a gene in the human DNA. This will expedite the functional sequencing of the human DNA. In another embodiment of the present invention, proteins are obtained from the
Swiss Protein Sequence database. A keyword search results in 295 lipases, 415 proteases and 426 isomerases having 100 to 200 amino acid units in length. In this embodiment, the data are preprocessed to remove fragments (by removing any sequence with the word "fragment" in the description), yielding 254 lipases, 148 proteases and 275 isomerases for classification. These data ranges are arbitrary, and other embodiments utilize different input ranges. As in the previously-described embodiment, back-propagation is used.
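The fragment-removal preprocessing described above can be sketched as follows. This is a minimal illustration; the entry structure and the field name "description" are assumptions, not the actual database record format.

```python
def remove_fragments(entries):
    """Drop any database entry whose description mentions "fragment"."""
    return [e for e in entries
            if "fragment" not in e["description"].lower()]
```

Applied to the keyword-search results, this kind of filter reduces the 295/415/426 raw hits to the 254/148/275 sequences used for classification.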
In this embodiment, the data are randomly assigned to training or testing sets (roughly 70% training and 30% testing). The numbers assigned to each group for each class of data are shown in Table 7.
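The random 70/30 assignment can be sketched as follows; the helper name and the fixed seed are illustrative, not part of the described embodiment.

```python
import random

def split_data(items, train_frac=0.7, seed=42):
    """Randomly partition items into training (~70%) and testing (~30%) sets."""
    items = list(items)
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(items)
    cut = int(round(train_frac * len(items)))
    return items[:cut], items[cut:]
```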
Table 7. Data Distribution for Sequences of 100 to 200 Amino Acids
The neural networks are again created in MATLAB 6.0 using the Neural Network Toolbox. The type of neural network is a feedforward-backpropagation network with various numbers of hidden nodes in order to find the optimal network size. All nodes use the tansig activation function. For each classification, Isomerase, Lipase, or Protease, a neural network classifier is created and trained. The neural networks are trained to classify their chosen enzyme as +1 and the other two enzymes as -1.
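Since the MATLAB Neural Network Toolbox code is not reproduced here, the following Python sketch illustrates the same idea: a feedforward network with one tanh ("tansig") hidden layer trained by gradient-descent backpropagation on a squared-error loss. It is a simplified stand-in for the toolbox implementation, not the code actually used.

```python
import numpy as np

class TanhNet:
    """Minimal feedforward-backpropagation network with tanh nodes."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        # tanh plays the role of MATLAB's tansig activation
        self.h = np.tanh(X @ self.W1 + self.b1)
        self.y = np.tanh(self.h @ self.W2 + self.b2)
        return self.y

    def train(self, X, T, epochs=50, lr=0.01):
        for _ in range(epochs):
            Y = self.forward(X)
            # backpropagate the squared error through both tanh layers
            d2 = (Y - T) * (1 - Y ** 2)
            d1 = (d2 @ self.W2.T) * (1 - self.h ** 2)
            self.W2 -= lr * self.h.T @ d2
            self.b2 -= lr * d2.sum(axis=0)
            self.W1 -= lr * X.T @ d1
            self.b1 -= lr * d1.sum(axis=0)
```

A classifier of the kind described below would be constructed as, e.g., TanhNet(46, 5, 2) and trained with targets of +1 for the chosen enzyme and -1 for the other two.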
The input to the neural networks is a set of 46 length-independent signal processing features as described above in relation to Equations 1-8. The first set of 40 features measures the complexity and mobility of each amino acid. The amino acid sequence is converted to 0's and 1's for each amino acid (20 amino acids yield 20 binary sequences with 2 measures each, for a total of 40 measures). An example encoding for Alanine, A, is shown below:
Amino Acid Sequence: AVWCEVWCEAA
Binary Sequence: 10000000011
The second set of 3 features is complexity, mobility, and fractal dimension for the sequence as represented by the amino acid hydrophobicity values. Each amino acid is replaced by its experimental hydrophobicity value to create a new sequence. The features are calculated from this new sequence.
The third set of 3 features is similar to the above 3 features except that the sequence used is the experimental solubility values from another source.
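The binary encoding and its two per-amino-acid measures can be sketched as follows. We assume here that "complexity" and "mobility" are the Hjorth-style measures (mobility as the ratio of the standard deviation of the first difference to that of the signal, and complexity as the ratio of mobilities); the exact forms used are those of Equations 1-6 above, and the helper names are our own.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def binary_encode(sequence, amino_acid):
    """Indicator sequence: 1 where the target amino acid occurs, else 0."""
    return np.array([1.0 if aa == amino_acid else 0.0 for aa in sequence])

def mobility(x):
    """Hjorth-style mobility: sqrt(var of first difference / var of signal)."""
    return np.sqrt(np.var(np.diff(x)) / np.var(x))

def complexity(x):
    """Hjorth-style complexity: mobility of the first difference / mobility."""
    return mobility(np.diff(x)) / mobility(x)

def binary_features(sequence):
    """40 features: complexity and mobility of each amino acid's indicator."""
    feats = []
    for aa in AMINO_ACIDS:
        b = binary_encode(sequence, aa)
        if np.var(b) == 0.0:        # amino acid absent from the sequence
            feats.extend([0.0, 0.0])
        else:
            feats.extend([complexity(b), mobility(b)])
    return feats
```

The hydrophobicity and solubility encodings replace each residue with a property value and apply the same complexity, mobility, and fractal dimension measures, completing the 46-feature vector.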
A comparison of various encodings and their resulting features is shown in Figures 3-8. These figures plot the average measure with tick marks at plus and minus one standard deviation of the measure. This shows the relative range and
tightness of the measures. Each plot compares a single encoding and single measure for each of the three classes of protein studied.
The first two figures show fractal dimension calculated with the solubility and hydrophobicity encodings. One immediate observation is that the hydrophobicity values are much tighter than the solubility values (note that the range of Y values is not the same for all figures). Another observation is that the calculated fractal dimensions are unreasonable since they are over 2.0 (they should be between 1.0 and 2.0). The error in fractal dimension could be due to the short sequences (with few data points, k is small in Equation 7). However, removing the fractal dimension values from the set of features does not improve the neural network classifiers, so it is kept. Figure 3 is a plot showing one standard deviation of the Higuchi fractal dimension measures for the solubility encoding of the three protein classes.
Also notable is the amount of overlap of these measures: the Isomerase and Lipase enzymes have the smallest overlap, whereas the Protease enzyme measures overlap both the Isomerase and Lipase measures almost equally. This pattern is similar for both the solubility and the hydrophobicity encodings. An interpretation of the relationship of these measures is that Lipase enzymes are not as self-similar, are more linear, or are smoother than Isomerase enzymes with respect to hydrophobicity or solubility. This pattern also indicates the difficulty of differentiating Protease enzymes from Lipase or Isomerase enzymes with this measure. Figure 4 is a plot showing one standard deviation of the Higuchi fractal dimension measures for the hydrophobicity encoding of the three protein classes studied. Figure 5 is a plot showing one standard deviation of the complexity measures for the solubility encoding of the three protein classes studied. And Figure 6 is a plot showing one standard deviation of the complexity measures for the hydrophobicity encoding of the three protein classes studied.
The complexity measures shown in Figures 5 and 6 show a large difference depending on the encoding used. The solubility encoding differentiates the Lipase enzyme from the Protease and Isomerase enzymes, but the hydrophobicity encoding does not readily differentiate any enzyme (they all have similar averages, and their first standard deviations fall within the first standard deviation of the Isomerase enzyme).
Figure 7 is a plot showing one standard deviation of mobility measures for the solubility encoding of three protein classes studied. Figure 8 is a plot showing one
standard deviation of mobility measures for the hydrophobicity encoding of three protein classes studied.
The mobility measures are similar to the complexity measures in that the Lipase enzymes are the most easily differentiated. Also, the solubility encoding shows greater differentiation for Lipase than the hydrophobicity encoding. As can be seen with the figures above, there is considerable overlap with these features for the three classifications of enzymes.
For the neural classifiers used for length-independent classification, the input vectors form a set of 46 measures developed from the corresponding sequence. Each measure is calculated for a given protein sequence (of arbitrary length) and fed to a neural network. These measures are calculated for sequences formed by the hydrophobicity encoding and the solubility encoding of the protein sequence. The measures are calculated only on the actual signal portion of the sequence, i.e., there is no zero padding.
Many network architectures were explored, but only the best performing with the fewest neurons are discussed. The network architecture is trained three times using the same training/test data sets to determine the variability due to the randomized starting weights of the network. The percentages that are correctly classified are shown in Tables 8-11.
Table 8. Isomerase network: percentage correctly classified.
The Isomerase classification network has 46 input nodes, 5 hidden nodes, and 2 output nodes and was trained for 50 epochs during each run. The resulting networks are not overtrained (low difference between testing and training results) and generalize well (good results for testing data).
Table 9. Lipase network: percentage correctly classified.
The Lipase network performs even better than the Isomerase network. It also has 46 input nodes, 5 hidden nodes, and 2 output nodes. It was trained for only 20 epochs (this network overfitted the data when trained for 50 epochs). The resulting network generalizes well and is not overtrained.
An optimal Protease classification network is difficult to find. The networks are easily overtrained even when relatively large numbers of neurons are used (up to a 10-10-2 network was tried, as well as a 20-2 network). Typically, the training data is twice as accurate as the testing data for the Protease sequences. In Table 10, run 1, the protease training data is 48.6% correct and the testing data is 24.3% correct (2:1). The network in Table 10 is trained for only 20 epochs (in an attempt not to overfit the data) but is comparable in accuracy to the network in Table 11, which is trained for 300 epochs. The training data for run 1 of Table 11 is 97.3% correct, while the testing data is only 56.8% correct (still roughly 2:1).
Table 11 : Protease network trained for 300 epochs.
The variability among the runs is much greater for the Protease networks than for the Isomerase or Lipase networks. These networks simply cannot generalize very well; it should be noted, however, that the number of training samples for Protease is much smaller than for the other enzymes.
In another embodiment, a network is created and tested using the best of the networks of the embodiments described above. This cascade first attempts to classify the Lipases; any non-classified data is passed to the Isomerase classifier, and finally to the Protease classifier. The results of this network are shown in Table 12.
Table 12: Results for Classifier Built from Three Best Single Trained Classifiers
The feature vectors not classified by any of the three neural networks are counted as "None".
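The cascade just described can be sketched as follows. The three network objects are assumed here to be callables returning a score in [-1, 1], with a score above the threshold meaning "this class"; the helper name and threshold are illustrative.

```python
def cascade_classify(features, lipase_net, isomerase_net, protease_net,
                     threshold=0.0):
    """Try the Lipase, Isomerase, and Protease networks in order and
    return the first class whose score exceeds the threshold; feature
    vectors rejected by all three networks are counted as "None"."""
    for name, net in (("Lipase", lipase_net),
                      ("Isomerase", isomerase_net),
                      ("Protease", protease_net)):
        if net(features) > threshold:
            return name
    return "None"
```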
The Isomerase and Lipase enzymes (of length 100-200) are easily classified using these 46 features, whereas the Protease enzymes are not as easily classified. The Isomerase and Lipase classifiers generalize well and are not overtrained (they perform equally well with training and testing data). The new features of complexity and mobility calculated for amino acid binary encoded sequences resulted in much better accuracy than the previously studied classifiers.
Embodiments of the present invention may be utilized in a number of commercial applications. For example, one embodiment of the present invention is utilized in a drug discovery and design process. Figure 9 illustrates an implementation of an embodiment of the present invention in a drug discovery process. In a drug discovery process, one or more physicians identify a disease 902. Biologists, biochemists, immunologists, or other scientists identify candidates for treating the disease 904. The pool of candidates is prepared for analysis by the computer or processor 906. The computer then performs an algorithm, such as the algorithm described in relation to Figures 1 and 2, to filter and rank the candidates 908. This process may be referred to as in silico drug screening.
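The filter-and-rank step 908 can be sketched as a simple score-and-sort over the candidate pool. Here score_fn stands for any classifier-derived suitability score (for example, a network output for the desired function); the helper is purely illustrative.

```python
def rank_candidates(candidates, score_fn, top_n=10):
    """Score each candidate protein and return the top_n highest-scoring
    ones, reducing the pool that must be tested in the lab."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]
```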
Once the best candidates are identified, they are tested and compared in the lab 910. The desired end product of the process is the best drug to treat the disease identified by the physician 912.
Another embodiment is utilized in the analysis of gene functions (functional sequencing of DNA). Various embodiments of the present invention are applicable to the process of designing proteins for chemical engineering applications, such as membrane design, controlling toxic chemicals, and improving food quality. Embodiments of the present invention may also be utilized in designing proteins for nanotechnology, providing the building blocks for nano-robots. In yet another embodiment of the present invention, available drugs are matched to the mutated strain of the human immunodeficiency virus (HIV) currently in a patient's body. Rather than providing the patient with a cocktail of various drugs or randomly selecting a particular drug, the physician uses an embodiment of the present invention to match the proper drug to the current version of the retrovirus in the patient. The foregoing description of the preferred embodiments of the invention has been presented only for the purpose of illustration and description and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Numerous modifications and adaptations thereof will be apparent to those skilled in the art without departing from the spirit and scope of the present invention.