WO2002036812A2

WO2002036812A2 - Nonlinear system identification for class prediction in bioinformatics and related applications

Info

Publication number: WO2002036812A2
Application number: PCT/CA2001/001547
Authority: WO
Inventors: Michael Korenberg
Original assignee: Michael Korenberg
Priority date: 2000-11-03
Filing date: 2001-11-05
Publication date: 2002-05-10
Also published as: AU1487402A; EP1466289A2; WO2002037202A2; WO2002036812A9; CA2465661A1; AU2002214871A1; WO2002036812A3

Abstract

The present invention provides a method for class prediction in bioinformatics based on identifying a nonlinear system that has been defined for carrying out a given classification task. Information characteristic of exemplars from the classes to be distinguished is used to create training inputs, and the training outputs are representative of the class distinctions to be made. Nonlinear systems are found to approximate the defined input/output relations, and these nonlinear systems are then used to classify new data samples. In another aspect of the invention, information characteristic of exemplars from one class are used to create a training input and output. A nonlinear system is found to approximate the created input/output relation and thus represent the class, and together with nonlinear systems found to represent the other classes, is used to classify new data samples.

Description

NONLINEAR SYSTEM IDENTIFICATION FOR CLASS PREDICTION

IN BIOINFORMATICS AND RELATED APPLICATIONS

The present application claims priority from U.S. provisional applications No. 60/245,236, entitled "Use of Fast Orthogonal Search and Other Model-Building Techniques for Interpretation of Gene Expression Profiles", filed Nov. 3, 2000, No. 60/249,462, entitled "Nonlinear System Identification for Class Prediction in Bioinformatics and Related Applications", filed Nov. 20, 2000, and No. 60/268,019, entitled "Prediction of Clinical Outcome Using Gene Expression Profiles", filed Feb. 13, 2001 , and Canadian Patent Application No. 2,325,225, entitled "Nonlinear System Identification for Class Prediction in Bioinformatics and Related Applications", filed Nov. 20, 2000, which applications are incorporated herein by this reference.

TECHNICAL FIELD The present invention pertains to a method for class prediction in bioinformatics based on identifying a nonlinear system that has been defined for carrying out a given classification task.

BACKGROUND OF THE INVENTION Many areas in bioinformatics involve class prediction, for example: (1) assigning gene expression patterns or profiles to defined classes, such as tumor and normal classes; (2) recognition of active sites, such as phosphorylation and ATP-binding sites, on proteins; (3) predicting whether a molecule will exhibit biological activity, e.g., in drug discovery, including the screening of databases of small molecules to identify molecules of possible pharmaceutical use; (4) distinguishing exon from intron DNA and RNA sequences, and determining their boundaries; and (5) establishing genotype/phenotype correlations, for example to optimize cancer treatment, or to predict clinical outcome or various neuromuscular disorders. A recent paper by Golub et al. (1999), "Molecular Classification of

Cancer: Class Discovery and Class , Prediction by Gene Expression Monitoring", in Science, vol. 286, pp. 531-537, describes a method for predicting the class of tissue samples based on simultaneous monitoring of expression levels of thousands of genes via oligonucleotide microarrays, and is incorporated herein by this reference. Each tissue sample is represented by a gene expression pattern, often called a gene expression profile, a set of quantitative expression levels of the selected genes. The Golub et al. (1999) method is statistically-based and requires a number of profiles known to belong to each of the classes to be distinguished. A voting scheme is set up based on a subset of "informative genes" and each new tissue sample is classified based on a vote total, provided that a "prediction strength" measure exceeds a predetermined threshold. When the prediction strength is low, the class of the sample is uncertain, and resort must be made to other methods.

Accordingly, there is a need for an improved method of predicting the class of a genetic sample when there are only a few known examples (exemplars) of the classes to be distinguished. The revised method preferably uses little training data to build a finite-dimensional nonlinear system that then acts as a class predictor. The class predictor can be combined with other predictors to enhance classification accuracy, or the created class predictor can be used to classify samples when the classification by other predictors is uncertain.

SUMMARY OF THE INVENTION

In accordance with a preferred embodiment of the present invention, there is provided a method for constructing a class predictor in the area of bioinformatics. The method includes the steps of selecting information characteristic of exemplars from the families (or classes) to be distinguished, constructing a training input with segments containing the selected information for each of the families, defining a training output to have a different value over segments corresponding to different families, and finding a system that will approximate the created input/output relation. In accordance with an embodiment of the invention, the characterizing information may be the expression levels of genes in gene expression profiles, and the families to be distinguished may represent normal and various diseased states.

In accordance with another aspect of the present invention, there is provided a method for using fast orthogonal search or other model-building techniques to select genes whose expression levels can be used to distinguish between different families.

In accordance with yet another aspect of the present invention, there is provided a method for classifying protein sequences into structure/function groups, which can be used for example to recognize active sites on proteins, and the characterizing information may be representative of the primary amino acid sequence of a protein or a motif.

In accordance with yet another aspect of the present invention, there is provided a method for predicting whether a molecule will exhibit biological activity, e.g., in drug discovery, including the screening of databases of small molecules to identify molecules of possible pharmaceutical use. In this case the characterizing information may represent properties such as molecular shape, the electrostatic vector fields of small molecules, molecular weight, and the number of aromatic rings, rotatable bonds, hydrogen-bond donor atoms and hydrogen-bond acceptor atoms.

In accordance with yet another aspect of the present invention, there is provided a method for distinguishing between exon and intron DNA and RNA sequences, and for determining their boundaries. In this case the characterizing information may represent a sequence of nucleotide bases on a given strand.

In accordance with yet another aspect of the present invention, there is provided a method for finding genotype/phenotype relationships and correlations. In this case, the characterizing information may represent factors such as pathogenic mutation, polymorphic allelic variants, epigenetic modification, and SNPs (single nucleotide polymorphisms), and the families may be various human disorders, e.g., neuromuscular disorders. In accordance with yet another aspect of the present invention, there is provided a method for constructing a class predictor in the area of bioinformatics by combining the predictors described herein with other predictors.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described, by way of example only, with reference to the drawings in which:

Figure 1 illustrates the form of the parallel cascade model used in classifying the gene expression profiles, proteomics data, and the protein sequences. Each L is a dynamic linear element, and each N is a polynomial static nonlinearity;

Figure 2 shows the training input x(/) formed by splicing together the raw expression levels of genes from a first ALL profile #1 and a first AML profile #28. The genes used were the 200 having greatest difference in expression levels between the two profiles. The expression levels were appended in the same relative ordering that they had in the profile;

Figure 3 shows the training output y(i) (solid line) defined as -1 over the ALL portion of the training input and 1 over the AML portion, while the dashed line represents calculated output z(/) when the identified parallel cascade model is stimulated by training input x(/); Figure 4A shows the training input x(i) formed by splicing together the raw expression levels of genes from the first "failed treatment" profile #28 and first "successful treatment" profile #34; the genes used were the 200 having greatest difference in expression levels between the two profiles;

Figure 4B shows that the order used to append the expression levels of the 200 genes caused the auto-covariance of the training input to be nearly a delta function; indicating that the training input was approximately white;

Figure 4C shows the training output y(i) (solid line) defined as -1 over the "failed treatment" portion of the training input and 1 over the "successful treatment" portion; the dashed line represents calculated output z(/) when the identified model is stimulated by training input x(/);

Figure 5A shows the impulse response functions of linear elements ₂ (solid line), L, (dashed line), ₆ (dotted line) in the 2^nd, 4^th, 6^th cascades of the identified model;

Figure 5B shows the corresponding polynomial static nonlinearities N₂ (diamonds), Λ/₄ (squares), and Λ/₆ (circles) in the identified model. DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

(A) USE OF PARALLEL CASCADE IDENTIFICATION TO PREDICT CLASS OF GENE EXPRESSION PROFILES

In a first embodiment of the present invention, one or more representative profiles, or portions of profiles, from the families to be distinguished are concatenated (spliced) in order to form a training input. The corresponding training output is defined to have a different value over input segments from different families. The nonlinear system having the defined input/output relation would function as a classifier, and at least be able to distinguish between the training representatives (i.e., the exemplars) from the different families. A parallel cascade or other model is then found to approximate this nonlinear system. While the parallel cascade model is considered here, the invention is not limited to use of this model, and many other nonlinear models, such as Volterra functional expansions, and radial basis function expansions, can instead be employed. The parallel cascade model used here (FIG. 1) comprises a sum of cascades of dynamic linear and static nonlinear elements. "Dynamic" signifies that the element possesses memory: its output at a given instant / depends not only upon its input x at instant / but upon past input values at instants /-1 ,... , i-R (memory length = R+l). Every nonlinear element is a polynomial, so that each cascade output z_k(i) , and hence the overall model output z(/), reflect high-order nonlinear interactions of gene expression values. For generality of approximation, it suffices if each cascade comprises a dynamic linear element followed by a static nonlinearity (as in FIG. 1), but additional alternating dynamic linear and static nonlinear elements could optionally be inserted in any path: see Korenberg (1991) "Parallel Cascade Identification and Kernel Estimation for Nonlinear Systems", in Annals of Biomedical Engineering, vol. 19, pp. 429-455, which is incorporated herein by reference.

The memory length of the nonlinear model may be taken to be considerably shorter than the length of the individual segments that are spliced together to form the training input. Suppose that x(/) is the input, and y(i) the output, of a discrete-time nonlinear system, where / = 1 , ... , /. If y(i) depends only upon the input values x(i), x(/-1), ..., x(i-R), i.e., back to some maximum finite delay R, then the system is said to have a finite memory of length (R+1). (Alternatively, it could be said that the memory length is only R, because for a system with no memory the output y at instant / depends only upon the input x at that same instant.) If the assumed memory length for the model to be identified is shorter than the individual segments of the training input, the result is to increase the number of training examples. This is explained here in reference to using a single exemplar from each of two families to form the training input, but the same principle applies when more representatives from several families are spliced together to create the input. Note that, in the case of gene expression profiles, the input values will represent gene expression levels, however it is frequently convenient to think of the input and output as time-series data.

EXAMPLE 1 Consider distinguishing between profiles from two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), as in the paper by Golub et al. (1999). Each of their profiles contained the expression levels of 6817 human genes, but because of duplicates and additional probes in the Affymetrix microarray, in total 7129 gene expression levels were present in each profile. One simple approach is to simply splice together entire 7129 point profiles from the two families to form the training input. However, many or most of the genes in an entire profile might not be relevant to distinguishing between ALL and AML classes, so use of entire profiles could make the training of effective parallel cascade models more difficult. Moreover, the Golub et al. (1999) paper teaches the importance of finding "informative genes ... most closely correlated with ALL-AML distinction in the known samples".

Therefore, the first ALL profile (#1 of Golub et al. training data) and the first AML profile (#28 of their training data) were compared and 200 genes that exhibited that largest absolute difference in expression levels were located. In another embodiment of the present invention, a different number of genes may be located and used. For each of these genes, the absolute value of the difference between the corresponding raw scores on profiles #1 and 28 ranked in the top 200 of the 7129 genes. The raw expression values for these 200 genes were juxtaposed to form the ALL segment to be used for training, and the AML segment was similarly prepared. The 200 expression values were appended in the same relative order that they had in the original profile, and this is true for all the examples described in this patent application. However, it has been found that other ordering schemes may be beneficial, for example those that cause the autocovariance of the training input to be almost a delta (i.e., discrete impulse) function. This is discussed further in the section concerned with predicting treatment response and in the discussion of STEPS that follow Example 5. Each time the same order was used in juxtaposing the expression values to prepare a segment. The two information-rich segments were then spliced together to form a 400 point training input x(/) (FIG. 2), and the corresponding output y(i) (FIG. 3, solid line) was defined to be -1 over the ALL segment, and 1 over the AML segment.

A parallel cascade model was then identified with a memory length (R+l) = 10, and 5^th degree polynomial static nonlinearities, with 19 cascade paths in total. The values of 200 for the segment length and 10 for the memory length were chosen because Golub et al. (1999) reported that between 10 and 200 informative genes could be used in their procedure with excellent results (see further discussion below).

A 5^th degree polynomial was chosen for each static nonlinearity because that was the smallest degree found effective in a recent protein sequence classification application (Korenberg et al., 2000a, "Parallel Cascade Identification as a Means for Automatically Classifying Protein Sequences into Structure/Function Groups", vol. 82, pp. 15-21, which is incorporated herein by reference, attached hereto as Appendix A). Further details about parallel cascade identification are given below in the section involving protein sequence classification, and in Korenberg (1991).

Note that in finding the impulse responses of the dynamic linear elements in FIG. 1 , the procedure in Korenberg (1991) creates candidate impulse response functions, involving random elements which can be controlled by a random number generator. (A threshold test is used to decide which candidate cascades will be accepted into the model.) In all examples in this application, the same random number sequence was used every time, so that differences in classification due to a change in model parameters such as memory length or polynomial degree would be readily evident. Alternatively, if different random sequences are used in different trials of parameter values, then several models should be obtained for each possible choice of parameter values. In this way, changes in classification accuracy due to changes in parameter settings, rather than changes in the random sequence used, will become evident. Alternatively, an appropriate logical deterministic sequence, rather than a random sequence, can be used in creating candidate impulse responses: see Korenberg et al. (2001) "Parallel cascade identification and its application to protein family prediction", J. Biotechnol., Vol. 91 , 35-47, which is incorporated herein by this reference.

A maximum of 20 cascade paths was allowed to ensure that the number of variables introduced into the parallel cascade model was significantly less than the number of output points used in the identification, as also discussed in Korenberg et al. (2000a). The cascade paths were selected using a criterion (Korenberg, 1991) based on a standard correlation test. The criterion requires specification of a threshold value T (see Eq. (11) of Korenberg et al., 2000b, "Automatic Classification of Protein Sequences into Structure/Function Groups via Parallel Cascade Identification: A Feasibility Study", Annals of Biomedical Engineering, vol. 28, pp. 803-811 , which is incorporated herein by this reference, attached hereto as Appendix B), here set equal to 6. This value was chosen so that very nearly the maximum number (20) of cascade paths was selected out of 2000 candidate cascades. There is no special virtue in setting 2000 for the number of candidates, and some other large number could be used instead. Above the parameter values for memory length, polynomial degree, maximum number of cascades in the model, and threshold T were chosen from other work in bioinformatics. However, the choices could also be made by testing the effectiveness of trial values over a small verification set (Korenberg et al., 2000a) and this is preferable when the amount of data permits, and is illustrated in other examples below.

In this example, the identified model had a mean-square error (MSE) of 65.11 %, expressed relative to the variance of the output signal. Figure 3 shows that when the training input x(/) was fed through the identified parallel cascade model, the resulting output z(/) (dashed line) is predominately negative over the ALL segment, and positive over the AML segment, of the input. Only portions of the first ALL and the first AML profiles had been used to form the training input. The identified parallel cascade model was then tested on classifying the remaining ALL and AML profiles in the first set used for training by Golub et al. (1999). To classify a profile, the expression levels corresponding to the gehes selected above are appended in the same order as used above to form a segment for input into the identified parallel cascade model, and the resulting model output is obtained. If the mean of the model output is less than zero, the profile is assigned to the ALL class, and otherwise to the AML class. In calculating the mean output, the averaging preferably begins on the (R+1)-th point, since this is the first output point obtained with all necessary delayed input values known. Other classification criteria, for example based on comparing two MSE ratios (Korenberg et al., 2000b), could also be employed.

The classifier correctly classified 19 (73%) of the remaining 26 ALL profiles, and 8 (80%) of the remaining 10 AML profiles in the first Golub et al. set. The classifier was then tested on an additional collection of 20 ALL and 14 AML profiles, which included a much broader range of samples. Over this second set, the parallel cascade model correctly classified 15 (75%) of the ALL and 9 (64%) of the AML profiles. No normalization or scaling was used to correct expression levels in the test sequences prior to classification. It is important to realize that these results were obtained after training with an input created using only the first ALL and first AML profiles in the first set.

Means and standard deviations for the training set are used by Golub et al. in normalizing the log expression levels of genes in a new sample whose class is to be predicted. Such normalization may have been particularly important for their successfully classifying the second set of profiles which Golub et al. (1999) describe as including "a much broader range of samples" than in the first set. Since only one training profile from each class was used to create the training input for identifying the parallel cascade model, normalization was not tried here based on such a small number of training samples. Scaling of each profile, for example by dividing each expression level by the mean of all levels has been recommended (see, for example, Alon et al., 1999, "Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays", Proc. Natl. Acad. Sci. USA, vol. 96, pp. 6745-6750, Cell Biology, which is incorporated herein by reference). Such scaling is a way of compensating for variations between profiles (Alon et al., 1999). Alternatively, each expression value can be divided by the root-mean-square of all the expression levels in the profile. Other forms of scaling will be well-known to persons skilled in the art. However, no scaling of the data obtained from the Golub et al web site (Datasets: http://www-genome.wi.mit.edu/MPR/data_set_ALL_AML.html) was used here, or below where accurate classification is achieved of the second set of profiles in Golub et al. (1999). It should be noted that Golub et al. had already re-scaled the intensity values so that overall intensities for each chip were equivalent.

The above parallel cascade classification results were obtained using raw gene expression levels. The results set out above may be slightly improved by instead using the log of the expression level, to the base 10. Before the log is taken, any negative or very small raw expression value is increased to 100. When this was done in the method of present invention, before identifying the parallel cascade model, on the same two training profiles used previously, 20 (77%) of 26 ALL and 9 (90%) of 10 AML profiles were correctly classified. Again the parallel cascade had a memory length of 10, and 5^th degree polynomial static nonlinearities. A higher threshold 7 = 10 was used, only 7 cascade paths were selected, and the resulting MSE was 80.67%.

EXAMPLE 2

Use of multiple training exemplars with raw expression levels

The above results for parallel cascade classification of gene expression profiles were obtained from using 200 genes' expression levels of one profile from each class to construct the training input. In contrast, the results reported by Golub et al. (1999) were for training with all, or all but one, of the 27 ALL and 11 AML exemplars in their first set. All but one exemplar in the first set were used for training when they predicted the class of the omitted profile in the same set (cross-validation), repeating this procedure until each of the profiles had been tested. All the exemplars were used for training when they predicted the class of profiles in the second set. Parallel cascade classifiers will not always be successful if the training input is based on only a single example of each of the classes to be distinguished. The single examples must be representative of other class members.

Multiple 200-point examples from each class can be appended to form a more information-rich training input, with the output again defined to have a different value over each segment from a different class. Suppose again that the nonlinear system to be identified has a memory of length (R+l), i.e., its output y(i) depends only upon x(/), ... ,x(i-R). Then in carrying out the system identification, we preferably exclude from calculation the first R points of each segment joined to form the input (Korenberg et al., 2000 a, b). This avoids introducing error due to the transition regions where the segments are joined together.

To continue the present example, the first 11 of the 27 ALL profiles in the first set of Golub et al. (1999) were each used to extract a 200-point segment characteristic of the ALL class. The first 5 profiles (i.e., #28 - #32) of the 11 AML profiles in the first set were similarly used, but in order to extract 11 200-point segments, these profiles were repeated in sequence #28 - #32, #28 - #32, #28. The 200 expression values were selected as follows. For each gene, the mean of its raw expression values was computed over the 11 ALL profiles, and the mean was also computed over the 11 AML profiles (which had several repeats). Then the absolute value of the difference between the two means was computed for the gene. The 200 genes having the largest of such absolute values were selected. The 11 ALL and 11 AML segments were concatenated to form the training input, and the training output was again defined to be -1 over each ALL segment and 1 over each AML segment.

Because there actually were 16 different 200-point segments, the increased amount of training data enabled many more cascades to be used in the model. To provide significant redundancy (more output points used in the identification than variables introduced in the parallel cascade model), a limit of 100 cascades was set for the model. It was found that if each polynomial in the cascade had a degree of 9, and if a threshold 7 = 12 was used, then nearly 100 cascades would be accepted out of 2000 candidates. Again, a memory length of 10 was used for each of the dynamic linear elements in the cascades. The identified model had a MSE of 62.72%. Since 11 ALL and 5 AML profiles had been used in constructing the training input, 16 ALL and 6 AML profiles remained for testing. The identified model correctly recognized 15 (93.8%) of the ALL and 5 (83.3%) of the AML test profiles. The model flawlessly classified the ALL and AML profiles used in constructing the training input.

The above examples may be briefly summarized in the following steps: Step 1 - Compare the gene expression levels in the training profiles and select a set of genes that assist in distinguishing between the classes.

Step 2 - Append the expression levels of selected genes from a given profile to produce a segment representative of the class of that profile. Repeat for each profile, maintaining the same order of appending the expression levels. Step 3 - Concatenate the representative segments to form a training input.

Step 4 - Define an input/output relation by creating a training output having values corresponding to the input values, where the output has a different value over each representative segment from a different class.

Step 5 - Identify a parallel cascade model (FIG. 1) to approximate the input/output relation.

Step 6 - Classify a new gene expression profile by (a) appending the expression levels of the same genes selected above, in the same order as above, to produce a segment for input into the identified parallel cascade model; (b) apply the segment to the parallel cascade model and obtain the corresponding output; and,(c) if the mean of the parallel cascade output is less than zero, then assign the profile to the first class, and otherwise to the second class.

EXAMPLE 3

Use of multiple training exemplars with log expression levels

Almost all of the above results were obtained using raw gene expression levels. However, as noted above, in place of the raw expression level, the logarithm of the expression level, to the base 10. Before the log is taken, any negative or very small raw value is increased to 100. Results are provided below for use of multiple training exemplars from each class with the log of the expression level.

The first 15 ALL profiles (#1 - #15 of Golub et al. first data set) were each used to extract a 200-point segment characteristic of the ALL class, as described immediately below. Since there were only 11 distinct AML profiles in the first Golub et al. set, the first 4 of these profiles were repeated, to obtain 15 profiles, in sequence #28 - #38, #28 - #31. For each gene, the mean of its raw expression values was computed over the 15 ALL profiles, and the mean was also computed over the 15 AML profiles. Then the absolute value of the difference between the two means was computed for the gene. The 200 genes having the largest of such absolute values were selected. This selection scheme is similar to that used in Golub et al. (1999) but in the present application, selection was made using raw expression values and not their logs, and the difference in the means was not measured relative to standard deviations within the classes. However, once a gene was selected, the log of its expression level was used, rather than the raw value, in forming the representative segment for each profile. The 15 ALL and 15 AML segments were concatenated to form the training input, and the training output was defined to be -1 over each ALL segment and 1 over each AML segment. Because there actually were 26 different 200-point segments, the increased amount of training data enabled many more cascades to be used in the model, as compared to the use of one representative segment from each class. To have significant redundancy (more output points used in the identification than variables introduced in the parallel cascade model), a limit of 200 cascades was set for the model. Note that not all the variables introduced into the parallel cascade model are independent of each other. For example, the constant terms in the polynomial static nonlinearities can be replaced by a single constant. However, to prevent over-fitting the model, it is convenient to place a limit on the total number of variables introduced, since this is an upper bound on the number of independent variables.

In Example 1 , when a single representative segment from each of the ALL and AML classes had been used to form the training input, the parallel cascade model to be identified was assumed to have a memory length of 10, and 5^th degree polynomial static nonlinearities. When log of the expression level was used instead of the raw expression level, the threshold 7 was set equal to 10. These parameter values are now used here, when multiple representative segments from each class are used in the training input with log expression levels rather than the raw values.

If the assumed memory length of the model is (R+l), then as noted above, in carrying out the identification, those output points corresponding to the first R points of each segment joined to form the training input are preferably excluded from computation. Since there were 26 different 200-point segments, and since the assumed memory length was 10, the number of output points used in the identification was considered (for the present purpose of preventing over-fitting) to be 26 x (200 - 9) = 4966. Since each cascade in the model will have a dynamic linear element with memory length 10, and a polynomial static nonlinearity of degree 5, then for a maximum of 200 cascades, the number of variables introduced will be at most 200 x (10 + 5 +1) = 3200. This is significantly less than the number of output points to be used in the identification, and avoids over-fitting the model.

The identified parallel cascade model had a mean-square error of about 71.3%, expressed relative to the variance of the output signal. While a limit of 200 cascades had been set, in fact, with the threshold 7 = 10, only 100 cascades were selected out of 2000 candidates. There is no special virtue in setting 2000 for the number of candidates, and some other large number could be used instead.

Classification results for ALL vs AML

The representative 200-point segments for constructing the training input had come from the first 15 of the 27 ALL profiles, and all 11 of the AML profiles, in the first data set from Golub et al. (1999). The performance of the identified parallel cascade model was first investigated over this data set, using two different decision criteria.

For each profile to be classified, the log of the expression levels corresponding to the 200 genes selected above are appended, in the same order as used above, to form a segment for input into the identified parallel cascade model. The resulting model output z(/), = 0, ... ,199, is obtained. The first decision criterion examined has already been used above, namely the sign of the mean output. Thus, if the mean of the model output was negative, the profile was assigned to the ALL class, and if positive to the AML class. In calculating the mean output, the averaging began on the 10^th point, since this was the first output point obtained with all necessary delayed input values known.

The second decision criterion investigated is based on comparing two MSE ratios and is mentioned in the provisional application (Korenberg, 2000a). This criterion compares the MSE of the model output z(i) from -1 , relative to the corresponding MSE over the ALL training segments, with the MSE of z(i) from 1 , relative to the MSE over the AML training segments. In more detail, the first ratio, ri, is

(z(i) +D² where the overbar denotes the mean (average over /), and βι is the MSE of the model output from -1 over the ALL training segments. The second ratio, r₂, is

(z( -l)² ^e2 where e₂ is the MSE of the model output from 1 over the AML training segments. Each time an MSE is computed, the averaging begins on the 10^th point, since the model has a memory length of 10 for this classification task. If r-[ is less than r₂, then the profile is classified as ALL, and if greater, as AML.

When the two decision criteria were compared over the 27 ALL and 11 AML profiles in the first Golub et al. (1999) data set, the second criterion, based on comparing MSE ratios, resulted in higher classification accuracy.

Hence the latter criterion was chosen to be used in class prediction over the second (independent) data set from Golub et al. (1999) that included a much broader range of samples. The parallel cascade model correctly classified all 20 of the ALL profiles, and 10 of the 14 AML profiles, in the second set.

In order to ensure that the successful classification over this set was not a fortuitous result of reusing the threshold value 7 = 10 from Example 1 , model identification and testing were repeated for a variety of different threshold values. Each time the same training input, which used multiple representative segments from each class, was employed. The threshold 7 was successively set equal to each integer value from 5 to 11 , with memory length fixed at 10, and polynomial degree at 5. The second decision criterion, based on MSE ratios, was used. The resulting parallel cascade models all had identical accuracy over the first data set, making one error on the 27 ALL and one error on the 11 AML profiles.

On the second data set, all the models successfully classified all 20 of the ALL profiles. Of the 14 AML profiles, the models' performance ranged from 8 - 12 correct classifications. The poorest score of 6 errors, for 7 = 11 , came from the model with the largest mean-square error when identified, about 73.3%, having 83 cascades. The best score of 2 errors, for 7 = 7, came from a model with a mean-square error of about 68.7%, having 137 cascades. In summary, all the models trained using 15 ALL and 11 AML profiles from the first Golub et al. data set made correct class predictions on the second data set except for 2 - 6 profiles. The model for threshold 7 = 7 stood out as the most robust as it had the best performance over the first data set using both decision criteria (sign of mean output, and comparing MSE ratios) of values nearest the middle of the effective range for this threshold. More importantly, the above accuracy results from using a single classifier. As shown in the section dealing with use of fast orthogonal search and other model-building techniques, accuracy can be significantly enhanced by dividing the training profiles into subsets, identifying models for the different subsets, and then using the models together to make the classification decision. This principle can also be used with parallel cascade models to increase classification accuracy.

DISCUSSION OF THE ABOVE EXAMPLE

The described nonlinear system identification approach utilizes little training data. This method works because the system output value depends only upon the present and a finite number of delayed input (and possibly output) values, covering a shorter length than the length of the individual segments joined to form the training input. This requirement is always met by a model having finite memory less than the segment lengths, but applies more generally to finite dimensional systems. These systems include difference equation models, which have fading rather than finite memory. However, the output at a particular "instant" depends only upon delayed values of the output, and present and delayed values of the input, covering a finite interval. For example the difference equation might have the form:

y(/) = F[ (/-1), ... ,y(/-/ι), x(/), ... ,x(/-/₂)] So long as the maximum of the output delay l and the input delay /₂ is less than the number of points in each input segment, we derive multiple training examples from each segment joined to form the input.

To illustrate, the parallel cascade model was assumed above to have a memory length of 10 points, whereas the ALL and AML segments each comprised 200 points. Having a memory length of 10 means that we assume it is possible for the parallel cascade model to decide whether a segment portion is ALL or AML based on the expression values of 10 genes. Thus the first ALL training example for the parallel cascade model is provided by the first 10 points of the ALL segment, the second ALL training example is formed by points 2 to 11 , and so on. Hence each 200-point segment actually provides 191 training examples, in total 382 training examples, and not just two, provided by the single ALL and AML input segments. Why was each segment taken to be 200 points, and the memory length 10 points? This was because, as noted above, the Golub et al. (1999) article reported that extremely effective predictors could be made using from 10 to 200 genes. In other embodiments of the present invention, a different number of points may be used for each segment or a different memory length, or both, may be used.

Each training exemplar can be usefully fragmented into multiple training portions, provided that it is possible to make a classification decision based on a fragmented portion. Of course the fragments are overlapping and highly correlated, but the present method gains through training with a large number of them, rather than from using the entire exemplar as a single training example. This use of fragmenting of the input segments into multiple training examples results naturally from setting up the classification problem as identifying a finite dimensional nonlinear model given a defined stretch of input and output data. However, the principle applies more broadly, for example to nearest neighbor classifiers. Thus, suppose we were given several 200-point segments from two classes to be distinguished. Rather than using each 200- point segment as one exemplar of the relevant class, we can create 191 10- point exemplars from each segment.

Moreover, the use of fragmenting enables nearest neighbor methods as well as other methods such as linear discriminant analysis, which normally require the class exemplars to have equal length, to work conveniently without this requirement. Thus, even if some of the original exemplars have more or less than, e.g., 200 points, they will still be fragmented into, e.g., 10-point portions that serve as class examples. To classify a query profile or other sequence, it is first fragmented into the overlapping 10-point portions. Then a test of similarity (e.g. based on a metric such as Euclidean distance) is applied to these fragmented portions to determine the membership of the query sequence.

Note that setting up the class predictor problem as the identification of a nonlinear system, with a memory length less than the shortest of the segments joined to form the training input, is a different approach from training an artificial neural network to predict class. In the latter case, the neural network is trained by simply presenting it with numerous examples of class members. There is no fragmentation of the exemplars to create multiple training examples from each using the sliding window approach described above for the case of system identification. As already noted, this fragmentation occurs naturally from having the nonlinear system's memory length shorter than the segments joined to form the training input. In addition, while we have considered the nonlinear system to be causal (i.e., its output can depend on present and lagged input values, but not advanced values) this is not essential, and systems also having finite anticipation can be considered for the classifier.

To identify genes that effectively distinguish between the classes, the absolute difference in either the raw expression level or the log of the expression level may be used, as described above. Alternatively, clustering of genes using the method of Alon et al. (1999) "reveals groups of genes whose expression is correlated across tissue types". The latter authors also showed that "clustering distinguishes tumor and normal samples even when the genes used have a small average difference between tumor and normal samples". Hence clustering may also be used to find a group of genes that effectively distinguishes between the classes.

Alternatively, fast orthogonal search (Korenberg, 1989a, "A Robust Orthogonal Algorithm for System Identification and Time Series Analysis", in Biological Cybernetics, vol. 60, pp. 267-276; Korenberg 1989b, "Fast Orthogonal Algorithms for Nonlinear System Identification and Time-Series Analysis", in Advanced Methods of Physiological System Modeling, vol. 2, edited by V.Z. Marmarelis, Los Angeles: Biomedical Simulations Resource, pp. 165-177, both of which are incorporated herein by reference), iterative fast orthogonal search (Adeney and Korenberg, 1994, "Fast Orthogonal Search for Array Processing and Spectrum Estimation", IEE Proc. Vis. Image Signal Process. , vol. 141 , pp. 13-18, which is incorporated herein by reference), or related model term-selection techniques can instead be used to find a set of genes that distinguish well between the classes, as described in the U.S. provisional application "Use of fast orthogonal search and other model-building techniques for interpretation of gene expression profiles", filed November 3, 2000. This is described next. USE OF FAST ORTHOGONAL SEARCH AND OTHER MODEL-

BUILDING TECHNIQUES FOR INTERPRETATION OF GENE EXPRESSION PROFILES

Here we show how model-building techniques, such as fast orthogonal search (FOS) and the orthogonal search method (OSM), can be used to analyze gene expression profiles and predict the class to which a profile belongs.

A gene expression profile p . can be thought of as a column vector containing the expression levels e_{I J }} i - l,...,I of / genes. We suppose that we have J of these profiles for training, so thaty = 1 ,... ,J. Each of the profiles p was created from a sample, e.g., from a tumor, belonging to some class.

The samples may be taken from patients diagnosed with various classes of leukemia, e.g., acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), as in the paper by Golub et al. (1999). Given a training set of profiles belonging to known classes, e.g., ALL and AML, the problem is to create a predictor that will assign a new profile to its correct class. The method described here has some similarity to that by Golub et al. (1999). However, the use of FOS, OSM, or an analogous form of model building is not disclosed in that paper. Indeed, the class predictor created here through the use of OSM is different and correctly classified more profiles in an independent set, using less training data, than in Golub et al. (1999).

First, consider distinguishing between two classes. To distinguish between ALL and AML classes, Golub et al. (1999) defined "an 'ideal expression pattern' corresponding to a gene that was uniformly high [1] in one class and uniformly low [0] in the other". They then selected a set of "informative genes" that were "chosen based on their correlation with the class distinction", and used these genes in a voting scheme to classify a given expression profile. In particular, the "set of informative genes to be used in the predictor was chosen to be the 50 genes most closely correlated with ALL-AML distinction in the known samples". A possible drawback of this approach is that some genes that might always have near-duplicate expression levels could be selected, which may unfairly bias the voting. Similar to how "ideal expression pattern" was defined (Golub et al., 1999), we define the output y(j) of an ideal classifier to be -1 for each profile P_j from the first class, and 1 for each profile p . from the second class. For each of the /^' genes, / = 1 ,... ,/, define g,(i) = ^ei,j >

(1) the expression level of the /-th gene in they-th training profile, / = 1.....J. We then wish to choose a subset g_m(j), m = l,...,M , of the / candidate functions g, (j) to approximate y(j) by

M yU) = _Ja_mg_m(j) + r(j) m=\

(2) where a_m is the coefficient for each term in the series, and where rtj) is the model error, so as to minimize the mean-square error (MSE)

(3)

The subset g_m (j), m = l,...,M , containing the model terms can be found by using FOS or OSM to search efficiently through the larger set of candidate functions g,(j) , / = 1 ,... ,/, as follows. In succession, we try each of the / candidate functions and measure the reduction in MSE if that candidate alone were best-fit, in the mean-square sense, to y(j), i.e., if M = 1 in Eq. (2). The candidate for which the MSE reduction would be greatest is chosen as the first term for the model in Eq. (2). To find the second term for the model, we set M = 2. Then each of the remaining l-l candidates is orthogonalized relative to the chosen model term. This enables the MSE reduction to be efficiently calculated were any particular candidate added as the second term in the model. We select the candidate for which the MSE reduction would be greatest to be the second model term, and so on.

In this scheme, candidate functions are orthogonalized with respect to already-selected model terms. After the orthogonalization, a candidate whose mean-square would be less than some threshold value is barred from selection (Korenberg 1989 a, b). This prevents numerical errors associated with fitting orthogonalized functions having small norms. It prevents choosing near duplicate candidate functions, corresponding to genes that always have virtually identical expression levels.

In fact, to increase efficiency, the orthogonal functions need not be explicitly created. Rather, FOS uses a Cholesky decomposition to rapidly assess the benefit of adding any candidate as a further term in the model. The method is related to, but more efficient than, a technique proposed by Desrochers (1981), "On an improved model reduction technique for nonlinear systems", Automatica, Vol. 17, pp. 407-409. The selection of model terms can be terminated once a pre-set number have been chosen. For example, since each candidate function g_t(j) is defined only for J values of j, there can be at most J linearly independent candidates, so that at most J model terms can be selected. (However, there will typically be far more than J candidates that are searched.) In addition, a stopping criterion, based on a standard correlation test (Korenberg 1989b), can be employed. Alternatively, various tests such as the Information Criterion, described in Akaike (1974) "A new look at the statistical model identification", IEEE Trans. Automatic Control, Vol. 19, pp. 716-723, or an F-test, discussed e.g. in Soderstrom (1977) "On model structure testing in system identification", Int. J. Control, Vol. 26, pp. 1-18, can be used to stop the process.

Once the model terms have been selected for Eq. (2), the coefficients a_m can be immediately obtained from quantities already calculated in carrying out the FOS algorithm. Further details^" about OSM and FOS are contained in the cited papers. The FOS selection of model terms can also be carried out iteratively (Adeney and Korenberg, 1994) for possibly increased accuracy.

Once the model of Eq. (2) has been determined, it can then function as a predictor as follows. If p_J+l is a novel expression profile to be classified, then let g_m (J + ϊ) be the expression level of the gene is this profile corresponding to the m-th model term in Eq. (2). (This gene is typically not the m-th gene in the profile, since g_m(j), the m-th model term, is typically not g_m (j) , the m-th candidate function.) Then evaluate M z ⁼ ∑a_mg_m J + l) , m=\

(4) and use a test of similarity to compare z with -1 (for the first class) and 1 (for the second class). For example, if z < 0, the profile may be predicted to belong to the first class, and otherwise to the second class.

Alternatively, suppose that MSE-_\ and MSE₂ are the MSE values for the training profiles in classes 1 and 2 respectively. For example, the calculation to obtain MSE^ is carried out analogously to Eq. (3), but with the averaging only over profiles in class 1. The MSE₂ is calculated similarly for class 2 profiles. Then, assign the novel profile p_J+1 to class 1 if

(z + l)² ^ -!)²

MSE, MSE.,

(5)

and otherwise to class 2. In place of using a mean-square test of similarity, analogous tests using absolute values or a power higher than 2 can be employed.

Alternatively, once the model terms for Eq. (2) have been selected by FOS, the genes to which they correspond can then be used as a set of "informative genes" in a voting scheme such as described by Golub et al. (1999).

Above, for simplicity, we have used the expression level of one gene to define a candidate function, as in Eq. (1). However, we can also define candidate functions in terms of powers of the gene's expression level, or in terms of crossproducts of two or more genes'expression levels, or the candidate functions can be other functions of some of the genes' expression levels. Also, in place of the raw expression levels, the logarithm of the expression levels can be used, after first increasing any negative raw value to some positive threshold value (Golub et al., 1999).

While FOS avoids the explicit creation of orthogonal functions, which saves computing time and memory storage, other procedures can be used instead to select the model terms and still conform to the invention. For example, an orthogonal search method (Desrochers, 1981 ; Korenberg, 1989 a, b), which does explicitly create orthogonal functions can be employed, and one way of doing so is shown in Example 4 below. Alternatively, a process that does not involve orthogonalization can be used. For example, the set of candidate functions is first searched to select the candidate providing the best fit to y(j), in a mean-square sense, absolute value of error sense, or according to some other criterion of fit. Then, for this choice of first model term, the remaining candidates are searched to find the best to have as the second model term, and so on. Once all model terms have been selected, the model can be "refined" by reselecting each model term, each time holding fixed all other model terms (Adeney and Korenberg, 1994).

In an alternative embodiment, one or more profiles from each of the two classes to be distinguished can be spliced together to form a training input. The corresponding training output can be defined to be -1 over each profile from the first class, and 1 over each profile from the second class. The nonlinear system having this input and output could clearly function as a classifier, and at least be able to distinguish between the training profiles from the two classes. Then FOS can be used to build a model that will approximate the input output behavior of the nonlinear system (Korenberg 1989 a, b) and thus function as a class predictor for novel profiles. It will also be appreciated that the class distinction to be made may be based on phenotype, for example, the clinical outcome in response to treatment. In this case, the invention described herein can be used to establish genotype phenotype correlations, and to predict phenotype based on genotype. Finally, while distinctions between two classes have been considered above, predictors for more than two classes can be built analogously. For example, the output y(j) of the ideal classifier can be defined to have a different value for profiles from different classes. Alternatively, as is well known, the multi-class predictor can readily be realized by various arrangements of two- class predictors.

EXAMPLE 4

The first 11 ALL profiles (#1 - #11 of Golub et al. first data set), and all

11 of the AML profiles (#28 - #38 of the same data set), formed the training data. These 22 profiles were used to build 10 concise models of the form in Eq. (2), which were then employed to classify profiles in an independent set in Golub et al. (1999). The first 7000 gene expression levels in each profile were divided into 10 consecutive sets of 700 values. For example, to build the first model, the expression levels of genes 1 - 700 in each training profile were used to create 700 candidate functions g_t(j) . These candidates were defined as in Eq. (1), except that in place of each raw expression level e. . , its log was used:

g_t U) = log_{10 tJ} , / = 1 , ... ,700 y^' = 1. ... ,22,

after increasing to 100 any raw expression value that was less. Similarly, genes 701 - 1400 of each training profile were used to create a second set of 700 candidate functions, for building a second model of the form in Eq. (2), and so on. The training profiles had been ordered so that the p ., fory = 1 ,... ,11 corresponded to the ALL profiles, and fory = 12, ... ,22 to the AML profiles. Hence the training output was defined as

= 1, = 12,...,22

The following procedure was used to find each model. First, each of the 700 candidate functions was tried as the only model term in Eq. (2) (with M

= 1), and its coefficient chosen to minimize the MSE given by Eq. (3). The candidate for which the MSE was smallest was selected as the first model term gj j) . For j = 1 , ... ,22, define a first orthogonal function as w_} (j) = (j) , with its coefficient c_x . Assume that the model already has M terms, for M ≥ 1 , and a (M+1)-th term is sought. Fory = 1 ,... ,22, let w_m (j) be orthogonal functions created from the model chosen terms g_m (j), m = 1 ,... ,M. Let c_m be the corresponding coefficients of the w_m (j), where these coefficients were found to minimize the mean-square error of approximating y(j) by a linear combination of the w_m (j), m = 1 ,... , . Then, for each candidate g_t(j) not already chosen for the model, the modified Gram-Schmidt procedure is used to create a function orthogonal to the w_m (j), m = 1 ,... ,M. Thus, fory = 1 22, set ₊ι U) = S, U) - O iWi 0^') where

And, forr=2,...,M, andy^'= 1,... ,22, set ll 0^") = +l^} (/) - ^O .Λ 0^") where

The function w^ (j) (which was created from the candidate g_l (j) ) is orthogonal to the w_m(j), m = 1,...,M. If the candidate g_t(j) were added to the model as the ( +1)-th term, then it follows readily that the model MSE would equivalents be

where

Therefore, the candidate g,(j) for which e is smallest is taken as the ( +1)-th model term g_M+ιU) , the corresponding w^ j) becomes w_w+1(/), and the corresponding c_M+1 becomes c_M+l. Once all model terms g_m(j) have been selected, their coefficients a_m that minimize the MSE can be readily calculated (Korenberg, 1989a,b). Each of the 10 models was limited to five model terms. The terms for the first model corresponded to genes #697, #312, #73, #238, #275 and the model %MSE (expressed relative to the variance of the training output) was 6.63%. The corresponding values for each of the 10 models are given in Table 1. The coefficients a_m of the model terms g_m (j) were obtained for each model by least-squares fitting to the training output y(j),j = 1 ,...,22.

The 10 identified models were then used to classify profiles in an independent, second data set from Golub et al. (1999), which contained 20 ALL and 14 AML test profiles. For each model, the value of z was calculated using Eq. (4) with M = 5, and with g_m (J + l) equal to the log of the expression level of the gene in the test profile corresponding to the m-th model term. For example, for model #1 , g^J + 1) , g₂(J + l) ,... ,g₅(J + ϊ) corresponded to the log of the expression level of gene #697, #312,..., #275 respectively, in the test profile. The values of z for the 10 models were summed; if the result was negative, the test profile was classified as ALL, and otherwise as AML.

By this means, all 20 of the test ALL, and all 14 of the test AML, profiles in the independent set were correctly classified. These classification results, after training with 22 profiles, compare favorably with those for the Golub et al. method. Golub et al. (1999) used the entire first data set of 38 profiles to train their ALL-AML predictor, which then was able to classify correctly all of the independent set except for five where the prediction strength was too low for a decision.

In order to investigate how small an individual model could be and still allow the combination to be an effective classifier, the procedure was repeated using the above sets of 700 candidate functions, but this time to build 10 two-term models, which are summarized in Table 2.

Again, a test profile was classified by calculating the value of z for each model using Eq. (4), this time with M = 2, and then adding the values of z for the 10 models; if the result was negative, the test profile was classified as ALL, and otherwise as AML. By this means, all 20 of the test ALL, and 13 of the 14 test AML, profiles in the independent set were correctly classified.

Moreover, it was found that considerably less than full training profiles sufficed to maintain this level of accuracy for the two-term models. The identical classification accuracy was obtained with only models #1 - #6 (whose construction required only the first 4200 genes of each training profile), or with additional models. The first 4200 gene expression levels in each of 22 profiles that sufficed for training here represent less than 40% of the data used by Golub et al. (1999).

It should be noted that individually the models made a number of classification errors, ranging from 1 - 17 errors for the two-term and from 2 - 11 for the five-term models. This was not unexpected since each model was created after searching through a relatively small subset of 700 expression values to create the model terms. However, the combination of several models resulted in excellent classification.

The principle of this aspect of the present invention is to separate the values of the training gene expression profiles into subsets, to find a model for each subset, and then to use the models together for the final prediction, e.g. by summing the individual model outputs or by voting. Moreover, the subsets need not be created consecutively, as above. Other strategies for creating the subsets could be used, for example by selecting every 10^th expression level for a subset.

This principle can increase classification accuracy over that from finding a single model using entire gene expression profiles. Note that here, since the output y(j) was defined only fory = 1 ,...,22, at most 22 independent terms could be included in the model (which would allow no redundancy), but identifying a number of models corresponding to the different subsets allows the contributions of many more genes to be taken into account. Indeed, searching through the first 7000 expression levels to find a five-term model, using the same 22 training profiles, resulted in a %MSE of 1.33%, with terms corresponding to genes #6200, 4363, 4599, 4554, and 3697. However, this model was not particularly accurate, misclassifying 4 of the 20 ALL, and 5 of the 14 AML, profiles in the independent set.

Finally, it will be appreciated that the principle of dividing the training profiles into subsets, and finding models for the subsets, then using the models in concert to make the final classification decisions, is not confined to use of FOS, OSM, or any particular model-building technique. For example, a parallel cascade model can be found for each subset as described earlier above, and then the models can be used together to make the final predictions.

PREDICTION OF TREATMENT RESPONSE USING GENE EXPRESSION PROFILES

This section concerns prediction of clinical outcome from gene expression profiles using work in a different area, nonlinear system identification. In particular, the approach can predict long-term treatment response from data of a landmark article by Golub et al. (1999), which to the applicant's knowledge has not previously been achieved with these data. The present paper shows that gene expression profiles taken at time of diagnosis of acute myeloid leukemia contain information predictive of ultimate response to chemotherapy. This was not evident in previous work; indeed the Golub et al. article did not find a set of genes strongly correlated with clinical outcome. However, the present approach can accurately predict outcome class of gene expression profiles even when the genes do not have large differences in expression levels between the classes. Prediction of future clinical outcome, such as treatment response, may be a turning point in improving cancer treatment. This has previously been attempted via a statistically-based technique (Golub et al., 1999) for class prediction based on gene expression monitoring, which showed high accuracy in distinguishing acute lymphoblastic leukemia (ALL) from acute myeloid leukemia (AML). The technique involved selecting "informative genes" strongly correlated with the class distinction to be made, e.g., ALL versus AML, and found families of genes highly correlated with the latter distinction (Golub et al., 1999). Each new tissue sample was classified based on a vote total from the informative genes, provided that a "prediction strength" measure exceeded a predetermined threshold. However, the technique did not find a set of genes strongly correlated with response to chemotherapy, and class predictors of clinical outcome were less successful.

In a sample of 15 adult AML patients treated with anthracycline- cytarabine, eight failed to achieve remission while seven achieved remissions of 46 - 84 months. Golub et al. "found no evidence of a strong multigene expression signature correlated with clinical outcome, although this could reflect the relatively small sample size". While no prediction results for clinical outcome were presented in the paper, they stated that such class predictors were "not highly accurate in cross-validation". Similarly, Schuster et al. (2000, "Tumor Identification by Gene Expression Profiles: A Comparison of Five Different Clustering Methods", Critical Assessment of Microarray Data Analysis CAMDA'OO) could not predict therapy response using the same data in a study of five different clustering techniques: Kohonen-clustering, Fuzzy-Kohonen- network, Growing cell structures, K-means-clustering, and Fuzzy-K-means- clustering. They found that none of the techniques clustered the patients having similar treatment response. The ALL-AML dataset from Golub et al. (1999) was one of two specified for participants in the CAMDA'OO meeting, and to the applicant's knowledge none reported accurate prediction of treatment response with these data.

Prediction of survival or drug response using gene expression profiles can be achieved with microarrays specialized for non-Hodgkin's lymphoma (Alizadeh et al., 2000, "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature Vol. 403, 503-511) involving some 18,000 cDNAs, or via cluster analysis of 60 cancer cell lines and correlation of drug sensitivity of the cell lines with their expression profiles (Scherf, U., Ross, D.T., Waltham, M., Smith, L.H., Lee, J.K. & Tanabe, L. et al., 2000, "A gene expression database for the molecular pharmacology of cancer", Nature Genet. Vol. 24, 236-244). Also, using clustering, Alon et al. (1999, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays", Proc. Natl. Acad. Sci. USA Vol. 96, 6745-6750) showed that tumor and normal classes could be distinguished even when the genes used had small average differences between the classes.

In the present section, it is shown that a technique for modeling nonlinear systems, called parallel cascade identification (PCI) (Korenberg, 1991) can accurately predict class from gene expression profiles even when the genes do not have large differences in expression levels between the classes. In particular, the technique is able to predict long-term treatment response to chemotherapy with anthracycline-cytarabine, which to the applicant's knowledge was not previously possible with the data from Golub et al. The work in this section shows that gene expression profiles taken at time of diagnosis, and lacking a set of genes strongly correlated with clinical outcome, still enable prediction of treatment response otherwise only evident several years later. Although the sample size of the AML treatment response group is not large, there are several reasons why the class prediction method used here, and its performance over this group, is significant for interpreting gene expression profiles. First, since to the applicant's knowledge neither the Golub et al. (1999) method nor any other published to date has accurately predicted treatment response for this group of patients, it may serve as a benchmark for gauging the sensitivity of new methods. Second, the successful predictions of treatment response by the method used here are readily shown to be statistically significant on well-known tests that examined two different aspects of the prediction. Third, the same method was also shown above to perform well on the ALL vs AML task. While the latter class distinction is much easier, the resulting performance indicates that the proposed method for class prediction has more general applicability for interpreting gene expression profiles. Certainly, because of the simplicity of the ALL/AML distinction, minor differences in the number of profiles correctly classified would not necessarily indicate how different methods would compare on more difficult problems such as prediction of treatment response. However, the performance observed here is also consistent with that in classifying protein sequences (Korenberg et al., 2000a) where the method has been successfully tested on thousands of novel sequences (Korenberg et al., 2000b).

To develop a means for predicting clinical outcome from gene expression profiles, begin by viewing the problem as one of nonlinear system identification, using a "black box" approach. Here, the system to be identified is defined by one or more inputs and one or more outputs; the problem is to build a model whose input/output relation approximates that of the system, with no a priori knowledge of the system's structure. Construct a training input by splicing together the expression levels of genes from profiles known to correspond to failed and to successful treatment outcomes. Define the training output as -1 over input segments corresponding to failed outcomes, and 1 over segments corresponding to successful outcomes. The nonlinear system having this input/output relation would clearly function as a classifier, at least for the profiles used in forming the training input. A model is then identified to approximate the defined input/output behavior, and can subsequently be used to predict the class of new expression profiles. Below, only the first failed and first successful outcome profiles were used to construct the training input; the remaining seven failed and six successful outcome profiles served as tests. The same data were used as in Golub et al. (1999). All samples had been obtained at time of leukemia diagnosis. Each profile contained the expression levels of 6817 human genes (Golub et al., 1999), but because of duplicates and additional probes in the Affymetrix microarray, in total 7129 gene expression levels were present in the profile. Nonlinear system identification has already been used for protein family prediction (Korenberg et al., 2000 a,b), and a useful feature of PCI (Korenberg, 1991) is that effective classifiers may be created using very few training data. For example, one exemplar from each of the globin, calcium-binding, and kinase families sufficed to build parallel cascade two-way classifiers that outperformed (Korenberg et al., 2000b), on over 16,000 test sequences, state- of-the-art hidden Markov models trained with the same exemplars. The parallel cascade method and its use in protein sequence classification are reviewed in Korenberg et al. (2001).

While input x(/) to a parallel cascade model will represent the expression levels of genes, both input and output of the model will be treated as if they were time-series data. The rationality of considering the gene expression values as a time series is now justified, in view of the fact that genes in a profile are not ordered sequentially. In fact, lack of actual time dependency causes no problem: PCI simply looks for a pattern in the data. This point is illustrated by the type of training data that could be used to identify the protein sequence classifiers, e.g. Fig. 2(a) of Korenberg et al. (2000b). There, five-digit binary codes were employed to represent each amino acid in a protein sequence, resulting in subtly different plaid-like patterns for different protein families. Though these patterns were artificial in that they depended upon the five-digit codes used, parallel cascade models could be trained to distinguish between the patterns and thus classify novel protein sequences.

For the approach to work, it is necessary that (1) training exemplars from different classes produce different patterns, and (2) the memory length of the nonlinear system to be identified is of appropriate size to "capture" the pattern for each class. Analogously, to learn how to distinguish between two wood grains, say mahogany and cherry, given one table of each, a much smaller sampling window than an entire tabletop would suffice to make a decision. Moreover, sliding the sampling window over both tabletops would provide multiple training examples of each grain.

EXAMPLE 5 The set of failed outcomes was represented by profiles #28 - #33, #50,

#51 of data from Golub et al. (1999), and the set of successful outcomes by profiles #34 - #38, #52, #53. Raw expression levels of 200 selected genes from the first "failed treatment" profile #28 and first "successful treatment" profile #34 were concatenated to form training input x(/) (Fig. 4A). The genes selected were the 200 having greatest difference in expression levels between the two profiles. Order of appending the selected genes resulted in an almost white input (Fig. 4B), which is typically advantageous for nonlinear system identification, including PCI. (The selected genes had the same relative ordering as in the original profiles, and this ordering caused the input autocovariance to be closest to a delta function, out of several orderings tried.) Corresponding training output y(i) was defined as -1 over the "failed treatment" segment of the input and 1 over the "successful treatment" segment (solid line, Fig. 4C). For this input and output, a model was identified using PCI (Korenberg, 1991). The identified model clearly functions as a "predictor" of treatment response, at least for expression profiles #28 and #34. Indeed, when training input x(/) is fed through the parallel cascade model, resulting output z(i) is predominately negative (average value: -0.238) over the "failed treatment" segment, and predominately positive (average value: 0.238) over the "successful treatment" segment of the input (dashed line, Fig. 4C). The identified model had a mean-square error (MSE) of about 74.8%, expressed relative to the variance of the output signal.

Care was taken to ensure that the test sequences were treated independently from the training data. First, the two profiles used to form the training input were never used as test profiles. Second, the set used to determine a few parameters chiefly relating to model architecture never included the profile on which the resulting model was tested. Thus a model was never trained, nor selected as the best of competing models, using data that included the test profile.

In order to identify the parallel cascade model, four parameters relating mostly to its structure had to be pre-specified. These were: (i) the memory length of the dynamic linear element L that began each cascade, (ii) the degree of the polynomial static nonlinearity N that followed, (iii) the maximum number of cascades allowed in the model, and (iv) a threshold concerning the minimum reduction in MSE required before a candidate cascade could be accepted into the model (Korenberg, 1991). How these parameter settings were determined is explained next.

As noted, only two profiles were used to construct the training input, which left 13 profiles for testing. Each time, 12 of these 13 were used to determine values for the above parameters, and then the parallel cascade model having the chosen parameter settings was tested on the remaining ("held out") profile. This process was repeated until each of the 13 profiles had been tested. The 12 profiles used to determine the parameter values will be referred to as the evaluation set, which never included the profile held out for testing. The parameter values were determined each time by finding the choice of memory length, polynomial degree, maximum number of cascades allowed, and threshold that resulted in fewest errors in classifying the 12 profiles. The limit on the number of cascades allowed actually depended on the values of the memory length and polynomial degree in a trial. The limit was set to ensure that the number of variables introduced into the model was significantly less than the number of output points used in the identification. Effective combinations of parameter values did not occur sporadically. Rather, there were ranges of the parameters, e.g. of memory length and threshold values, for which the corresponding models were effective classifiers. When the fewest errors could be achieved by more than one combination of parameter values, then the combination was selected that introduced fewest variables into the model. If there was still more than one such combination, then the combination of values where each was nearest the middle of the effective range for the parameter was chosen. An upper limit of 15 cascades was allowed in the model to ensure that there would be significantly fewer variables introduced than output points used in the identification.

There was in fact one tie, for two different threshold values, in best performance over the 12-profile evaluation set when profile #53 was held out for testing. The fewest errors resulted, with fewest variables introduced into the model (7 cascades), when the memory length was 12, the polynomial degree was 7, and the threshold 7 was either 11 or 12. However, equally good classification over the evaluation set was observed, though with more model variables, for 7 = 10 (13 cascades), whereas the classification degraded considerably for 7 = 13. Hence the threshold 7 = 11 was chosen since a threshold of 12 was closer to the margin for good performance. Each time it was found that it was possible to achieve good performance (2 - 3 errors depending on the 12 profiles in the evaluation set), and employ few cascade variables, if the same four parameter values were used. This meant that the identical parallel cascade model was in fact chosen for each classification of the held out profile. This model had 7 cascades in total, each beginning with a linear element having memory length of 12, followed by a 7^th degree polynomial static nonlinearity. Figure 5A shows the impulse response functions of the linear elements in the 2^nd, 4^th, and 6^th cascades, and 5B the corresponding polynomial static nonlinearities that followed. Each time, the profile held out for testing was classified by appending, in the same order as used above, the raw expression levels of genes in the profile to form an input signal. This input was then fed through the identified model, and its mean output was used to classify the profile. If the mean output was negative, the profile was classified as "failed treatment", and if positive as "successful treatment". This decision criterion was taken from the^" earlier protein classification study (Korenberg et al., 2000a). Results

The parallel cascade model correctly classified 5 of the 7 "failed treatment" (F) profiles and 5 of the 6 "successful treatment" (S) profiles. The corresponding Matthews' correlation coefficient (Matthews, 1975, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme", Biochim. Biophys. Ac, Vol. 405, 442-451) was 0.5476. Two different aspects of the parallel cascade prediction of treatment response were tested, and both times reached statistical significance. First, the relative ordering of profiles from the two outcome types by their model mean outputs was tested by the Mann-Whitney test, a non-parametric test to determine whether the model detected differences between the two profile types. The second aspect of the PCI prediction concerned how well the individual values of the classifier output for the 7 F and 6 S test profiles correlated with the class distinction.

How did the model mean outputs order the test profiles from the two outcome types? If the parallel cascade model did not detect differences between the two types, then the relative ordering of F and S profiles by value of mean output should be random. The parallel cascade mean outputs were ranked as shown in Table 1 , which indeed demonstrates that mean outputs for F profiles tend to be less than those for S profiles. The corresponding Mann- Whitney test statistics were U = 8, U = 34. This meant that the way the parallel cascade model distinguished between F and S profiles was significant at the 0.0367 level on a one-tailed test, for n-i = 7, n₂ = 6. A one-tailed test was used because, due to the way the model had been trained, the mean output was expected to be less for a failed than for a successful outcome profile. Next, the identified parallel cascade model was found to act as a transformation that converts input sequences of 200 gene expression values each into output signals whose individual values correlate with the F vs S class distinction. Because the model had a memory length of 12, the first 11 points of each output signal were excluded to allow the model to "settle", so that the 13 test profiles gave rise to 13 x 189 = 2457 output values. Of these, 1323 output values corresponded to the 7 F, and 1134 to the 6 S, test profiles. For the S profiles, the proportion of output values that were positive was 109% of the corresponding proportion for F profiles, while the S profiles' proportion of output values that were negative was 90% that for F profiles. Indeed, with a Yates- corrected chi-square of 5.13 (P < 0.0235), output values corresponding to test S profiles tended to be positive more often, and negative less often, than those corresponding to test F profiles. Finally, the actual values of the outputs also correlated with the F vs S distinction. The 1323 output values corresponding to the test F profiles totaled -535.93, while the 1134 output values for test S profiles totaled 3209.14. Indeed, point biserial correlation showed that there was a highly significant relation between the actual values of the output signals for the test profiles and the F vs S class distinction (P < 0.0102). Recall that the model's memory length was 12. Hence, limiting the point biserial correlation to every 12^th output value would avoid any overlap in gene expression levels used to obtain such output values. The relation of these 208 output values to the F vs S distinction is still highly significant (P < 0.0155).

PCI is only one approach to predicting treatment response and other methods can certainly be applied. Importantly, it has been shown here to be possible to predict long-term response of AML patients to chemotherapy using the Golub et al. data, and now wider implications can be considered. For example, the method for predicting clinical outcome described here may have broader use in cancer treatment and patient care. In particular, it has recently been shown that there are significant differences in the gene expression profiles of tumors with BRCA1 mutations, tumors with BRCA2 mutations, and sporadic tumors (Hedenfalk et al, 2001 , "Gene-expression profiles in hereditary breast cancer", N. Engl. J. Med., Vol. 344, 539-548). The present method may be used to distinguish the gene expression profiles of these tumor classes, predict recurrence, and assist in selection of treatment regimen.

Table 3 Parallel cascade ranking of test expression profiles

Rank Mean Output Actual Outcome Profile #

1 -1.17 F 31

2 -0.863 F 32

3 -0.757 F 33

4 -0.408 S 37

5 -0.298 F 50

6 -0.0046 F 30

7 0.0273 S 53

8 0.078 S 38

9 0.110 F 51

10 0J48 F 29

11 0.194 S 52

12 0.267 S 36

13 16.82 S 35

F = failed treatment, S = successful treatment. The complete set of profiles is found in http://www-enome.wi.mit.edu/MPR/data_set_ALL_AML.html (see Golub et al, 1999) and "Profile #" follows the same numbering scheme.

Examples 1 — 3 and 5 can be briefly summarized by the following steps:

1. Compare the gene expression levels in the training profiles and select a set of genes that assist in distinguishing between the classes. Each profile contained 7129 expression values (from 6817 human genes plus duplicates and additional probes) but only 200 values were selected from each profile. Consider first the 2-class distinction, with one training profile available from each class. For each selected gene, the absolute value of the difference between the corresponding raw values on the two training profiles ranked in the top 200 of all the genes. When there were more training profiles from each class, say 15 exemplars from the acute lymphoblastic leukemia (ALL) class and 11 from the acute myeloid leukemia (AML) class, then the first 4 AML training profiles were repeated to givel 5 AML training profiles. For each gene, the mean of its raw expression levels was computed over the 15 ALL training profiles, and the mean was also computed over the 15 AML training profiles. Then the absolute value of the difference between the two means was computed for the gene. The 200 genes having the largest of such absolute values were selected. If instead the model is to distinguish π-classes, n > 2, a criterion could be based on a sum of absolute values of pairwise differences between the means of a gene's expression levels, where each mean is computed over the training profiles for a class.

2. Append values representative of the expression levels of the selected genes from a given profile to produce a segment representative of the class of that profile. Repeat for each profile, maintaining the same order of appending the values representative of the expression levels. For predicting treatment response in Example 5, the raw expression levels as given in the data downloaded from the MIT website http://www-genome.wi.mit.edu/MPR/data set ALL AML.html were the values appended. For the ALL-AML distinction task, in Example 3 where multiple training profiles were used from each class, logarithms (to base 10) of the raw expression levels were used, after increasing to 100 any smaller raw level. The order used to append the values is preferably chosen to assist in distinguishing between the classes. Not all orders are equally good. It was observed that an appending order that causes the training input autocovariance to be nearly a delta function tends to be beneficial. One method for finding such an appending order is through appropriate use of a stochastic minimization technique involving trial interchanges within a segment, while maintaining the same order in all segments. This technique is readily obtained by adapting the method described in Hunter and Kearney, 1983, "Generation of random sequences with jointly specified probability density and autocorrelation functions", Biol. Cybern., Vol. 47, pp. 141-146. A preferred way to choose an appending order is by checking the resulting model's classification accuracy over an evaluation set of known profiles from the classes to be distinguished. In fact, in Examples 1 - 3 and 5, the selected values were always appended in the same relative order that they appeared in the full profile.

3. Concatenate the representative segments to form a training input.

4. Define an input/output relation by creating a training output having values corresponding to the input values, where the output has a different value over each representative segment from a different class. ^{" ■ ■ " ■ .} - - - 5. Identify a parallel cascade model (FIG. 1) to approximate the input/output relation.

Classify a new gene expression profile by (a) appending the expression levels of the same genes selected above, in the same order as above, to produce a segment for input into the identified parallel cascade model; (b) applying the segment to the parallel cascade model and obtaining the corresponding output; and (c) using the output to make a prediction of the class of the new expression profile. One decision criterion, for the two-class case is: if the mean of the parallel cascade output is less than zero, then assign the profile to the first class, and otherwise to the second class. Another criterion (used in Example 3) is based on certain ratios of mean square error (MSE). This criterion compares the MSE of the model output z(/) from -1 , relative to the corresponding MSE over the ALL training segments, with the MSE of z(i) from 1 , relative to the MSE over the AML training segments.

In the above, models have been built to distinguish between two or more classes of interest. However, it will be appreciated that separate models could instead be built for each class using PCI, FOS, OSM, or other model- building techniques. One way to do so is, for each class, to use at least one profile exemplar to obtain a training input comprising a sequence of values.

Next, for each class, obtain a training output by shifting the input signal to advance the sequence. Then, for each class, find a finite-dimensional system to approximate the relation between the training input and output for that class.

Once the models have been built for each class, a query profile (i.e., a profile whose class is to be determined) can be classified in one of two ways. First, an input signal and an output signal can be made from the query profile, then the input is fed through each of the models for the classes, and the model outputs are compared with the output derived from the query profile. The closest "fit" determines the class, using a criterion of similarity such as minimum Euclidean distance. Second, the input and output signals derived from the query profile can be used to find a model, which is then compared with the class models, and the closest one determines the classification of the query profile.

It will be appreciated that the class predictors described herein can be combined with other predictors, such as that of Golub et al. (1999), nearest neighbor classifiers, classification trees, and diagonal linear discriminant analysis.

The above material has described use of nonlinear system identification techniques such as parallel cascade identification, fast orthogonal search, the orthogonal search method, and other methods of model building to interpret gene expression profiles. However, it will be appreciated that the present invention also applies to many other diagnostic profiles representative of biological information, such as proteomics data. Proteomics techniques, for example the use of two-dimensional electrophoresis (2-DE), enables the analysis of hundreds or thousands of proteins in complex mixtures such as whole-cell lysates. (See http://www.uku.fi/~lehesran/proteomics6.htm) Proteome analysis is an effective means of studying protein expression, and has an advantage over use of gene expression profiles, since mRNA levels do not correlate very highly with protein levels. Protein separation through use of 2-DE gels occurs as follows. In the first dimension, proteins are separated by their iso-electric points in a pH gradient. In the second dimension, proteins are separated according to their molecular weights. The resulting 2-DE image can be analyzed, and quantitative values obtained for individual spots in the image. Protein profiles may show differences due to different conditions such as disease states, and comparing profiles can detect proteins that are differently expressed. Such proteomics data can also be interpreted using the present invention, e.g. for diagnosis of disease or prediction of clinical outcome.

(B) PROTEIN SEQUENCE CLASSIFICATION AND RECOGNITION OF

ACTIVE SITES, SUCH AS PHOSPHORYLATION AND ATP-BINDING

SITES, ON PROTEINS

A recent paper (Korenberg et al., 2000a) introduced the approach of using nonlinear system identification as a means for automatically classifying protein sequences into their structure/function families. The particular technique utilized, known as parallel cascade identification (PCI), could train classifiers on a very limited set of exemplars from the protein families to be distinguished and still achieve impressively good two-way classifications. For the nonlinear system classifiers to have numerical inputs, each amino acid in the protein was mapped into a corresponding hydrophobicity value, and the resulting hydrophobicity profile was used in place of the primary amino acid sequence. While the ensuing classification accuracy was gratifying, the use of (Rose scale) hydrophobicity values had some disadvantages. These included representing multiple amino acids by the same value, weighting some amino acids more heavily than others, and covering a narrow numerical range, resulting in a poor input for system identification. Another paper (Korenberg et al., 2000b) introduced binary and multilevel sequence codes to represent amino acids, for use in protein classification. The new binary and multilevel sequences, which are still able to encode information such as hydrophobicity, polarity, and charge, avoid the above disadvantages and increase classification accuracy. Indeed, over a much larger test set than in the original study, parallel cascade models using numerical profiles constructed with the new codes achieved slightly higher two-way classification rates than did hidden Markov models (HMM) using the primary amino acid sequences, and combining PCI and HMM approaches increased accuracy. There are at least two areas where the PCI method can be usefully employed in protein sequence classification. First, it may be an aid to individual scientists engaged in various aspects of protein research. This is because the method can create effective classifiers after training on very few exemplars from the families to be distinguished, particularly when binary (two- way) decisions are required. This can be an advantage, for instance, to researchers who have newly discovered an active site on a protein, have only a few examples of it, and wish to accelerate their search for more by screening novel sequences. Second, as discussed below, the classifiers produced by the approach have the potential of being usefully employed with hidden Markov models to enhance classification accuracy.

In order to have some comparison of performance when both HMM and PCI approaches receive either the primary amino acid sequences or equivalent information, a new scheme for coding the amino acids was used in Korenberg et al. (2000b). The scheme equally weights each amino acid, and causes greater variability in the numerical profiles representing the protein sequences, resulting in better inputs for system identification. At the same time, the relative ranking imposed by the Rose scale can be almost completely preserved. The scheme is similar to, but more elaborate than, one used in recent work on distinguishing exon from intron DNA sequences (M.J. Korenberg, E.D. Lipson, and J.E. Solomon, "Parallel Cascade Recognition of Exon and Intron DNA Sequences", appended hereto as Appendix C). In the latter problem, it was necessary to find a numerical representation for the bases adenine (A), thymine (T), guanine (G), and cytosine (C). In work concerned with efficiently comparing two DNA sequences via crosscorrelation (Cheever et al., 1989, "Using Signal Processing Techniques for DNA Sequence Comparison", in Proc. Northeastern Bioengineeing Symposium, Boston, MA, pp. 173-174) it was suggested that the bases A, T, G, C be represented by respectively 1 , -1 , /, - , where /^' = V-ϊ . Directly using these complex numbers to represent the bases in DNA sequences would require two inputs (for the real and imaginary parts). In order to avoid the need for two-input parallel cascade models, this representation was modified, so that bases A, T, G, C in a sequence were encoded respectively by ordered pairs (0, 1), (0, -1), (1 , 0), (-1 , 0). This doubled the length of the sequence, but allowed use of a single real input. Note that here purines A, G are represented by pairs of same sign, as are pyrimidines C, T. Provided that this biochemical criterion was met, good classification would result. Also, many other binary representations were explored, such as those using only ±1 as entries, but it was found that within a given pair, the entries should not change sign. For example, representing a base by (1 , -1) did not result in a good classifier. The same approach was applied in Korenberg et al. (2000b) to develop numerical codes for representing each of the amino acids. Thus, within the code for an amino acid, the entries should not change sign. To represent the 20 amino acids, each of the codes could have 5 entries, three of them 0, and the other two both 1 or both -1. There are (₂) = 10 such codes of each sign, so the 20 amino acids can be uniquely coded this way. Next, the codes are preferably not randomly assigned to the amino acids, but rather in a manner that adheres to a relevant biochemical property. Consequently, the amino acids were ranked according to the Rose hydrophobicity scale (breaking ties), and then the codes were assigned in descending value according to the binary numbers corresponding to the codes.

The resulting scale in Table 1 , where the bottom half is the negative mirror image of the top half, will be referred to as SARAH1 ("Simultaneously Axially and Radially Aligned Hydrophobicities"). In a similar scale, SARAH2 in Table 2, each code again has 5 entries, but here two of them are 0, while the other three all equal 1 or all equal -1.

Using one of these scales to encode a protein sequence yields a numerical profile 5 times longer than the original sequence, but one where each amino acid is uniquely represented and equally weighted. Alternatively, either of the scales can be employed to translate a protein sequence into a profile consisting of 5 signals, which leads to use of 5-input parallel cascade models (Korenberg, 1991). The greater range covered by the new numerical profiles, compared with the (Rose scale) hydrophobicity profiles, results in improved inputs for training the parallel cascade models, and more accurate classification as shown below. While the above scales carry information about hydrophobicity, scales can similarly be constructed to imbed other chemical or physical properties of the amino acids such as polarity, charge, alpha-helical preference, and residue volume. Since each time the same binary codes are assigned to the amino acids, but in an order dependent upon their ranking by a particular property, the relative significance of various factors in the protein folding process can be studied in this way. It is clear that randomly assigning the binary codes to the amino acids does not result in effective parallel cascade classifiers. In addition, the codes can be concatenated to carry information about a number of properties. In this case, the composite code for an amino acid can have 1 , - 1 , and 0 entries, and so can be a multilevel rather than binary representation.

The application of nonlinear system identification to automatic classification of protein sequences was introduced in Korenberg et al. (2000a). Briefly, begin by choosing representative sequences from two or more of the families to be distinguished, and represent each sequence by a profile corresponding to a property such as hydrophobicity or to amino acid sequence. Then splice these profiles together to form a training input, and define the corresponding training output to have a different value over each family or set of families that the classifier is intended to recognize.

For example, consider building a binary classifier intended to distinguish between calcium-binding and kinase families using their numerical profiles constructed according to the SARAH 1 scale. The system to be constructed is shown in Fig. 1 , and comprises a parallel array of cascades of dynamic linear and static nonlinear elements. We start by splicing together the profiles for the calcium-binding sequence 1SCP and kinase sequence 1 PFK, to form a 4940 point training input x(i) . The input has this length because the 1SCP and 1PFK sequences have 348 and 640 amino acids respectively and, as the SARAH 1 scale is used in this example, each amino acid is replaced with a code 5 digits long. However, as noted above, the scale could have instead been used to create 5 signals, each 988 points in length, for a 5-input parallel cascade model. No preprocessing of the data is employed. Define the corresponding training output y(i) to be -1 over the calcium-binding, and 1 over the kinase, portions of the input. We now seek a dynamic (i.e., has memory) nonlinear system which, when stimulated by the training input, will produce the training output. Clearly, such a system would function as a binary classifier, and at least would be able to distinguish apart the calcium-binding and the kinase representatives.

We can build an approximation to such a dynamic system, given the training input and output, using techniques from nonlinear system identification.

The particular technique we have used in this paper, called parallel cascade identification, is more fully explained in Korenberg (1991), and is summarized below.

Suppose that _x i (/)₅ / = 0,...,/ . are the training input and output (in this example, / = 4939). Then parallel cascade identification is a technique for approximating the dynamic nonlinear system having input „(/) and output y(i) by a sum of cascades of alternating dynamic linear (L) and static nonlinear (N) elements.

The parallel cascade identification method (Korenberg, 1991) can be outlined as follows. A first cascade of dynamic linear and static nonlinear elements is found to approximate the dynamic nonlinear system. The residual, i.e., the difference between the system and the cascade outputs, is calculated, and treated as the output of a new dynamic nonlinear system. A cascade of dynamic linear and static nonlinear elements is now found to approximate the new system, the new residual is computed, and so on. These cascades are found in such a way as to drive the crosscorrelations of the input with the residual to zero. It can be shown that any dynamic nonlinear discrete-time system having a Volterra or a Wiener functional expansion can be approximated, to an arbitrary degree of accuracy in the mean-square sense, by a sum of a sufficient number of the cascades (Korenberg, 1991). For generality of approximation, it suffices if each cascade comprises a dynamic linear element L followed by a static nonlinearity N, and this L N structure was used in the present work, and is assumed in the algorithm description given immediately below.

In more detail, suppose that y_k (i) denotes the residual after the -th cascade has been added to the model, with y (ϊ) = y(i) ■ Let z_k(i) be the output of the /c-th cascade. Then, for /c = 1 , 2,... , ( = -ι(i)-«*(0-

(1)

Assume that the output y(i) depends on input values x(i),x(i-l), ... , x(i -R) , i.e., upon input lags 0,...,R . There are many ways to determine a suitable (discrete) impulse response function h_k (j), j = 0,...,R for the linear element L beginning the /c-th cascade (Korenberg, 1991). One practical solution is to set it equal to either the first order crosscorrelation of the input χ(i) with the current residual y_k_ι (ϊ) , or to a slice of a higher order input

/ residual crosscorrelation, with discrete impulses added or subtracted at diagonal values. Thus, set

(2) if the first order input residual crosscorrelation φ is used, or _k(j) = φ_xxyk__ι (j,A) ± Cδ(j-A)

(3) if the second order crosscorrelation φ is used, and similarly if a ^{r x}*Λ-ι higher order crosscorrelation is instead employed. In Eq. (3), the discrete impulse function δ(j -A) = l, j = A _> ^ar|d equals 0 otherwise. If Eq. (3) is used, then ^ j_s fj_xecι _a one of 0,...,R, the sign of the δ - term is chosen at random, and C is adjusted to tend to zero as the mean-square of the residual v, , approaches zero. If the third order input residual crosscorrelation φ were used, then we would have analogously _k(j) = φ_xxxyk__ι (j,A₁,A₂) ±C_lδ(j-A₁) ±C₂δ(j -A₂)

(4) where A_l,A₂,C₁,C₂ are defined similarly to A,C in Eq. (3).

However, it will be appreciated that many other means can be used to determine the h_k (j) , and that the approach is not limited to use of slices of crosscorrelations. More details of the parallel cascade approach are given in

Korenberg (1991). Once the impulse response h_k (j) has been determined, the linear element's output u_k (ϊ) can be calculated as

R u_k(ϊ) = h_k(j)x(i-j) .

(5)

In Eq. (5), the input lags needed to obtain the linear element's output range from 0 to R, so its memory length is R+1.

The signal u_k(ϊ) is itself the input to a static nonlinearity in the cascade, which may be represented by a polynomial. Since each of the parallel cascades in the present work comprised a dynamic linear element L followed by a static nonlinearity N, the latter's output is the cascade output

D z_k (i) = a_kdu_k ^d(i) . d=0

(6)

The coefficients a_kd defining the polynomial static nonlinearity N may be found by best-fitting, in the least-square sense, the output z_k(i) to the current residual y_k__λ(ϊ) . Once the /c-th cascade has been determined, the new residual y_k (i) can be obtained from Eq. (1), and because the coefficients a_kd were obtained by best-fitting, the mean-square of the new residual is

(7) where the overbar denotes the mean (time average) operation. Thus, adding a further cascade to the model reduces the mean-square of the residual by an amount equal to the mean-square of the cascade output (Korenberg, 1991). This, of course, by itself does not guarantee that the mean-square error (MSE) of approximation can be made arbitrarily small by adding sufficient cascades. However, the cascades can be created in such a way as to drive the input / residual crosscorrelations arbitrarily close to zero for a sufficient number of cascades. This is the central point that guarantees convergence to the least-square solution, for a given memory length R+1 of the dynamic linear element and polynomial degree D of the static nonlinearity. It implies that in the noise-free case a discrete-time system having a Volterra or a Wiener functional expansion can be approximated arbitrarily accurately by a finite sum of these LN cascades. For a proof of convergence, see Korenberg (1991).

Once the parallel cascade model has been identified, we calculate its output due to the training input, and also the MSE of this output from the training output for calcium-binding and kinase portions of the training input. Recall that the training output has value -1 over the calcium-binding portion, and 1 over the kinase portion, of the training input. Hence we compute a first MSE of the model output from -1 for the calcium-binding portion, and a second MSE from 1 for the kinase portion, of the training input.

The parallel cascade model can now function as a binary classifier via an MSE ratio test. A sequence to be classified, in the form of its numerical profile χ(i) constructed according to the SARAH1 scale, is fed to the model, and we calculate the corresponding output

ft=l

(8) where K is the number of cascade paths in the final model. Next, we compare the MSE of _z(i) fr°^m -1. relative to the corresponding MSE for the appropriate training sequence, with the MSE of _z(j) from 1 , again relative to the MSE for the appropriate training sequence. In more detail, the first ratio is

^e\

(9) where e_t is the MSE of the parallel cascade output from -1 for the training numerical profile corresponding to calcium-binding sequence 1SCP. The second ratio computed is

(10) where e₂ is the MSE of the parallel cascade output from 1 corresponding to kinase sequence 1 PFK. Each time an MSE is computed, we commence the averaging on the (R+7)-th point. If r, is less than r₂ , the sequence is classified as calcium-binding, and if greater, as kinase. This MSE ratio test has also been found to be an effective classification criterion in distinguishing exon from intron DNA sequences (Korenberg, Lipson, Solomon, unpublished). As noted below, an effective memory length R+1 for our binary classifiers was 125, corresponding to a primary amino acid sequence length of 25, which was therefore the minimum length of the sequences which could be classified by the models identified in the present example.

Other decision criteria may be used. For example, distributions of output values corresponding to each training input may be computed. Then, to classify a novel sequence, compute the distribution of output values corresponding to that sequence, and choose the training distribution from which it has the highest probability of coming. However, only the MSE ratio criterion just discussed was used to obtain the results in the present example.

Note that, instead of splicing together only one representative sequence from each family to be distinguished, several representatives from each family can be joined (Korenberg et al., 2000a). It is preferable, when carrying out the identification, to exclude from computation those output points corresponding to the first R points of each segment joined to form the training input. This is done to avoid introducing error into the identification due to the transition zones where the different segments of the training input are spliced together.

By means of this parallel cascade identification and the appropriate training input and output created as described earlier, three classifier models were found, each intended to distinguish between one pair of families. For example, a parallel cascade model was identified to approximate the input / output relation defined to distinguish calcium-binding from kinase sequences. The three models corresponded to the same assumed values for certain parameters, namely the memory length R+1, the polynomial degree D, the maximum number of cascades permitted in the model, and a threshold for deciding whether a cascade's reduction of the MSE justified its inclusion in the model. To be acceptable, a cascade's reduction of the MSE, divided by the mean-square of the current residual, had to exceed the threshold J divided by the number of output points /₁ used to estimate the cascade, or equivalently:

(1 1)

This criterion (Korenberg, 1991) for selecting candidate cascades was derived from a standard correlation test.

The parallel cascade models were identified using the training data for training calcium-binding vs kinase classifiers, or on corresponding data for training globin vs calcium-binding or globin vs kinase classifiers. Each time the same assumed parameter values were used, the particular combination of which was analogous to that used in the DNA study. In the latter work, it was found that an effective parallel cascade model for distinguishing exons from introns could be identified when the memory length was 50, the degree of each polynomial was 4, and the threshold was 50, with 9 cascades in the final model. Since in the DNA study the bases are represented by ordered pairs, whereas here the amino acids are coded by 5-tuples, the analogous memory length in the present application is 125. Also, the shortest of the three training inputs here was 4600 points long, compared with 818 points for the DNA study. Due to the scaling factor of 5/2 reflecting the code length change, a roughly analogous limit here is 20 cascades in the final models for the protein sequence classifiers. In summary, the default parameter values used in the present example were memory length (R+1) of 125, polynomial degree D of 4, threshold T of 50, and a limit of 20 cascades. As shown in Figure 2b of Korenberg (2000b), when the training input of Fig. 2a of that paper is fed through the calcium-binding vs kinase classifier, the resulting output is indeed predominately negative over the calcium-binding portion, and positive over the kinase portion, of the input. The next section concerns how the identified parallel cascade models performed over the test sets. CLASSIFICATION RESULTS FOR TEST SEQUENCES

All results presented here are for two-way classification, based on training with a single exemplar from the globin, calcium-binding, and kinase families.

Original Test Set Used in Korenberg et al. (2000a)

The detailed results are shown in Table 3 for the SARAH1 and SARAH2 encoding schemes, with correct classifications by PCI averaging 85% and 86% respectively. These should be compared with a 79% average success rate in the earlier study (Korenberg et al., 2000a) where Rose scale hydrophobicity values were instead used to represent the amino acids.

Large Test Set The detailed results are shown in Table 4 for PCI using the SARAH 1 encoding scheme, for the hidden Markov modeling approach using the Seq u e n ce Al ig n ment an d M odel i ng System (SAM) (http://www.cse.ucsc.edu/research/compbio/sam.html), and for the combination of SAM with PCI. The SARAH2 scheme was not tested on this set. The combination was implemented as follows. When the HMM probabilities for the two families to be distinguished were very close to each other, the SAM classification was considered to be marginal, and the PCI result was substituted. The cutoff used with SAM was 1 in the NLL-NULL score, which is the negative log of the probability of a match. This cutoff was determined using the original test set of 253 protein sequences used in Korenberg et al. (1991). The extent to which the PCI result replaced that from SAM depended on the pair of families involved in the classification task, and ranged from 20-80% with an average of 60%.

Parallel cascade identification has a role in protein sequence classification, especially when simple two-way distinctions are useful, or if little training data is available. Binary and multilevel codes were introduced in Korenberg et al. (2000b) so that each amino acid is uniquely represented and equally weighted. The codes enhance classification accuracy by causing greater variability in the numerical profiles for the protein sequences and thus improved inputs for system identification, compared with using Rose scale hydrophobicity values to represent the amino acids. Parallel cascade identification can also be used to locate phosphorylation and ATPase binding sites on proteins, applications readily posed as binary classification problems.

(c) PREDICTING WHETHER A MOLECULE WILL EXHIBIT

BIOLOGICAL ACTIVITY, E.G., IN DRUG DISCOVERY, INCLUDING

THE SCREENING OF DATABASES OF SMALL MOLECULES TO

IDENTIFY MOLECULES OF POSSIBLE PHARMACEUTICAL USE In "Textual And Chemical Information Processing: Different Domains

B u t S i m i l a r A l g o r i t h m s "

(http://www.shef.ac.uk/~is/publications/infres/paper69.html). Peter Willett at Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Sheffield S10 2TN, UK, describes a genetic algorithm for determining whether a molecule is likely to exhibit biological activity. Relevant properties such as molecular weight, the _k (a shape index) and the number of aromatic rings, rotatable bonds, hydrogen-bond donor atoms and hydrogen-bond acceptor atoms. Each property (also called a feature) was represented numerically by a value that was allowed to fall within one of 20 bins allocated for that property. The genetic algorithm was used to calculate a weight for each bin of each property, based on a training set of compounds for which the biological activities are available.

The same approach described in this application for predicting the class of gene expression profiles, or for classifying protein sequences or finding active sites on a protein can be used to determine whether a molecule will possess biological activity. For each compound in the training set, the numerical values for the relevant properties can be appended to form a segment, always following the same order of appending the values. A training input can then be constructed by concatenating the segments. The training output can then be defined to have a value over each segment of the training input that is representative of the biological activity of the compound corresponding to that segment. Parallel cascade identification or another model-building technique can then be used to approximate the input/output relation. Then a query compound can be assessed for biological activity by appending numerical values for the relevant properties, in the same order as used above, to form a segment which can be fed to the identified model. The resulting model output can then be used to classify the query compound as to its biological activity using some test of similarity, such as sign of the output mean (Korenberg et al., 2000a) or the mean-square error ratio (Korenberg et al., 2000b).

(d) DISTINGUISHING EXON FROM INTRON DNA AND RNA

SEQUENCES, AND DETERMINING THEIR BOUNDARIES This application is described in Appendix C.

(e) Combining Non-Linear System Identifications Methods with Other Methods

As noted above, the method described by Golub et al. provided strong predictions with 100% accurate results for 29 of 34 samples in a second data set after 28 ALL and AML profiles in a first set were used for training. The remaining 5 samples in the second set were not strongly predicted to be members of the ALL or AML classes. The non-linear method of the present invention may be combined with Golub's method to provide predictions for such samples which do not receive a strong prediction. In particular, Golub's method may first be applied to a sample to be tested. Golub's method will provide weighted votes of a set of informative genes and a prediction strength. Samples that receive a prediction strength below a selected threshold may then be used with the parallel cascade indentification technique model described above to obtain a prediction of the sample' s classification.

Additional Embodiments

1. The identified parallel cascade model can be used to generate "intermediate signals" as output by feeding the model each of the segments used to form the training input. These intermediate signals can themselves be regarded as training exemplars, and used to find a new parallel cascade model for distinguishing between the corresponding classes of the intermediate signals. Several iterations of this process can be used. To classify a query sequence, all the parallel cascade models would need to be used in the proper order.

2. Note that while the mean and MSE ratios have been discussed in particular as means for making classification decisions, combinations of these quantities can be used. In addition, in place of using a mean-square test of similarity to each class (as inuse of the MSE ratios), analogous tests using absolute values or a higher power than 2 can be employed.

Appendix A: Korenberg et al., 2000a, "Parallel Cascade Identification as a Means for Automatically Classifying Protein Sequences into Structure/Function Groups", vol. 82, pp. 15-21

Biol. Cybern. 82, 15-21 (2000)

Biological Cybernetics

Parallel cascade identification as a means for automatically classifying protein sequences into structure/function groups

Michael Korenberg¹, Jerry E. Solomon², Moira E. Regelson² . .

' Department or Electrical and Computer Engineering, Queen's University, Kingston, Ontario, K7 ₃N6, Canada Center for Computational Biology, The Beckman Institute, California Institute of Technology, Pasadena, CA 91125

Received: 16 March 1998 /Accepted in revised form: 2 June 1999

Abstract. Current methods for automatically classifying Here, in a pilot study, we propose classifying protein protein sequences into structure/function- groups, based sequences by means of a technique for modeling dyon their hydrophobicity profiles, have typically required namic nonlinear systems, known as parallel cascade large training sets. The most successful of these, methods identification (Korenberg 1991), and show that it is are based on hidden Markov models, but may require capable of highly accurate performance in experimental hundreds of exemplars for training in order to obtain trials. This method appears to be equally or more disconsistent results. In this paper, we describe a new criminating than other techniques (Krogh et al. 1994; approach, based on nonlinear system identification, McLachlan 1993; Regelson 1997; Stultz et al. 1993; which appears to require little training data to achieve White et al. 1994) when the size of the training sets is highly promising results. limited. Remarkably, when only a single sequence from each of the glόbϊή, calcium-bindmg,~aHd"kinase families was used to train the parallel cascade models, an overall two-way classification accuracy of about 79% was obtained in classifying novel sequences from the

1 Introduction families.

This paper is addressed to managers of large protein

Stochastic modeling of hydrophobicity profiles of prodatabases to inform them of an emerging methodology tein sequences has led to methods being proposed for for automatic classification of protein sequences. It is automatically classifying the sequences into structure/ believed that the proposed method can usefully be emfunction groups (Baldi et al. 1994; Krogh et al. 1994; ployed to supplement currently used classification techRegelson 1997; Stultz et al. 1993; White et al. 1994). niques, such as those based on hidden Markov models. Various linear modeling techniques for protein sequence This is because, as detailed below, the new method ap-- classification, including a Fourier domain approach pears to require only a few representative protein se(McLachlan 1993), have been suggested but most have quences from families to be. distinguished in order to not shown impressive performance may in experimental construct effective classifiers. Hence, for each classificatrials (about 70% in two-way classifications). A hidden tion task, the accuracy may be enhanced by building Markov modeling approach (Baldi et al. 1994; Krogh numerous classifiers, each trained on different exemplars et al. 1994; Regelson 1997) has been more effective in from the protein families, and have these classifiers vote classifying protein sequences, but its performance may ■on new classification decisions (see Discussion). Because depend on the availability of large training sets and the proposed method appears to be effective while difrequire a lengthy training time. This is because thoufering considerably from that of hidden Markov models, sands of parameters might be incorporated into each there are likely benefits from employing them, together. hidden Markov model to obtain reasonable classificaFor example, when the two methods reach the same tion accuracy (Baldi et al. 1994; Regelson 1997). Hence, classification decision, there may well be increased cona potential drawback of the hidden Markov modeling fidence in the result. approach to classifying proteins is the possible need of It is also hoped that individual scientists involved in using large training sets (i.e., hundreds of exemplars) in various aspects of protein research will be interested in order to obtain consistent results (Regelson 1997). the new approach to automatically classifying protein sequences. For this reason, some detail is provided about the parallel cascade identification method, and

Correspondence to: . Korenberg also the construction of one particular protein sequence

(e-mail: ^• korenber@post.queensu.ca classifier (globin versus calcium-binding) is elaborated

Tel.: + 1-613-5452931, Fax: + 1-613-5456615) upon as an example. 2 System and methods yk(t) = j%-i(0 - zj.(0 (1)

For managers of large protein databases, discussion of where y₀(i) = y(i). The parallel cascade output, z(j), will the modest requirements of computer memory and be the sum of the individual cascade outputs z_k(i). The processing time is not crucial. However, the individual (discrete) impulse response function of the dynamic researcher who might wish to investigate this approach linear element beginning each cascade can, optionally, further may be interested in knowing that the protein be defined using a first-order (or a slice of a higher- sequence classification algorithm was implemented (in order) crosscorrelation of the input with the latest Turbo Basic source code) on a 90-MHz Pentium. Only residue (discrete impulses δ are added at diagonal values 640 kilo-bytes of local memory (RAM) are actually when higher-order crosscorrelations are utilized). When required for efficient operation. Training times were only constructing the Ath cascade, note that the latest residue a few seconds while subsequent classification of a new is Λ-ι(i)- Thus; the impulse response function /»*( ), protein sequence could be made in a fraction of a j = 0, ... ,R, for the linear element beginning the Λth second. cascade will have the form

Our data consisted of hydrophobicity profiles of sequences from globin, calcium-binding protein, and Λ*C/) = <fc>.-,(/) (2) protein kinase families. The particular mapping of if the first-order input/residue crosscorrelation ψ _t is amino acid to hydrophobicity utilized is the "Rose" used, or scale shown in Table 3 of Cornette et al. (1987), and the resulting profiles were not normalized. The protein seh_tJD _^ Φ^U ± CδV-A) (3) quences were taken from the Brookhaveπ Protein Data Base of known protein structures. . if the second-order crosscorrelation Φ_xxy^, is used, and similarly if a higher-order crosscorrelation is instead employed (Korenberg 1991). In (3), the discrete impulse

2.1 Algorithm description function _•(/^' — A) = 1 if j = A, and equals 0 otherwise. If (3) is used, then A is fixed at one of the values Q, ... ,R, and

Consider a .discretertime dynamic .(i.e.-, has memory) C is adjusted to tend to zero as the mean-square of the nonlinear system described as a "black box", so that the residue approaches zero. The linear element's output is only available information about the system is^' its input x i) and output y{i), i = 0, ... ,/. Parallel cascade U (ή = h_k(J)x(i-j) identification (Korenberg 1991) is a method for con(4) structing an approximation, to an arbitrary degree of accuracy, of the system's input/output relation using a Next, the static nonlinearity, in the form of a polynosum of cascaded elements, when the system has a Wiener mial, can be best-fitted, in the least-square sense, to the or Volterra functional expansion. Each cascade path residue y_k-_\ (.). If a higher-degree (say, >5) polynomial is comprises alternating dynamic linear and static nonlinto be best-fitted, then for increased accuracy scale the ear elements, and the parallel array can be built up one linear element so that its output, «jt(z), which is the input cascade at a time in the following way. to the polynomial, has unity mean-square. If D is the

A first cascade of dynamic linear and static nonlinear degree of the polynomial, then the output of the static elements is found to approximate the input/output renonlinearity, and hence the cascade output, has the form lation of the nonlinear system to be identified. The residue - i.e., the difference between the system and the cascade outputs - is treated as the output of a new dy(5) namic nonlinear system, and a second cascade is found

to approximate the latter system. The new residue is The new residue is then calculated from (1). Since the computed, a third cascade can be found to improve the polynomial in (5) was least-square fitted to the residue approximation, and so on. These cascades are found in y_k-ι{i), it can readily be shown that-the mean-square of such a way as to drive the input/residue crosscorrela- the new residue yt(i) is tions to zero. It can be shown (Korenberg 1991) that any nonlinear system having a Volterra or Wiener functional expansion (Wiener 1958) can be approximated to an ^j*(Q =^j*-ι(0 - 4<0 (6) arbitrary degree of accuracy in the mean-square sense by where the bars denote the mean (time average) operaa sum of a sufficient number of the cascades. For gention. Thus, adding a further cascade to the model erality of approximation, it suffices if each cascade reduces the mean-square of the residue by an amount comprises a dynamic linear element followed by a static equal to the mean-square of the cascade output. nonlinearity, and this cascade structure was used in the In the simple procedure used in the present study, the present work. However, additional alternating dynamic impulse response of the dynamic linear element beginlinear and static nonlinear elements could optionally be ning each cascade was defined using a slice of a cross- inserted into any cascade path. correlation function, as just described. Alternatively, a

If jj_fc(ι') denotes the residue after adding the ifcth casnonlinear mean-square error (MSE) minimization techcade, then for k > 1, nique can be used to best-fit the dynamic linear and static nonlinear elements in a cascade to the residue one from the calcium-binding family (Brookhaven (Korenberg 1991). Then, the new residue is computed, designation lscp, with 348 amino acids), and one from the rmnimization technique is used again to best-fit anthe general kinase family (Brookhaven designation lpfk, other cascade, and so on. This is much faster than using with 640 amino acids). These profiles are unusually long an MSE minimization technique to best-fit all cascades since they are multidomain representatives of their at once. A variety of such minimization techniques, e.g., respective families, e.g., the average length of globin the Levenberg-Marquardt procedure (Press et al. 1992), family proteins is about 175 amino acids. The lengthier are available, and the particular choice of minimization profiles were selected to enable construction of suffitechnique is not crucial to the parallel cascade approach. ciently elaborated parallel cascade models. AlternativeThe central idea here is that each cascade can be chosen ly, one could of course concatenate a number of profiles to minimize the remaining MSE (Korenberg 1991) such from the same family together, but we were interested in that crosscorrelations of the input with the residue are exploring the information content of single profiles. driven to zero. Alternatively, various iterative proceOnly two-way (i.e., binary) classifiers were condures can be used to successively update the dynamic structed in the present work; a multistate classifier can linear and static nonlinear elements, to increase the rereadily be realized by an arrangement of binary classiduction in MSE attained by adding the cascade to the fiers. To build a parallel cascade classifier to distinguish model. However, such procedures were not needed in between, say, globin and calcium-binding protein famithe present study to obtain good results. lies, begin by splicing together the two selected profiles

A key benefit of the parallel cascade architecture is from these families (forming a 920-point training input). that all the memory components reside in the dynamic Define the corresponding training output to be -1 over linear elements, while the nonlinearities are localized in the globin portion and 1 over the calcium-binding static functions. Hence, approximating a dynamic sysportion of the input. The system to be constructed is tem with higher-order nonlinearities merely requires esshown in block-diagram form in Fig. 1, and comprises a timating higher-degree polynomials in the cascades. This parallel cascade model followed by an averager. Figis much faster, and numerically more stable than, say, ure 2a shows the input and corresponding output used approximating the system with a functional expansion to train the globin versus calcium-binding classifier. and estimating its higher-order kernels. Nonlinear sysThe input output data were used to build the parallel tem identification techniques are finding a variety of cascade model, but a number of basic parameters had to interesting applications and, for example, are currently be chosen. These were the memory length of the dynamic being used to detect deterministic dynamics in experilinear element beginning each cascade, the degree of the mental time series (Barahona and Poon 1996; Korenberg polynomial which followed, the maximum number of 1991). However, the connection of nonlinear system cascades permitted in the model, and a threshold based identification with classifying protein sequences appears on a correlation test for deciding whether a cascade's to be entirely new and surprisingly effective, and is reduction of the MSE justified its addition to the model. achieved as follows. These parameters were set by testing the effectiveness of

Suppose that we form an input signal by concatecorresponding identified parallel cascade models in nating one or more representative hydrophobicity proclassifying sequences from a small verification set. files from each of two families of protein sequences to be This set comprised 14 globin, 10 calcium-binding, and ^' distinguished. We define the corresponding output sig11 kinase sequences, not used to identify the parallel nal to have a value of -1 over each input segment from cascade models. It was found that effective models were the first family, and a value of 1 over each segment from produced when the memory length was 25 for the linear the second. The fictitious nonlinear system which could elements (i.e., their outputs depended on input lags map the input into the output would function as a 0, ... , 24), the degree of the polynomials was 5 for globin classifier. Nothing is known about this nonlinear system versus calcium-binding, and 7 for globin versus kinase or save its input and output, which suggests resorting to a calcium-binding versus kinase classifiers, with 20 cas"black box" identification procedure. Moreover, the cades per model. A cascade was accepted into the model ability of parallel cascade identification to conveniently only if its reduction of the MSE, divided by the mean- approximate dynamic systems with high-order nonlinsquare of the previous residue, exceeded a specified earities can be crucial for developing an accurate clasthreshold divided by the number of output points used sifier and, in fact, this approach proved to be effective, as to fit the cascade (Korenberg 1991). For globin versus is shown next. calcium-binding and calcium-binding versus kinase classifiers, this threshold was set at 4 (roughly corresponding to a 95% confidence interval were the residue-

3 Implementation independent Gaussian noise), and for the globin versus kinase classifier the threshold was 14. For a chosen

Using the parallel-cascade approach, we wished to memory length of 25, each parallel cascade model would investigate how much information about a structure/ have a settling time of 24, so we excluded from the function family could be carried by one protein sequence identification those output points corresponding to the in the form of its hydrophobicity profile. Therefore, we first 24 points of each distinct segment joined to form the selected one protein sequence from the globin family input. The choices made for memory length, polynomial (Brookhaven designation lhds, with 572 amino acids), degree, and maximum number of cascades ensured that

Fig. 1. Use of a parallel cascade model to classify a protein sequence into one of two families. Each £ is a dynamic linear element with settling time (i.e., maximum input lag) R, and each JY is a static nonlinearity. The protein sequence in the form of a hydrophobicity profile - (j'), t' = 0, ... ,/, is fed to the parallel cascade model, and the average model output z is computed. If 2 is less than zero, the sequence is classified in the first family, and if greater than zero in the second family

Fig. 2. a The training input and output used to identify the parallel cascade model for distinguishing globin from calcium-binding sequences. The input -c(ι) was formed by splicing together the hydrophobicity profiles of one representative globin and calcium-binding sequence. The output y(i) was defined to be -1 over the globin portion of the input, and 1 over the calcium-binding portion, b The training output y(i) and the calculated output r(i) of the identified parallel cascade model evoked by the training input of (a). Note that the calculated output tends to be negative (average value: -0.5 ) over the globin portion of the input, and positive (average value: 0.19) over the calcium-binding portion there were fewer variables introduced into a parallel tained. Using the training input and output of Fig. 2a, cascade model than output points used to obtain the we identified a parallel cascade model via the procedure model. Training times ranged from about 2 s (for a (Korenberg 1991) described above, for assumed values threshold of 4) to about 8 s (for a threshold of 14). of memory length, polynomial degree, threshold, and

In more detail, consider how the classifier to distinmaximum number of cascades allowable. We then tested guish globin^ from calcium-binding sequences was obthe identified model for its ability to differentiate between globin and calcium-binding sequences from the averaging on the 25th point). The sign of this average small verification set. We observed that the obtained determines the decision of the binary classifier (see models were not good classifiers unless the assumed Fig. 1). More sophisticated decision criteria are under memory length was at least 25, so this smallest effective active investigation, but were not used to obtain the value was selected for the memory length. present results. Over the verification set, the globin

For this choice of memory length, the best globin versus calcium-binding classifier, as noted, recognized all versus calcium-bindi g classification resulted when the 14 globin and 9 of the 10 calcium-binding sequences. polynomial degree was 5 and the threshold was 4, or The globin versus kinase classifier recognized 12 of 14 when the polynomial degree was 7 and the threshold was globin, and 10 of 11 kinase sequences. The calcium- 14. Both these classifiers recognized all 14 globin and 9 binding versus kinase classifier recognized all 10 calciof 10 calcium-binding sequences in the verification set. um-binding and 9 of the 11 kinase sequences. The same In contrast, for example, the model found for a polybinary classifiers were then appraised over a larger test nomial degree of 7 and threshold of 4 misclassified one set comprising 150 globin, 46 calcium-binding, and 57 globin and two calcium-binding sequences. Hence, to kinase sequences, which did not include the three seobtain the simplest effective model, a polynomial degree quences used to construct the classifiers. The globin of 5 and threshold of 4 were chosen. There are two versus calcium-binding classifier correctly identified 96% reasons for setting the polynomial degree to be the (144) of the globin and about 85% (39) of the calcium- minimum effective value. First, this reduces the number binding hydrophobicity profiles. The globin versus kiof parameters introduced into the parallel cascade nase classifier correctly identified about 89% (133) of the model. Second, there are less numerical difficulties in globin and 72% (41) of the kinase profiles. The calcium- fitting lower-degree polynomials. Indeed, extensive testbinding versus kinase classifier correctly identified about ing has shown that when two models perform equally 61% (28) of the calcium-binding and 74% (42) of the well on a verification set, the model witfi the lower-dekinase profiles. Interestingly, a blind test of this classifier gree polynomials usually performs better on a new test had been conducted since five hydrophobicity profiles set. had originally been placed in the directories for both the

It remained to set the maximum number of cascades calcium-binding and the kinase families. The classifier to be allowed in the model. Since the selected memory correctly identified each of these profiles as belonging to length was 25 and polynomial degree was 5, each dythe calcium-binding family. Out of the 506 two-way namic linear/static nonlinear cascade path added to the classification decisions made by the parallel cascade model introduced a further 25 + 6 = 31 parameters. models on the test set, 427 (=84%) were correct, but the As noted earlier, the training input and output each large number of globin sequences present skews the recomprised 920 points, and we excluded from the idensult. If an equal number of test sequences had been tification those output points corresponding to the first available from each family, the overall two-way classi24 points of each segment joined to form the input. This fication accuracy expected would be about 79%. The left 872 output points available for identifying the partwo-way classification accuracies for globin, calcium- allel cascade model, which must exceed the number of binding, and kinase sequences (i.e., the amount correctly (independent) parameters to be introduced. Hence at identified of each group) were about 92%, 73%, and most 28 cascade paths could be justified, but to allow 73%, respectively. These success rates cannot be directly some redundancy, a maximum of 20 cascades were alcompared with those reported in the literature (Krogh lowed. This permitted 620 parameters in the model, et al. 1994; Regelson 1997) for the hidden Markov slightly more than 70% of the number of output points model approach because the latter attained such success used for the identification. While normally such a high rates with vastly more training data than required in our amount of parameters is inappropriate, it could be tolstudy (see Discussion). erated here because of the low-noise "experimental" Using 2 x 2 contingency tables, we computed the chi- conditions. That is, well-established hydrophobicity square statistic for the classification results of each of the profiles were spliced to form the training input, and the three binary classifiers over the larger test set. The null training output, defined to have the values -1 and 1, was hypothesis states that the classification criterion used by precisely known as explained above. a parallel cascade model is independent of the classifi¬

In this way, we employed the small verification set to cation according to structure/function. For the null determine values of memory length, polynomial degree, hypothesis to be rejected at the 0.001 level of significance threshold, and maximum number of cascades allowable, requires a chi-square value of 10.8 or larger. The comfor the parallel cascade model to distinguish globin from puted chi-square values for the globin versus calcium- calcium-binding sequences. Recall that the training inbinding, globin versus kinase, and calcium-binding put and output of Fig. 2a had been used to identify the versus kinase classifiers are =129, =75, and 12.497, model. Figure 2b shows that the calculated output of the respectively, indicating high statistical significance. identified model, when stimulated by the training input, How does the length of a protein sequence affect its indeed tends to be negative over the globin portion of classification? For the 150 test globin sequences, the the input, and positive over the calcium-binding portion. average length ( ± the sample standard deviation σ) A test hydrophobicity profile input to a parallel caswas 148.3 ( ± 15.1) amino acids. For the globin versus cade model is classified by computing the average of the calcium-binding and globin versus kinase classifiers, the resulting output post settling time (i.e., commencing the average length of a misclassified globin sequence was 108.7 ( ± 36.4) and 152.7 ( ± 24) amino acids, respecsent the corresponding protein sequence. The families to tively, the average length of correctly classified globin be distinguished, namely globin, calcium-binding, kisequences was 150 ( ± 10.7) and 147.8 ( ± 13.5), renase, and a "random" group drawn from 12 other spectively. The globin versus calcium-binding classifier classes, were represented by over 900 training sequences, misclassified only six globin sequences, and it is difficult with calcium-binding having the smallest number, 116. to draw a conclusion from such a small number, while Successful classification rates on novel test sequences, the other classifier misclassified 17 globin sequences. using trained left-to-right hidden Markov models, Accordingly, it is not clear that globin sequence length ranged over 92-97% for kinase, globin, and "random" significantly affected classification accuracy. classes, and was a little less than 50% for calcium-

Protein sequence length did appear to influence calbinding proteins (Table 4.30 in Regelson 1997). These cium-binding classification accuracy. For the 46 test results illustrate that, with sufficiently large training sets,' calcium-binding sequences, the average length was 221.2 left-to-right hidden Markov models are very well suited ( ± 186.8) amino acids. The average length of a misto distinguishing between a number of structural/ classified calcium-binding sequence, for the globin verfunctional classes of protein (Regelson 1997). sus calcium-binding and calcium-binding versus kinase It was also clearly demonstrated that the size of the classifiers, was 499.7 ( ± 294.5) with seven sequences training set strongly influenced generalization to the test misclassified, and 376.8 ( ± 218) with 18 misclassified, set by the hidden Markov models (Regelson 1997). For respectively. The corresponding average lengths of corexample, in other of Regelson's experiments, the kinase rectly classified calcium-binding sequences were 171.2 training set comprised 55 short sequences (128-256 ( ± 95.8) and 121.1 ( ± 34.5), respectively, for these amino acids each) represented by transformed property classifiers. profiles, which included power components from Rose

Finally, for the 57 test kinase sequences, the average scale hydrophobicity profiles. All of these training selength was 204.7 ( ± 132.5) amino acids. The average quences could subsequently be recognized, but none of length of a misclassified kinase sequence, for globin the sequences in the test set (Table 4.23 in Regelson versus kinase and calcium-binding versus kinase 1997), so that 55 training sequences from one class were classifiers, was 159.5 ( ± 137.3) with 16 sequences misstill insufficient to achieve class recognition. classified, and 134.9 ( ± 64.9) with 15 misclassified, reThe protein sequences in our study are a randomly spectively. The corresponding average lengths of selected subset of the profiles used by Regelson (1997). correctly classified kinase sequences, for these classifiers, The results reported above for parallel cascade classifiwere222.4 ( ± 126.2) and 229.7 ( ± 141.2), respectively. cation of protein sequences surpass those attained by

Thus, sequence length may have affected classification various linear modeling techniques described in the litaccuracy for calcium-binding and kinase families, with erature. A direct comparison with the hidden Markov average length of correctly classified sequences being modeling approach has yet to be done based on the shorter than and longer than, respectively, that of inamount of training data used in our study. While three correctly classified sequences from the same family. protein sequence hydrophobicity profiles were used to However, neither the correctly classified nor the misconstruct the training data for the parallel cascade classified sequences of each family could be assumed to models, an additional 35 profiles forming our verificacome from normally distributed populations, and the tion set were utilized to gauge the effectiveness of trial number of misclassified sequences was, each time, much values of memory length, polynomial degree, number of less than 30. For these reasons, statistical tests to decascades, and thresholds. However, useful hidden Martermine whether differences in mean length of correctly kov models might not be trainable on only 38 hydroclassified versus misclassified sequences are significant phobicity profiles in total, and indeed it is clear from will be postponed to a future study with a larger range of Regelson (1997) that several hundred profiles could sequence data. Nevertheless, the observed differences in sometimes be required for training to obtain consistent means of correctly classified and misclassified sequences, results. for both calcium-binding and kinase families, suggest Therefore, for the amount of training data in our that classification accuracy may be enhanced by training pilot study, parallel cascade classifiers appear to be with several representatives of these families. Two alcomparable to other currently available protein seternative ways of doing this are discussed in the next quence classifiers. It remains open how parallel cascade section. and hidden Markov model performance compare using the large training sets often utilized for the latter approach. However, because the two approaches differ

4 Discussion greatly, they may tend to make their classification errors on different sequences, and so might be used together to

The most effective current approach for protein seenhance accuracy. quence classification into structure/function groups uses Several questions and observations are suggested by hidden Markov models, a detailed investigation of the results of our pilot study so far. Why does a memory which was undertaken by Regelson (1997). Some of length of 25 appear to be optimal for the classifiers? her experiments utilized hydrophobicity profiles (Rose Considering that hydrophobicity is a major driving force scale, normalized) from each of which the 128 most in folding (Dill 1990) and that hydrophobic-hydropho- significant power components were extracted to reprebic interactions may frequently occur between amino acids which are well separated along the sequence, but the method in recognizing calcium-binding and kinase nearby topologically, it is not surprising that a relatively proteins, but was less evidently so for globins. This long memory may be required to capture this informasuggested that using further calcium-binding and kinase tion. It is also known from autoregressive moving avexemplars of differing lengths in training the parallel erage (ARMA) model studies (Sun and Parthasarathy cascade classifiers may be especially important to in. 1 94) that hydrophobicity profiles exhibit a high degree crease classification accuracy. of long-range correlation. Further, the apparent domiThe present work appears to confirm that hydronance of hydrophobicity in the protein folding process phobicity profiles store significant information conprobably accounts for the fact that hydrophobicity cerning structure and/or function as was observed by profiles carry a considerable amount of information reRegelson (1997). Our work also indicates that even a garding a particular structural class. It is also interesting single protein sequence may reveal much about the to note that the globin family in particular exhibits a characteristics of the whole family, and that parallel high degree of sequence diversity, yet our parallel cascascade identification is a particularly efficient means of cade models were especially accurate in recognizing- extracting characteristics which distinguish the families. members of this family. This suggests that the models We are now exploring the use of parallel cascade idendeveloped here are detecting structural information in tification to distinguish between coding (exon) and the hydrophobicity profiles. non-coding (intron) DNA or RNA sequences. Direct

In future work, we will construct multi-state classifiapplications of this work are both in locating genes and ers, formed by training with an input of linked hydroincreasing our understanding of how RNA is spliced in phobicity profiles representing, say, three distinct making proteins. families, and an output which assumes values of, say, -1, 0, and 1 to correspond with the different families Acknowledgements. Supported in part by grants from the Natural represented. This work will consider the full range of Sciences and Engineering Research Council of Canada. We thank sequence data available in the Swiss-Prot sequence data the referees for their astute comments on the manuscript. base. We will compare the performance of such multi- state classifiers with those realized by an arrangement of binary classifiers. In addition, we will investigate the References improvement in performance afforded by training with an input having a number of representative profiles from Baldi P, Chauvin Y, Hunkapiller T, McClure MA (1994) Hidden each of the families to be distinguished. An alternative Markov models of biological primary sequence information. strategy to explore is identifying several parallel cascade Proc Natl Acad Sci USA 91:1059-1063

Barahona M, Poon C-S (1996) Detection of nonlinear dynamics in classifiers, each trained for the same discrimination task, short, noisy time series. Nature 381:215-217 using a different single representative from each family Cornette JL, Cease KB, Margalit H, Spouge JL, Berzofsky JA, to be distinguished. It can be shown that, if the classifiers De isi C (1987) Hydrophobicity scales and computational do not tend to make the same mistakes, and if each techniques for detecting amphipathic structures in proteins. classifier is right most of the time, then the accuracy can J Mol Biol 195:659-685 be enhanced by having the classifiers vote on each deDill KA (1990) Dominant forces in protein folding. Biochemistry 29:7133-7155 cision. To date, training times have only been a few Korenberg MJ (1991) Parallel cascade identification and kernel seconds on a 90-MHz Pentium, so there is some latitude estimation for nonlinear systems. Ann Biomed Eng 19:429- for use of lengthier and more elaborate training inputs, 455 and/or training several classifiers for each task. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D (1994)

The advantage of the proposed approach is that it Hidden Markov models in computational biology - applicadoes not require any a. priori knowledge about which tions to protein modeling, J Mol Biol 235:1501-1531

McLachlan AD (1993) Multichannel Fourier analysis of patterns in features distinguish one protein family from another. protein sequences. J Phys Chem 97:3000-3006 However, this might also be a disadvantage because, due Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) to its generality, it is not yet clear how close proteins of Numerical recipes in C:the art of scientific computing, 2nd edn. different families can be to each other and still be disCambridge University Press, Cambridge tinguishable by the method. Additional work will inRegelson ME (1997) Protein structure/function classification using vestigate, as an example, whether the approach can be hidden Markov models. Ph.D. Thesis, The Beckman Institute, California Institute of Technology, Pasadena used to identify new members of the CIC chloride Stultz CM, White JV, Smith TF (1 93) Structural analysis based.on channel family, and will look for the inevitable limitastate-space modeling. Protein Sci 2:305-314 tions of the method. For instance, does it matter if the Sun SJ, Parthasarathy R (1994) Protein-sequence and structure hydrophobic domains form alpha helices or beta relationship ARMA spectral-analysis - application to memstrands? What kinds of sequences are particularly easy brane-proteins. Biophys J 66:2092-2106 or difficult to classify? How does the size of a protein White JV, Stultz CM, Smith TF (1994) Protein classification by stochastic modeling and optimal filtering of ammo-acid-seaffect its classification? We began an investigation of the quences. Math Biosci 119:35-75 latter question in this paper, and it appeared that seWiener N (1958) Nonlinear problems in random theory. MIT quence length was a factor influencing the accuracy of Press, Cambridge, Mass Appendix B: Korenberg et al., 2000b, "Automatic Classification of Protein Sequences into Structure/Function Groups via Parallel Cascade Identification: A Feasibility Study", Annals of Biomedical Engineering, vol. 28, pp. 803-811

Automatic Classification of Protein Sequences into Structure/Function Groups via Parallel Cascade Identification: A Feasibility Study

MICHAEL J. KORENBERG,¹ ROBERT DAVID,² IAN W. HUNTER,² and JERRY E. SOLOMON³

'Department of Electrical and Computer Engineering, Queen's University, Kingston, Ontario, Canada, department of Mechanical

Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Room 3-154, Cambridge, MA, and

³Ccnter for Computational Biology, The Beckman Institute, California Institute of Technology, Pasadena, CA

(Received 1 My 1999; accepted 13 June 2000)

Abstract — A recent paper introduced the approach of using Proposed approaches to this problem include both linear nonlinear system identification as a means for automatically methods such as Fourier domain analysis¹⁰ and variants classifying protein sequences into their structure/function families. The particular technique utilized, known as parallel casof hidden Markov modeling.^1,9'¹² The latter approaches cade identification (PCI), could (rain classifiers on a very limhave been the most effective at automatically classifying ited set of exemplars from the protein families to be protein sequences, but in some applications'² have redistinguished and still achieve impressively good two-way clasquired large training sets (hundreds of exemplars from sifications. For the nonlinear system classifiers to have numerithe families to be distinguished) in order to obtain concal inputs, each amino acid in the protein was mapped into a corresponding hydrophobicity value, and the resulting hydrosistent results. Other times these approaches have perphobicity profile was used in place of the primary amino acid formed respectably with very little training data (e.g., see sequence. While the ensuing classification accuracy was gratitests below). Certainly, when the training sets are suffifying, the use of (Rose scale) hydrophobicity values had some ciently large, the hidden Markov model (HMM) disadvantages. These included representing multiple amino acapproaches^1-9,12 have yielded very impressive classificaids by the same value, weighting some amino acids more heavily than others, and covering a narrow numerical range, tion results over many families of protein sequences. resulting in a poor input for system identification. This paper In the present paper, we examine a recent approach,⁸ introduces binary and multilevel sequence codes to represent based on parallel cascade identification^5,6 (PCI), which amino acids, for use in protein classification. The new binary can also be used to classify protein sequences. It is not and multilevel sequences, which are still able to encode inforintended that this approach would ever replace the HMM mation such as hydrophobicity, polarity, and charge, avoid the above disadvantages and increase classification accuracy. Inmethods as the preferred classification means for managdeed, over a much larger test set than in the original study, ers of large databases. However, there are at least two parallel cascade models using numerical profiles constructed areas where the PCI method can be usefully employed. with the new codes achieved slightly higher two-way classifiFirst, it may be an aid to individual scientists engaged in cation rates than did hidden Markov models (HMMs) using the various aspects of protein research. This is because the primary amino acid sequences, and combining PCI and HMM approaches increased accuracy. © 2000 Biomedical Engineermethod can create effective classifiers after training on ing Society. [S0090-6964(00)00607-X] very few exemplars from the families to be distinguished, particularly when binary (two-way) decisions are required. This can be an advantage, for instance, to

Keyword! — Protein sequence classification, Nonlinear system researchers who have newly discovered an active site on identification, Binary sequences, SARAH codes a protein, have only a few examples of it, and wish to accelerate their search for more by screening novel se¬

INTRODUCTION quences. Second, as discussed below, the classifiers produced by the approach have the potential of being use¬

Automatic classification of protein sequences into fully employed with HMMs to enhance classification their structure/function groups, based either upon priaccuracy. mary amino acid sequence, or upon property profiles The approach in question was introduced in a recent such as hydrophobicity, polarity, and charge, has atpaper⁸ that proposed to use techniques from nonlinear tracted increasing attention over the past 15 yr.¹'⁸-¹⁰-¹²-'⁴ system identification in order to classify protein sequences into their structure/function families. The par¬

Address correspondence to Michael J. Korenberg, Department of ticular technique employed, called parallel cascade EIectric-1 and Computer Engineering, Queen's University, Kingston, identification,^5,6 was used to build binary classifiers for Ontario K7L 3N6, Canada. Electronic mail: korenber@post.qu-ensu.ca distinguishing protein sequences from the globin,

803 KORENBERG et al. calcium-binding, and kinase families. Using a single exSYSTEM AND METHODS ample of a hydrophobicity profile from each of these families for training, the parallel cascade classifiers were The protein sequence classification algorithm⁸ was able to achieve two-way classification accuracies averagimplemented in Turbo Basic on 166 MHz Pentium MMX and 400 MHz Pentium II computers. Due to the ing about 79% on a test set.⁸ manner used to encode the sequence of amino acids,

While these findings were highly promising, the intraining times were lengthier than when hydrophobicity vestigation was essentially a pilot study with a limited values were employed, but were generally only a few number (253) of test sequences. In addition, the mapping minutes long, while subsequently a sequence could be from amino acid to hydrophobicity value was determined classified by a trained model in only a few seconds or according to the Rose scale.³ Because this scale is not less. Compared to hidden Markov models, parallel casone-to-one, its use entails a loss of information about the cade models trained faster, but required about the same primary amino acid sequence, with the 20 amino acids amount of time to classify new sequences. being mapped into 14 hydrophobicity values. Moreover, Sequences from globin, calcium-binding protein, and the differing magnitudes of these values cause some protein kinase families were converted to numerical proamino acids to be weighted more heavily than others. files using the scheme set out below. The following data Finally, the values have a narrow range, from 0.52 to sets were used in our study: 0.91, making the resulting hydrophobicity profiles poor (i) The training set, identical to that from the earlier inputs for nonlinear system identification. study,⁸ comprised one sequence each from globin,

The present paper describes a more thorough and rigcalcium-binding, and general kinase families, having respective Brookhaven designations 1HDS (with 572 orous investigation of the performance of parallel casamino acids), 1SCP (with 348 amino acids), and 1PFK cade classification of protein sequences. In. particular, we (with 640 amino acids). This set was used to train a utilized more than 16,000 globin, calcium-binding, and parallel cascade model for distinguishing between each kinase sequences from the NCBI (National Center for pair of these sequences, as described in the next section. Biotechnology Information, at ncbi.alm.nih.gov) data(ϋ) The first (original) test set comprised ISO globin, base to form a much larger set for testing. In order to 46 calcium-binding, and 57 kinase sequences, which bad avoid biasing the present study toward a higher success been selected at random from the Brookhaven Protein rate, no attempt was made to search for "better" training Data Bank (now at rcsb.org) of known protein structures. sequences, and in fact only the three primary amino acid This set was identical to the test set used in the earlier sequences from the earlier study⁸ were used here for study.* creating the training inputs. (iii) The second (large) test set comprised 1016

However, instead of relying on hydrophobicity values, globin, 1864 calcium-binding, and 13,264 kinase sethe present paper introduces the use of a binary or mulquences from the NCBI database, all having distinct pritilevel numerical sequence to code each amino acid mary amino acid sequences. The sequences for this test uniquely. The coded sequences are contrived to weight set were chosen exhaustively by keyword search. As each amino acid equally, and can be assigned to reflect a explained below, only protein sequences with at least 25 relative ranking in a property such as hydrophobicity, amino acids could be classified by the particular parallel polarity, or charge. Moreover, codes assigned using difcascade models constructed in the present paper, so this was the minimum length of the sequences in our test ferent properties can be concatenated, so that each comsets. posite coded sequence carries information about the amino acid's rankings in a number of properties. Binary and Multilevel Sequence Representations

The codes cause the resulting numerical profiles for for Amino Acids the protein sequences to form improved inputs for sys¬

Hydrophobicity profiles constructed according to the tem identification. As shown below, using the new amino Rose scale³ capture significant information concerning acid codes, parallel cascade classifiers were more accustructure and/or function, enabling them to be used sucrate (85%) than were hydrophobicity-based classifiers in cessfully in protein classification.^8,12 However, as obthe earlier study,¹ and over the large test set achieved served above, the Rose scale is not a one-to-one mapcorrect two-way classification rates averaging 79%. For ping, so that different amino acid sequences can have the same three training exemplars, hidden Markov modidentical hydrophobicity profiles. Moreover, the scale els using primary amino acid sequences averaged 75% covers a narrow range of values, while causing some accuracy. We also show that parallel cascade models can amino acids to be weighted more heavily than others. be used in combination with hidden Markov models to These characteristics make the profiles relatively poor increase the success rate to 82%. inputs for nonlinear system identification. In addition, Parallel Cascade Classification of Protein Sequences software is publicly available for using hidden Markov TABLE 1. SARAHl scale. models to classify protein sequences, not by their hydro- Amino acid Binary code phobicir profiles, but rather by their primary amino acid sequences, e.g., the well-known sequence alignment and C 1,1,0,0,0 F 1,0,1,0,0 modeling (SAM) system of Hughey and Krogh.⁴ I 1,0,0,1,0

In order to have some comparison of performance V 1,0,0,0,1 when both HMM and PCI approaches receive either the L 0,1,1,0,0 primary amino acid sequences or equivalent information, W 0,1,0,1,0 a new scheme for coding the amino acids was used. The M 0,1,0,0,1 H 0,0,1,1,0 scheme equally weights each amino acid, and causes Y 0,0,1,0,1 greater variability in the numerical profiles representing A 0,0,0,1,1 the protein sequences, resulting in better inputs for sysG 0,0,0,-1,-1 tem identification. At the same time, the relative ranking T 0,0,-1,0,-1 imposed by the Rose scale can be almost completely s 0,0,-1,-1,0

R 0,-1,0,0,-1 preserved. The scheme is similar to, but more elaborate P 0,-1,0,-1,0 than, one used in recent work on distinguishing exon N 0,-1,-1,0,0 from intron DNA sequences.⁷ D -1,0,0,0,-1

In the latter problem, it was necessary to find a nuQ -1,0,0.-1,0 merical representation for the bases adenine (A), thymine E -1,0,-1,0,0 K -1,-1,0,0,0 (7), guanine (G), and cytosine (C). In work concerned ith efficiently comparing two DNA sequences via crosscorrelation, Cheever et al.² had suggested representis the negative mirror image of the top half, will be ing the bases A, T, G, C by, respectively, 1, —1, i, — i, referred to as SARAHl ("simultaneously axially and where i= V— T. Directly using these complex numbers radially aligned hydrophobicities"). In a similar scale, to represent the bases in DNA sequences would require SARAH2 in Table 2, each code again has five entries, two inputs (for the real and imaginary parts). In order to but here two of them are 0, while the other three all avoid the need for two-input parallel cascade models, equal 1 or all equal —1. this representation was modified,⁷ so that bases A, T, G, Using one of these scales to encode a protein seC in a sequence were encoded, respectively, by ordered quence yields a numerical profile five times longer than pairs (0, 1), (0, -1), (1, 0), (-1, 0). This doubled the the original sequence, but one where each amino acid is length of the sequence, but allowed use of a single real uniquely represented and equally weighted. Alternainput. Note that here purines A. G are represented by tively, either of the scales can be employed to translate a pairs of the same sign, as are pyrimidines C, T. Provided protein sequence into a profile consisting of five signals, that this biochemical criterion was met, good classification would result.⁷ Also, many other binary representations were explored, such as those using only ±1 as TABLE 2. SARAH2 sole. entries, but it was found that within a given pair, the Amino acid Binary code entries should not change sign.⁷ For example, representing a base by (1, —1) did not result in a good classifier. C 1,1,1.0,0

F 1.1,0,1,0 The same approach was applied in the present paper / 1,1.0,0,1 to develop numerical codes for representing each of the V 1,0,1,1,0 amino acids.- Thus, within the code for an amino acid, L .1.0.1,0.1 the entries should not change sign. To represent the 20 W 1,0,0.1.1 amino adds, each of the codes could have 'five entries, U 0,1,1,1,0

H 0,1,1,0,1 three of them 0, and the other two both 1 or both —1. Y 0,1,0,1,1 There are (|) = 10 such codes of each sign, so the 20 A 0,0,1,1.1 amino acids can be uniquely coded this way. Next, the G 0,0,-1,-1,-1 codes are preferably not randomly assigned to the amino T 0,-1,0,-1,-1 acids, but rather in a manner that adheres to a relevant s 0,-1,-1,0,-1

R 0,-1.-1,-1,0 biochemical property. Consequently, the amino acids P -1,0,0,-1,-1 were ranked according to the Rose hydrophobicity scale N -1,0,-1,0.-1 (breaking ties), and then the codes were assigned in D -1,0,-1,-1,0 descending value according to the binary numbers corO -1,-1,0,0,-1 responding to the codes. ε -1,-1,0,-1,0

K -1,-1,-1,0,0

The resulting scale in Table 1, where the bottom half KORENBERG et al. but this approach, which leads to use of five-input parallel cascade models,⁶ will be deferred to a future paper. The greater range covered by the new numerical profiles, compared with the (Rose scale) hydrophobicity profiles, results in improved inputs for training the parallel cascade models, and more accurate classification as shown

below. FIGURE 1. The parallel cascade model used to classify pro¬

While the above scales cany information about hytein sequences: each L is a dynamic linear element, and drophobicity, scales can similarly be constructed to imeach N is a polynomial static nonlinearity. bed other chemical or physical properties of the amino acids such as polarity, charge, -helical preference, and residue volume. Since each time the same binary codes queπces have 348 and 640 amino acids respectively and, are assigned to the amino acids, but in an order depenas the SARAHl scale is used in this paper, each amino dent upon their ranking by a particular property, the acid is replaced with a code five digits long. However, as relative significance of various factors in the protein noted above, the scale could have instead been used to folding process can be studied in this way. It is clear that create five signals, each 988 points in length, for a five- randomly assigning the binary codes to the amino acids input parallel cascade model. No preprocessing of the does not result in effective parallel cascade classifiers. In data is employed. Define the corresponding training outaddition, the codes can be concatenated to carry inforput (i) to be —1 over the calcium-binding, and 1 over mation about a number of properties. In this case, the the kinase, portions of the input [Fig. 2(a)]. We now composite code for an amino acid can have 1, -1, and 0 seek a dynamic (i.e., has memory) nonlinear system entries, and so can be a multilevel rather than binary which, when stimulated by the training input, will prorepresentation. duce the training output. Clearly, such a system would

Finally, not only could binary code representations be function as a binary classifier, and at least would be able used here and in the DNA work,⁷ but also, as shown in to distinguish apart the calcium-binding and the kinase the next section, analogous combinations of default parepresentatives. rameters for training parallel cascade models were emWe can build an approximation to such a dynamic ployable in the two studies. Moreover, the same criterion system, given the training input and output, using techfor making classification decisions could be utilized in niques from nonlinear system identification. The particuboth applications. lar technique we have used in this paper, called parallel cascade identification, is more fully explained in Refs. 5

BUILDING THE PARALLEL CASCADE and 6, and is summarized below. CLASSIFIERS Suppose that x(i), y(i), i=0,...,/, are the training input and output (in this example, /=4939). Then paral¬

The application of nonlinear system identification to lel cascade identification is a technique for approximatautomatic classification of protein sequences was introing the dynamic nonlinear system having input x(i) and duced in the earlier study.⁸ Briefly, we begin by choosoutput y(t) by a sum of cascades of alternating dynamic ing representative sequences from two or more of the linear L and static nonlinear N elements. families to be distinguished, and represent each sequence Previously, a parallel cascade model consisting of a by a profile corresponding to a property such as hydrofinite sum of dynamic linear, static nonlinear, and dyphobicity or to amino acid sequence. Then we splice namic linear (i.e., LNL) cascades was introduced by these profiles together to form a training input, and dePalm" to uniformly approximate discrete-time systems fine the corresponding training output to have a different that could be approximated by Volterra series. In Palm's value over each family or set of families that the classiparallel LNL model, the static nonlinearities were expofier is intended to recognize. nential or logarithmic functions. The dynamic linear el¬

For example, consider building a binary classifier inements were allowed to have anticipation as well as tended to distinguish between calcium-binding and kimemory. While his architecture was an important contrinase families using their numerical profiles constructed bution, Palm¹¹ did not describe any technique for conaccording to the SARAHl scale. The system to be constructing, from input/output data, a parallel cascade apstructed is shown in Fig. 1, and comprises a parallel proximation for an unknown dynamic nonlinear system. array of cascades of dynamic linear and static nonlinear Subsequently, Korenberg^5,6 introduced a parallel caselements. We start by splicing together the profiles for cade model in which each cascade comprised a dynamic the calcium-binding sequence 1SCP and kinase sequence linear element followed by a polynomial static nonlin1PFK, to form a 4940 point training input jc(i). The earity (Fig. 1). He also provided a procedure for finding input has this length because the 1SCP and 1PFK se- such a parallel LN model, given suitable input/output Parallel Cascade Classification of Protein Sequences

(b)

data, to approximate within an arbitrary accuracy in the ity of approximation, it suffices if each cascade commean-square sense any discrete-time system having a prises a dynamic linear element L followed by a static Wiener¹⁵ functional expansion. While LN cascades sufnonlinearity N, and this LN structure was used in the ficed, further alternating L and N elements could optionpresent work, and is assumed in the algorithm descripally he added to the cascades. It was also shown⁶ that tion given immediately below. any discrete-time finite memory, finite order Volterra seIn more detail, suppose that y_t(ι) denotes the residual ries could be exactly represented by a finite sum of these after the fcth cascade has been added to the model, with LN cascades, and an upper bound was obtained for the y₀(ι)=y(ι). Let z*(r) be the output of the kύi cascade. number of required cascade paths, given a Volterra funcThen, for * = 1,2,..., tional of specified memory length and order of nonlinearity.

The parallel cascade identification method^5,6 can be

(i) outlined as follows. A first cascade of dynamic linear and static nonlinear elements is found to approximate the dynamic nonlinear system. The residual, i.e., the differAssume that the output y(i) depends on input values ence between the system and the cascade outputs, is x(i),x(i- \),...,x(i— Λ), i.e., upon input lags 0,...Ji. calculated, and treated as the output of a new dynamic There are many ways to determine a suitable (discrete) nonlinear system. A cascade of dynamic linear and static impulse response function h_k(j), j= ,...Ji for the linear nonlinear elements is now found to approximate the new element L beginning the fcth cascade.^5,6 One practical system, the new residual is computed, and so on. These solution is to set it equal to either the first order cross- cascades are found in such a way as to drive the cross- correlation of the input x(ϊ with the current residual correlations of the input with the residual to zero. It can yj._ι(i), or to a slice of a higher order input/residual be shown that any dynamic nonlinear discrete-time syscrosscorrelation, with discrete impulses added or subtem having a Volterra or a Wiener functional expansion tracted at diagonal values. Thus, set can be approximated, to an arbitrary degree of accuracy in the mean-square sense, by a sum of a sufficient number of the cascades.^5,6 As mentioned above, for general_kU)= „_t.,U) (2) KORENBERG et al. if the first order input residual crosscorrelation Φ_xy. . is

FIGURE 3. Steps for classifying an unknown sequence as h_iU) = Φ_xlX3,_k__ιυ.A₁ ,A₂)± C₁ δ(j-A_i):t C₂δU-A₂), either calcium binding or kinase using a trained parallel cascade model. The MSE ratios lor calcium binding and kinase

(4) are given by Eqs. (9) and (10), respectively. where A_\ ,A_τ,C_\ ,C₂ , are defined similarly to A,C in Eq. (3). yj(i)-yϊ-ι⁽'⁾-*i⁽ . (7)

However, it will be appreciated that many other means can be used to determine the h_k(J), and that the where the overbar denotes the mean (time average) opapproach is not limited to use of slices of crosscorrela- eration: Thus, adding a further cascade to the model tions. More details of the parallel cascade approach are reduces the mean square of the residual by an amount given in Refs. 5 and 6. Once the impulse response h_k(J) equal to the mean square of the cascade output.^5,6 This, has been determined, the linear element's output u_t(ϊ) of course, by itself does not guarantee that the mean- can be calculated as square error (MSE) of approximation can be made arbitrarily small by adding sufficient cascades. However, the cascades can be created in such a way as to drive the

«*(«) = Σ h_k ⁽J)x⁽.i-j). (5) input/residual crosscorrelations arbitrarily close to zero for a sufficient number of cascades. This is the central point that guarantees convergence to the least-square so¬

In Eq. (5), the input lags needed to obtain the linear lution, for a given memory length R+ 1 of the dynamic element's output range from 0 to R, so its memory linear element and polynomial degree D of the static length is Λ + l. nonlinearity. It implies that in the noise-free case a

The signal _k{i) is itself the input to a static nonlindiscrete-time system having a Volterra or a Wiener funcearity in the cascade, which may be represented by a tional expansion can be approximated arbitrarily accupolynomial. Since each of the parallel cascades in the rately by a finite sum of these LN cascades. For a proof present work comprised a dynamic linear element L folof convergence, see Ref. 6. lowed by a static nonlinearity JV, the letter's output is the Once the parallel cascade model has been identified, cascade output we calculate its output [Fig. 2(b)] due to the training input, and also the MSE of this output from the training output for calcium-binding and kinase portions of the training input. Recall that the training output has value

— 1 over the calcium-binding portion, and 1 over the kinase portion, of the training input. Hence we compute

The coefficients a_kj defining the polynomial static non- a first MSE of the model output from -1 for the linearity JV may be found by best fitting, in the least- calcium-binding portion, and a second MSE from 1 for square sense, the output z_kU) to the current residual the kinase portion, of the training input. y -_\(i). Once the jfeth cascade has been determined, the The parallel cascade model can now function as a new residual y_k(i) can be obtained from Eq. (1), and binary classifier as illustrated in Fig. 3, via an MSE ratio because the coefficients au were obtained by best fitting, test. A sequence to be classified, in the form of its the mean square of the new residual is numerical profile *(.) constructed according to the Parallel Cascade Classification of Protein Sequences

SARAHl scale, is fed to the model, and we calculate the corresponding to the first R points of each segment corresponding output joined to form the training input.⁸ This is done to avoid introducing error into the identification due to the transition zones where the different segments of the training

*(«^■)- Σ **(«). (8) input are^' spliced together.

Using this parallel cascade identification and the appropriate training input and output created as described where K is the number of cascade paths in the final earlier, we found three classifier models, each intended model. Next, we compare the MSE of z(i) from — 1, to distinguish between one pair of families. For example, relative to the corresponding MSE for the appropriate a parallel cascade model was identified to approximate training sequence, with the MSE of z(i) from 1, again the input output relation defined by the training data of relative to the MSE for the appropriate training seFig. 2(a). The three models corresponded to the same quence. In more detail, the first ratio is assumed values for certain parameters, namely the memory length R+ 1, the polynomial degree D, the

WO+ i⁾' maximum number of cascades permitted in the model, e\ and a threshold for deciding whether a cascade's reduction of the MSE justified its inclusion in the model. To where e_\ is the MSE of the parallel cascade output from be acceptable, a cascade's reduction of the MSE, divided — 1 for the training numerical profile corresponding to by the mean square of the current residual, had to exceed calcium-binding sequence lSCP..The second ratio comthe threshold T divided by the number of output points puted is /, used to estimate the cascade, or equivalently,

Mi)- 1)²

(10) *ϊ⁽ >7-y.-ι⁽0. (H)

where e₂ is the MSE of the parallel cascade output from This criterion⁶ for selecting candidate cascades was de1 corresponding to kinase sequence 1PFK. In Fig. 3, rj rived from a standard correlation test. and r₂ are referred to as the MSE ratios for calcium The parallel cascade models were identified using the binding and kinase, respectively. Each time an MSE is Fig. 2(a) data, or on corresponding data for training computed, we commence the averaging on the (R globin versus calcium-binding or globin versus kinase + l)th point. If r_\ is less than r₂, the sequence is clasclassifiers. Each time we used the same assumed paramsified as calcium binding, and if greater, as kinase. This eter values, the particular combination of which was MSE ratio test has also. been found to be an effective analogous to that used in the DNA study.⁷ In the latter classification criterion in distinguishing exon from intron work, it was found that an effective parallel cascade DNA sequences.⁷ As noted below, an effective memory model for distinguishing exons from introns could be length R+ 1 for our binary classifiers was 125, correidentified when the memory length was 50, the degree of sponding to a primary amino acid sequence length of 25, each polynomial was 4, and the threshold was 50, with which was therefore the minimum length of the senine cascades in the final model. Since in the DNA study quences which could be classified by the models identithe bases are represented by ordered pairs, whereas here fied in the present paper. the amino acids are coded by 5-tuples, the analogous

Other decision criteria are under active investigation, memory length in the present application is 125. Also, e.g., computing the distributions of output values correthe shortest of the three training inputs here was 4600 sponding to each training input. Then, to classify a novel points long, compared with 818 points for the DNA sequence, compute the distribution of output values corstudy.⁷ Due to the scaling factor of 5/2 reflecting the responding to that sequence, and choose the training discode length change, a roughly analogous limit here is 20 tribution from which it has the highest probability of cascades in the final models for the protein sequence coming. However, only the MSE ratio criterion just disclassifiers. In summary, our default parameter values cussed was used to obtain the results in the present were memory length (R+ 1) of 125, polynomial degree paper. D of 4, threshold T of 50, and a limit of 20 cascades.

Note that, instead of splicing together only one repFigure 2(b) shows that when the training input of Fig. resentative sequence from each family to be distin2(a) is fed through the calcium-binding vs kinase classiguished, several representatives from each family can be fier, the resulting output is indeed predominately negajoined.⁸ It is preferable, when carrying out the identifitive over the calcium-binding portion, and positive over cation, to exclude from computation those output points the kinase portion, of the input. The next section con- KORENBERG et al.

TABLE 3. Correct classification rate for PCI on original test . set. SAM

% Correct % Correct % Correct Classification

Encoding Globin vs Globin vs Cal-Bind vs Mean % scale Cal-Bind Kinase Kinase correct marginal?

SARAH1 8 100 73 100 83 67 85% SARAH2 85 100 79 100 85 67 86% |— — I yes no cerns how the identified parallel cascade models perSubstitute PCI Use SAM formed over the test sets. Classification Classification

CLASSIFICATION RESULTS FOR TEST FIGURE 4. Flow chart showing the combination of SAM, which classifies using hidden Markov models, with parallel SEQUENCES cascade classification to produce the results In Table 4.

All results presented here are for two-way classification, based on training with a single exemplar from the extent to which the PCI result replaced that from SAM globin, calcium-binding, and kinase families. depended on the pair of families involved in the classification task, and ranged from 20% to 80% with an

Original Test. Set average of 60%.

The detailed results are shown in Table 3 for the SARAHl and SARAH2 encoding schemes, with correct CONCLUSION classifications by PCI averaging 85% and 86%, respecParallel cascade identification appears to have a role tively. These should be compared with a 79% average in protein sequence classification when simple two-way success rate in the earlier study⁸ where Rose scale hydistinctions are useful, particularly if lirde training data drophobicity values were instead used to represent the are available. We introduced binary and multilevel codes amino acids. so that each amino acid is uniquely represented and equally weighted. The codes enhance classification accu¬

Large Test Set racy by causing greater variability in the numerical pro¬

The detailed results are shown in Table 4 for PCI files for the protein sequences and thus improved inputs using the SARAHl encoding scheme, for the hidden for system identification, compared with using Rose Markov modeling approach using the SAM system,'' and scale hydrophobicity values to represent the amino acids. for the combination of SAM with PCI. The SARAH2 We are now exploring the use of parallel cascade idenscheme was not tested on this set. Figure 4 shows a flow tification to locate phosphorylation and ATPase binding chart explaining how the combination was implemented. sites on proteins, applications readily posed as binary When the HMM probabilities for the two families to be classification problems. distinguished were very close to each other, the SAM classification was considered to be marginal, and the PCI ACKNOWLEDGMENTS result was substituted. The cutoff used with SAM was 1 in the NLL-NULL score, which is the negative log of The authors thank the referees for their astute comthe probability of a match. This cutoff was determined ments. M.J. . and R.D. thank the real Sarah for numerusing the original test set of 253 protein sequences. The ous protein sequences.

TABLE 4. Correct classification i rate for I PCI, HMM, and REFERENCES combination on i large test set

% Correct % Correct % Correct ¹ Baldi, P., Y. Chauvin, T. Hunkapiller, and M. A. McClure. Hidden Markov models of biological primary sequence infor¬

Globiri I vs Globin vs Cal-Bind vs Mean % mation. Proc. Natl. Acad. Sci. V.SA. 91:1059-1063, 1994.

Method Cal-biπd Kinase Kinase correct ²Cheever, E. A., D. B. Searls, W. Karunaratπe, and G. C.

PCI, 78 97 77 91 66 68 79 Overton. Using signal processing techniques for DNA se¬

SARAH1 quence comparison. Proc. Northeastern Bioengineering Sym¬

HMM 100 4 85 82 43 96 75 posium, Boston, MA, pp. 173-174, 1989.

Combo 88 95 82 90 69 70 82 ³Comette, i. L„ K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky, and C. DeLisi. Hydrophobicity scales and com- Parallel Cascade Classification of Protein Sequences putational techniques for detecting amphipathic structures in Haussler. Hidden Markov models in computational biology — proteins. J. Mol. Biol. 195:659-685, 1987. Applications to protein modeling. J. Mol. Biol. 235:1501-

"Hughey, R., and A. Krogh. Sequence Alignment and model1531, 1994. ing software system, http://www.cse.ucsc.edu/research '"McLachlan, A. D. Multichannel Fourier analysis of patterns compbio/sam.html. in protein sequences. J. Phys. Chem. 97:3000-3006, 1993.

⁵ Korenberg, M. J. Statistical identification of parallel cascades "Palm, G. On representation and approximation of nonlinear of linear and nonlinear systems. Proc. 6th JFAC Symposium systems. Pan : Discrete time. Biol. Cybem. 34:49-52, on Identification and System Parameter Estimation. 1:580- 585, 1982. 1979.

⁶Korenberg, M. I. Parallel cascade identification and kernel ¹²Regelson, M. E. Protein structure function classification using estimation for nonlinear systems. Ann. Biomed. Eng. 19:429— hidden Markov models. Ph.D. Thesis, The Beckman Institute, 455, 1991. California Institute of Technology, Pasadena, CA, 1997.

'Korenberg, M. J., E. D. Lipson, and J. E. Solomon. Parallel ¹³ Stultz, C. M., J. V. White, and T. F. Smith. Structural analycascade recognition of exon and intron DNA sequences (unsis based on state-space modeling. Protein Sci. 2:305-314, published). 1993.

⁸Korenberg, M. J., J. E. Solomon, and M. E. Regelson. Par'"White, J. V., C. . Stultz, and T. F. Smith. Protein classifiallel cascade identification as a means for automatically clascation by stochastic modeling and optimal filtering of amino- sifying protein sequences into structure function groups. Biol. acid-sequences. Math. Biosci. 119:35-75, 1994. Cybern. 82:15-21, 2000. ¹⁵Wiener, N. Nonlinear Problems in Random Theory. CamKrogh, A., M. Brown, I. S. Mian, K. Sjolander, and D. bridge, MA: MIT Press, 1958.

Appendix C: sequences (M.J. Korenberg, E.D. Lipson, and J.E. Solomon, "Parallel Cascade Recognition of Exon and Intron DNA Sequences", pages 2-23

Parallel Cascade Recognition of Exon and Intron DNA Sequences

Michael J Korenberg, Edward D. Lipson, and Jerry E. Solomon

Abstract Many of the current procedures for detecting coding regions on human DNA sequences actually rely on a combination of a number of individual techniques such as discriminant analysis and neural net methods. A recent paper introduced the notion of using techniques from nonlinear systems identification as one means for classifying protein sequences into their structure/function groups. The particular technique employed, called parallel cascade identification, achieved sufficiently high correct classification rates to suggest it could be usefully combined with current protein classification schemes to enhance overall accuracy. In the present paper, parallel cascade identification is used in a pilot study to distinguish coding (exon) from noncoding (intron) human DNA sequences. Only the first exon, and first intron, sequence with known boundaries in genomic DNA from the β T-cell receptor locus were used for training. Then, the parallel cascade classifiers were able to achieve classification rates of about 89% on novel sequences in a test set, and averaged about 82% when results of a blind test were included. These results indicate that parallel cascade classifiers may be useful components in future coding region detection programs.

Key Words: Nonlinear Systems, Identification, Exons, Introns, DNA Sequences. INTRODUCTION

Automatic methods⁶'¹¹'¹⁵'¹⁷'¹⁸ of interpreting human DNA sequences, such as for detecting coding regions in gene identification, have received increasing attention over the past 10 years. One of the most successful systems, known as GRAIL, uses a neural network for recognizing coding regions on human DNA, which combines a series of coding prediction algorithms¹⁷. Indeed, many systems for coding region detection rely on combining a number of underlying pattern recognition techniques, such as discriminant analysis and neural net methods⁴. In the present paper, we show that a method for nonlinear system modeling, called parallel cascade identification⁷'⁸, can provide a further technique for distinguishing coding (exon) from noncoding (intron) DNA sequences, that may combine with existing techniques to enhance accuracy of coding region detection.

In a recent paper¹⁰ parallel cascade identification was used to build classifiers for distinguishing amongst the globin, calcium-binding, and kinase protein families. One advantage of this approach is that it requires very few examples of sequences from the families to be distinguished in order to train effective classifiers. Another advantage is that the approach taken differs significantly from currently used methods, for example those based on hidden Markov modeling.¹'¹³ Consequently, parallel cascade classifiers may tend to make their mistakes on different sequences than do hidden Markov models, so that the two approaches might well be combined to enhance classification accuracy. The same potential exists in distinguishing exons from introns, and was the motivation for the present study, which essentially tests the feasibility of applying parallel cascade identification to the latter problem.

The sequences we utilized were from the human β T-cell receptor locus, as published by Rowen, Koop, and Hood.¹⁴ Only those exons and introns were used which (a) had precisely known boundaries and (b) were located on the same strand. No attempt was made to select favorable sequences for training the parallel cascade classifier. Rather, the first intron and exon on the strand were used to train several parallel cascade models that corresponded to trial values of certain parameters such as memory length and degree of nonlinearity. Next, some succeeding introns and exons were used as an evaluation set to select the best one of the candidate parallel cascade classifiers. The selected classifier was then assessed on subsequent, novel introns and exons in a test set to measure its correct classification rate. Lastly, a blind test was carried out on a final "unknown" set of introns and exons.

As shown below, the parallel cascade model trained on the first exon and intron attained correct classification rates of about 89% over the test set. The model averaged about 82% over all novel sequences in the test and "unknown" sets, even though the sequences therein were located at a distance of many introns and exons away from the training pair. These results compare favorably with those reported in the literature⁴'⁵, and suggest that parallel cascade classifiers may be employed in combination with currently used techniques to enhance detection of coding regions.

SYSTEM AND METHODS

The exon intron differentiation algorithm used the same program to train the parallel cascade classifiers as for protein classification^9,10, and was implemented in Turbo Basic on a 166 MHz Pentium MMX. Training times depended on the manner used to encode the sequence of nucleotide bases, but were generally only a few minutes long, while subsequent recognition of coding or noncoding regions required only a few seconds or less. Two numbering schemes were utilized to represent the bases, based on an adaptation of a strategy employed by Cheever et al.² The latter authors² used a complex number representation for the bases, with the complementary pair Adenine (A) and Thymine (T) represented by 1 and -1 respectively, and complementary pairs Guanine (G) and Cytosine (C) by i and -i respectively, where i = V-f . This requires use of two signals to represent a DNA sequence, one for the real and one for the imaginary parts of the representation.

The need for two signals was avoided in the present paper by adapting the above scheme as follows. Bases A, T, G, and C were replaced with the pairs (0, 1), (0, -1), (1, 0), and (-1, 0) (the brackets are for clarity of display and were not actually used). The pairs could be assigned to the bases in the order given here, but as discussed below, a number of alternative assignments were about equally effective. The use of number pairs doubled the length of each sequence, but allowed representation by one real signal. While the scheme was effective, a disadvantage was that, given only a segment of the resulting signal, the start of each number pair might not always be clear.

Accordingly, a modification was used in which the number pairs were replaced with the corresponding triplets (0, 1, CU), (0, -1, -0.1), (1, 0, 0.1), and (-1, 0, -0J) for bases A, T, G, and C. This way, the endpoint of each triplet was always clear; however as shown below, such a representation produced very similar results to those from using number pairs. Of course, the bases could have been assigned corresponding numbers such as 2, 1, -1, -2, but this would attribute unequal magnitudes to them, and indeed the latter representation proved less effective in trials.

No preprocessing of the sequences was carried out, such as screening for interspersed and simple repeats. This has been recommended in the literature⁴ as a first step to precede all other analysis since these repeats rarely overlap coding portions of exons. Nor was any attempt made to select the sequences for the present study according to their length, although sequence length has been found to be a major factor affecting which programs for gene identification can be used.⁴ The aim was to investigate the ability of parallel cascade classifiers to distinguish exons from introns in isolation from any other recommended procedures or considerations.

Four non-overlapping sets of data were used in the present study: (i) The training set comprised the first precisely determined intron (117 nucleotides in length) and exon (292 nucleotides in length) on the strand. This intron / exon pair was used to train several candidate parallel cascade models for distinguishing between the two families, (ii) The evaluation set comprised the succeeding 25 introns and 28 exons with precisely determined boundaries. The introns ranged in length from 88 to 150 nucleotides, with mean length 109.4 and standard deviation 17.4. For the exons, the range was 49 to 298, with mean 277.4 and standard deviation 63.5. This set was used to select the best one of the candidate parallel cascade models, (iii) The test set consisted of the succeeding 30 introns and 32 exons whose boundaries had been precisely determined. These introns ranged from 86 to 391 nucleotides in length, with mean 134.6 and standard deviation 70.4. The exon range was 49 to 304 nucleotides, with mean 280.9 and standard deviation 59.8. This set was used to measure the correct classification rate achieved by the selected parallel cascade model, (iv) The "unknown" set comprised 78 sequences, all labeled exon for purposes of a blind test, though some sequences were in reality introns. They ranged in length from 18 (two sequences) to 1059 nucleotides, with mean 331.4 and standard deviation 293.5. For reasons explained in the next section, a minimum length of 23-24 nucleotides was required for sequences to be classified by the selected parallel cascade model constructed in the present paper, so only the two shortest sequences had to be excluded. The three shortest remaining sequences had lengths of 38, 49, and 50 nucleotides.

DETERMINING THE PARALLEL CASCADE CLASSIFIERS

Parallel dynamic linear, static nonlinear, dynamic linear cascade models were introduced by Palm¹² in 1979 to uniformly approximate discrete-time systems approximatable by Nolterra series. It is of interest to note that in Palm's model, the static nonlinearities were exponential or logarithmic functions. Palm¹² did not suggest a procedure for identifying the model, but his elegant paper motivated much future work. Subsequently, Korenberg^7,8 showed that, for generality of approximation, it sufficed if each cascade comprised a dynamic linear element followed by a static nonlinearity, provided that the static nonlinear elements were polynomials, and provided a general identification procedure for obtaining the model.

In the present paper, the parallel cascade models for distinguishing exons from introns were obtained by the same steps as for the protein sequence classifiers in the earlier studies.⁹'¹⁰ Briefly, we begin by converting each available sequence from the families to be distinguished into a numerical profile. In the case of protein sequences, a property such as hydrophobicity, polarity or charge might be used to map each amino acid into a corresponding value, which may not be unique to the amino acid (the Rose scale³ maps the 20 amino acids into 14 hydrophobicity values). In the case of a DΝA sequence, the bases can be encoded using the number pairs or triplets described in the previous section. Next, we form a training input by splicing together one or more representative profiles from each family to be distinguished. Define the corresponding training output to have a different value over each family, or set of families, which the parallel cascade model is to distinguish from the remaining families.

For example, when the number pairs set out above represented the bases A, T, G, and C, the numerical profiles for the first intron and exon, which were used for training, comprised 234 and 584 points respectively (twice the numbers of corresponding nucleotides). Splicing the two profiles together to form the training input x(i) , we specify the corresponding output y(i) to be -1 over the intron portion, and 1 over the exon portion, of the input (Fig. la). Parallel cascade identification was then used to create a model with approximately the input / output relation defined by the given χ(i), y(i) data. Clearly such a model would act as an intron / exon classifier, or at least be able to distinguish between the two training exemplars. An approach to creating such a parallel cascade model is described in refs. 7, 8, and was used to obtain both the protein sequence classifiers in the earlier studies⁹'¹⁰, and the intron / exon classifiers in the present work. Since the same algorithm was utilized to create the classifiers in each case, only a brief description is given below; and the interested reader is referred to the earlier publications for more details.

The type of solution required is known as "black box" identification, because we seek to identify a dynamic (i.e., has memory) nonlinear system of unknown structure, defined only by its input C(J) and output y(ι)₅ / = 0,...,/ • A simple strategy^7,8 is to begin by finding a first cascade of alternating dynamic linear (L) and static nonlinear (N) elements to approximate the given input output relation. The residue, i.e., the difference between the outputs of the dynamic nonlinear system and the first cascade, is treated as the output of a new nonlinear system. A second cascade of alternating dynamic linear and static nonlinear elements is found to approximate the latter system, and the new residue is computed. A third cascade can be found to improve the approximation, and so on.

The dynamic linear elements in the cascades can be determined in a number of ways, e.g., using crosscorrelations of the input with the latest residue while, as noted above, the static nonlinearities can conveniently be represented by polynomials.⁷'⁸ The particular means by which, the cascade elements are found is not crucial to the approach. However these elements are determined, a central point is that the resulting cascades are such as to drive the input / residue crosscorrelations to zero.^7,8 Then under noise-free conditions, provided that the dynamic nonlinear system to be identified has a Volterra or a Wiener¹⁶ functional expansion, it can be approximated arbitrarily accurately in the mean-square sense by a sum of a sufficient number of the cascades.^7,8

As noted above, for generality of approximation, it suffices if each cascade comprises a dynamic linear element followed by a static nonlinearity, and this LN cascade structure was employed in the present work. However, it will be appreciated that additional alternating dynamic linear and static nonlinear elements could optionally be inserted in any path.⁷'⁸

In order to identify a parallel cascade model, a number of basic parameters first have to be specified:

1. the memory length of the dynamic linear element beginning each cascade;

2. the degree of the polynomial static nonlinearity which follows the linear element;

3. the maximum number of cascades allowable in the model; and

4. a threshold based on a standard correlation test for determining whether a cascade's reduction of the mean-square error (mse) justified its addition to the model. A cascade was accepted provided that its reduction of the mse, divided by the mean- square of the current residue, exceeded the threshold divided by the number of output points used in the identification.

These parameters were set by identifying candidate parallel cascade models corresponding to assumed values of the parameters and comparing their effectiveness in distinguishing introns fro exons over the evaluation set. In the case of parallel cascade classifiers for protein sequences, a memory length of 25 for each linear element, and a polynomial degree of 5-7 for the static nonlinearities, were utilized.¹⁰ A 25 point memory means that the output of the linear element depends on the input at lags of 0,...,24. Since representation of each base by a number pair doubled input length, a memory length of 50 was initially tried for the linear elements. The first trial value for polynomial degree was 5.

For these choices of memory length and polynomial degree, each LN cascade added to the model introduced 56 further variables. The training input and output each comprised 818 points. However, for a memory length of 50, the parallel cascade model would have a settling time of 49, so we excluded from the identification the first 49 output points corresponding to each segment joined to form the input. This left 720 output points available for identifying the parallel cascade model, which must exceed the total number of variables introduced in the model. To allow some redundancy, a maximum of 12 cascades was allowed. This permitted up to 672 variables in the model, about 93% of the number of output data points used in the identification. While such a large number of variables is normally excessive,' there was more latitude here because of the "noise free" experimental conditions. That is, the DNA sequences used to create the training input were precisely known, and so was the training output, defined to have value -1 for the intron portion, and 1 for the exon portion, of the input as described above. Once a parallel cascade model was identified for assumed values of the basic parameters listed above, its effectiveness is gauged over the evaluation set. A DNA sequence to be classified, in the form of its numerical profile, is fed to the parallel cascade model, and the corresponding output z(ϊ) is computed. The classification decision is made using an mse ratio test.⁹ Thus, the ratio of the mse of z(ϊ) from -1, relative to the corresponding mse for the training intron profile, is compared with the ratio of the mse of z(ϊ) from 1, relative to the mse for the training exon profile. If the first ratio is smaller, then the sequence is classified as an intron; otherwise it is classified as an exon. In computing the mse values for z(ι) and for the training profiles, the averaging begins after the parallel cascade model has "settled". That is, if R+ 1 is the memory, of the model, so that its output depends on input lags 0,...,R, then the averaging to compute each mse commences on the (R+ 1 )-th point. This assumes that the numerical profile corresponding to the DNA sequence is at least as long as the memory of the parallel cascade model. As discussed in the next section, when the number pairs were used to represent the bases, a memory length of 46-48 proved effective. This means that a DNA sequence must be at least 23-24 nucleotides long to be classifiable by the selected parallel cascade model constructed in the present paper.

RESULTS FOR THE EVALUATION AND TEST SETS

When the evaluation set was used to judge candidate classifiers, it was found that an effective parallel cascade model could be trained on the first available intron and exon if the memory length R+1 was 46, and polynomial degree was 5. Using a threshold of 50 resulted in the maximum number of 12 cascade paths being chosen out of 1000 candidates. The %mse of the identified model, defined to be the mse divided by the variance of _j>(j) , times 100%, was 33.3%. Recall that the training input x(i) and output y(i) of Fig. la were used to identify the model. Figure lb shows that when the training input is fed through the identified model, the calculated output z(i) indeed tends to be negative over the intron portion, and positive over the exon portion, of the input. Moreover, the model correctly classified 22 of the 25 introns, and all 28 exons, in the evaluation set, and based on this performance the classifier was selected to measure its correct classification rate on the novel sequences in the test and "unknown" sets.

Over the test set, the model identified 25 (83%) of the 30 introns and 30 (94%) of the 32 exons, for an average of 89%. In the blind test over the "unknown" set, the model recognized 28 (72%) of 39 introns and 29 (78%) of 37 exons, a 75% average. Thus over the test and "unknown" sets, the correct classifications averaged 82%.

Once the results for the "unknown" set were available, any further tests on this set would not be blind. The "unknown" set was therefore pooled with the test set; this larger test set was then used to show that the correct classification rate was relatively insensitive to a decrease in the degree of the polynomial static nonlinearities utilized in the parallel cascade models.

For example, when the polynomial degree was decreased to 4, but all other parameter settings were left unchanged, 9 cascades were selected for the final model, leaving a %mse of 42.1% over the training intron and exon. The model performed identically to the higher degree polynomial model over the evaluation set, and the correct classification rate over the larger test set was only slightly better than it was for the latter model.

Decreasing the polynomial degree to 3 caused the resulting model to be more accurate over the evaluation set but, as discussed later, the performance did not improve significantly over the larger test set. Thus, it was discovered that "better" classifiers, as gauged over the evaluation set, could be obtained for memory length 46 when the polynomial degree was 3, if the threshold was set at 30, which resulted in only 2 cascade paths being selected out of 1000 candidates. The %mse rose to 73.2%, and as shown in Fig. Ic, the model output z(ϊ) for the training input strays much further from the "ideal" output y(ϊ) . The classification ability appeared to improve: now all 25 introns, and 27 of 28 exons, in the evaluation set are recognized.

While effective classifiers had been obtained, it was important to establish that the success was not a fortuitous result of how the number pairs had been assigned to the bases A, T, G, and C. Complementary bases had been represented by number pairs that were the "negative" of each other; e.g., A = (0, 1) and T = (0, -1). Within this limitation, there are 8 ways that the pairs (0, 1), (0, -1), (1, 0), and (-1, 0) can be assigned to bases A, T, G, and C. For each of the 8 representations, the correct classification rate over the evaluation set was determined, when the memory length was set at 46, the polynomial degree at 3, the threshold at 30, and a maximum of 12 cascades were permitted in the final model.

A biochemical criterion was found for different representations to be almost equally effective: namely, the number pairs for the purine bases A and G had to have the same "sign", which of course meant that the pairs for the pyrimidine bases C and T must also be of same sign. That is, either the pairs (1, 0) and (0, 1) were assigned to A and G in arbitrary order, or the pairs (-1, 0) and (0, -1), but it was not effective for A and G to be assigned pairs (-1, 0) and (0, 1), or pairs (1, 0) and (0, -1). In fact, the limitation to number pairs of same sign for A and G was the only important restriction. Number pairs for complementary bases did not have to be negatives of each other; for example, setting A = (0, 1), G = (1, 0), T = (-1, 0), and C = (0, -1) resulted in an effective classifier over the evaluation set.

While a memory length of 46 had proved effective, we had expected that the best memory length would be a multiple of 6, because each codon is a sequence of three nucleotides, and each base is represented by a number pair. Therefore, for various assignments of the number pairs to the bases, we investigated whether a memory length of 48 was preferable. It was found that, with the number pair assignment set out at the end of the last paragraph, the model fit over the training data improved significantly when the memory length was increased from 46 to 48. Indeed, the mse then fell to 65.9%, which was the lowest value for any of the number pair assignments tested when the memory length was 48. Again, the polynomial degree for each static nonlinearity was 3, and the threshold was 30, which resulted in four cascades being selected out of 1000 candidates. When the resulting parallel cascade classifier was appraised over the evaluation set, it recognized all 25 introns, and 27 of 28 exons, equaling the performance reported above for another classifier. However, the correct classification rate over the larger test set was about 82%, no better than it was for the models with higher degree polynomial nonlinearities.

Finally, we investigated whether following a number pair with a small value such as ± 0.1 , to indicate the endpoint of each base representation, affected classification accuracy. With A = (0, 1, 0J), T = (0, -1, -0J), G = (1, 0, 0J), and C = (-1, 0, -0J), a memory length of 72 was utilized to correspond to the representation by triplets rather than pairs, and the polynomial degree and threshold were again set at 3 and 30 respectively. Two cascades were selected out of 1000 candidates, leaving a %mse of 72.9%. The resulting classifier recognized all 25 introns, and 26 of 28 exons, in the evaluation set. This performance does not deviate significantly from that observed above for the use of number pair representations. However, an exhaustive evaluation over every possible assignment of the number triplets to the bases has yet to be carried out. DISCUSSION

This paper describes a pilot study meant to test the feasibility of using parallel cascade identification as a means for automatically distinguishing exon from intron DNA sequences. To achieve the results reported above, only the first precisely determined intron and exon available were utilized to form the input for training the parallel cascade models. The succeeding 25 introns and 28 exons with known boundaries were used as an evaluation set to discover the optimal values for four parameters that had to be preset before each model could be identified. The selected parallel cascade model attained a correct classification rate averaging about 82% on novel sequences from a test set and an "unknown" set used in a blind test.

A number of possibilities exist for improving these results and will be explored in a future paper. First, since only a single exemplar of an intron and exon was utilized to form the training input above, we will investigate whether any increase in classification accuracy ensues from training with multiple examples of these sequences. One way of doing this is to concatenate a number of intron and exon sequences together to form the training input, with tine corresponding output defined to be -1 over each intron portion, and 1 over each exon portion, of the input. Another alternative is to construct several parallel cascade classifiers, each trained using different exemplars of introns and exons, and have the classifiers vote on each classification. It can be shown that provided the classifiers do not tend to make the same mistakes, and are correct most of the time, then accuracy will be enhanced by a voting scheme.

Second, the above results were attained using only parallel cascade identification⁷'⁸, rather than a combination of several methods as is the preferred practice⁴. For example, one study⁵ used four coding measures, which were combined in a linear discriminant in order to distinguish coding from noncoding sequences. Half the successive 108 base pair windows in GenBank that were either fully coding or fully noncoding was used to determine parameter values for the coding measures. The other half was used to measure the correct classification rate of the resulting discriminant, which was found⁵ to be 88%.

The selected parallel cascade classifier in the present paper, operating alone, was about 89% correct over the test set, and about 82% over all novel sequences from the test and "unknown" sets. Thus, it appears from the present pilot project that parallel cascade identification merits consideration as one further component in a coding recognition scheme. Future work will investigate whether the different direction taken by building parallel cascade classifiers results in a tendency to make different classification errors from other methods, and if so, whether such a tendency can be exploited to create a new, more accurate combination of approaches.

REFERENCES

1. Baldi, P., Y. Chauvin, T. Hunkapiller, and M.A. McClure. Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA. 91: 1059-1063, 1994.

2. Cheever, E. A., D. B. Searls, W. Karunaratne, and G. C. Overton. Using signal processing techniques for DNA sequence comparison. Proc. Northeastern Bioengineering Symp., Boston, MA, pp. 173-174, 1989.

3. Cornette, J.L., K.B. Cease, H. Margalit, J.L. Spouge, J.A. Berzofsky, and C. DeLisi. Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J. Mol. Biol. 195: 659-685, 1987.

4. Fickett, J. W. Predictive methods using nucleotide sequences. In: Bioinformatics: a practical guide to the analysis of genes and proteins, edited by A. D. Baxevanis and B. F. F. Ouellette. New York, NY: Wiley, 1998, pp. 231-245.

5. Fickett, J. W. and C.-S. Tung. Assessment of protein coding measures. Nucl. Acids Res. 20: 6441-6450, 1992.

6. Gelfand, M. S., A. A. Mironov, and P. A. Pevzner. Gene recognition via spliced alignment. Proc. Natl. Acad. Sci. U.S.A. 93: 9061-9066, 1996.

7. Korenberg, M.J. Statistical identification of parallel cascades of linear and nonlinear systems. IFAC Symp. Ident. Sys. Param. Est. 1: 580-585, 1982.

8. Korenberg, MJ. Parallel cascade identification and kernel estimation for nonlinear systems. Ann. Biomed. Eng. 19: 429-455, 1991.

9. Korenberg, R. David, I. W. Hunter, and J. E. Solomon. Automatic classification of protein sequences into structure/function groups via parallel cascade identification. Submitted for publication. 10. Korenberg, MJ., J.E. Solomon, and .M.E. Regelson. Parallel cascade identification as a means for automatically classifying protein sequences into structure/function groups. Biol. Cybem. (in press)

11. Milanesi, L., N. A. Kolchanov, I. B. Rogozin, I. V. Ischenko, A. E. Kel, Y. L. Orlov, M. P. Ponomarenko, and P. Vezzoni. Gen View: A computing tool for protein-coding regions prediction in nucleotide sequences. In: Proc. Of the Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis, edited by H. A. Lim, J. W. Fickett, C. R. Cantor, and R. J. Robbins. Singapore: World Scientific Publishing, 1993, pp. 573-588.

12. Palm, G. On representation and approximation of nonlinear systems. Part II: Discrete time. Biol. Cybern. 34: 49-52, 1979.

13. Regelson, M.E. Protein structure/function classification using hidden Markov models. Ph.D. Thesis, The Beckman Institute, California Institute of Technology, Pasadena, CA, 1997.

14. Rowen, L., B. F. Koop, and L. Hood. The complete 685-kilobase DNA sequence of the human beta T cell receptor locus. Science 272 (5269): 1755-1762, 1996.

15. Snyder, E. E. and G. D. Stormo. Identifying genes in genomic DNA sequences. In: DNA and Protein Sequence Analysis: A Practical Approach, edited by M. J. Bishop and C. J. Rawlings. Oxford: IRL Press, 1996, pp. 209-224.

16. Wiener, N. Nonlinear Problems in Random Theory. Cambridge, MA: MIT Press, 1958.

17. Xu, Y., J. R. Einstein, R. J. Mural, M. Shah, and E. C. Uberbacher. An improved system for exon recognition and gene modeling in human DNA sequences. In: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, edited by R. FIGURE LEGEND

Figure 1 :

(a) The training input x(i) (dotted line) and output y(i) (solid line) utilized in identifying the parallel cascade model intended to distinguish exon from intron sequences. The input was formed by splicing together the numerical profiles for the first precisely known intron and exon sequence on the strand, with the bases A, T, G, C represented respectively by (0, 1), (0, -1), (1, 0), (-1, 0). The corresponding output was defined to be -1 over the intron and 1 over the exon portions of the input.

(b) The training output y(i) (solid line), and the output z(i) (dotted line) calculated when the training input in (a) stimulated the identified parallel cascade model having 5^th degree polynomial static nonlinearities. Note that the output z(i) is predominately negative over the intron, and positive over the exon, portions of the input.

(c) The training output y(j) (solid line), and the new output z(ϊ) (dotted line) calculated when the training input in (a) stimulated the identified parallel cascade model having 3^rd degree polynomial static nonlinearities. Although the output _z(i) is not as predominately negative over the intron portion as in (b), its greater variance over this portion is used in the classification decision criterion to successfully distinguish novel exons from introns. Figure 1(a)

Figure 1(b)

Figure 1(c)

Claims

I claim:

1. A method of predicting whether an unclassified sample of biological information corresponds to a first class or to a second class, comprising: (a) selecting a first known sample of biologic information, wherein the first sample corresponds to the first class;

(b) selecting a second known sample of biologic information, wherein the second sampje correspond to the second class;

(c) selecting a first representative segment from said first sample; (d) selecting a second representative segment from said second sample;

(e) combining the first and second representative segments to form a training input;

(f) defining an input/output relationship between the training input and a training output, wherein the training output has a first value over said first representative segment and a second value of said second representative segment;

(g) calculating a non-linear system to approximate the input/output relationship; and (h) applying said non-linear system to said unclassified sample to calculate a set of output values.

2. The method of claim 1 further comprising:

(i) comparing said output values to said first value and second value to determine a first difference and a second difference and assigning the unclassified sample to the class corresponding to the smaller difference.

3. The method of claim 1 further comprising: (a) calculating a first mean squared error between said output values and first value; (b) calculating a second mean squared error between said output values and said second value; (c) if said first mean squared error is less than said second mean squared error, then predicting that said unclassified samples is a member of said first class; and

(d) if said first mean squared error is greater than said second mean squared error, then predicting that said unclassified sample is a member of said second class.

4. The method of claim 1 wherein each of said first and second samples comprise a set of components and wherein steps (c) and (d) are performed by: i. comparing corresponding pairs of components from first and second samples and identifying one or more components of said first and second samples that exhibit the greatest difference; ii. selecting said identified components of said first sample as said first segment; and iii. selecting said identified components of said second sample as said second segment.

5. The method of claim 1 wherein each sample of biological information comprises a set of gene expression profiles.

6. The method of claim 4 wherein each sample of biological information comprises a set of gene expression profiles and wherein in step (i), said components are compared by calculating their absolute difference.

7. The method of claim 1 wherein each sample of biological information comprises a protein sequence.

8. The method of claim 1 wherein each sample of biological information corresponds to a nucleic acid.

9. The method of claim 1 wherein said training input is formed by concatenating said representative segments.

10. The method of claim 1 wherein each of said output values depends on a corresponding input value, advanced or lagged input values and lagged output values, wherein the advanced and lagged input values are contained within an interval of length not exceeding the length of first and second representative segments.

11. The method of claim 4 wherein said interval has a length less than the length of each representative segment.

12. A method of predicting whether an unclassified sample of biological information corresponds to a one or N classes, comprising: (a) selecting a known sample of biologic information corresponding to each of said N classes;

(b) selecting a representative segment from each of said known samples;

(c) combining said representative segments to form a training input; (d) defining an input/output relationship between the training input and a training output, wherein the training output has a distinct value over each of said representative segments; (g) calculating a non-linear system to approximate the input/output relationship; and (h) applying said non-linear system to said unclassified sample to calculate a set of output values.

13. A method for predicting the class of a patient, comprising:

(a) obtaining a diagnostic profile representative of measured amounts of a plurality of cellular constituents in one or more cells of said patient;

(b) obtaining an input signal from said diagnostic profile;

(c) applying said input signal to a mathematical model to obtain an output; and (d) using said output to make a prediction of the class of said patient.

14. A method for determining the class of a patient, comprising: (a) obtaining a diagnostic profile representative of measured amounts of a plurality of cellular constituents in one or more cells of said patient;

(b) obtaining a signal from said diagnostic profile; (c) using said signal to obtain a mathematical model corresponding to said signal, and mathematical parameters describing said model; and (d) determining, according to said mathematical model and said parameters, the class of said patient.

15. A method as in claim 15 or 16, wherein the patient has a diseased state, and wherein the prediction of the class is a diagnosis of the diseased state.

16. A method as in claim 15 or 16, wherein the patient has already been diagnosed with a diseased state, and wherein the prediction of the class is a prognosis of clinical outcome of the patient.

17. A method as in claim 18, wherein said prognosis includes a prediction of whether said patient is a candidate for a drug for the treatment of said diseased state

18. A method as in claim 15 or 16, wherein said diagnostic profile is representative of the expression levels of selected genes in the patient.

19. A method as in claim 15 or 16, wherein said diagnostic profile is representative of proteomics data.

20. A method for obtaining a mathematical model for predicting the class of gene expression profiles, given at least one profile exemplar of each class to be distinguished, comprising:

(a) comparing the exemplars of the different classes to select a plurality of genes that assist in distinguishing between the classes; (b) for each exemplar, appending in the same order expression levels, of the selected genes, from the exemplar to form a representative segment of the class of the exemplar;

(c) concatenating the representative segments of the classes to form a training input;

(d) defining an input/output relation by creating a training output having values corresponding to the input values, where the output values differ for representative segments of the input from different classes; and (e) finding a finite-dimensional system to approximate the input/output relation.

21. A method for finding a mathematical model for each class of gene expression profiles to be distinguished, given at least one profile exemplar of each of the classes, comprising:

(a) for each class, using at least one profile exemplar to obtain a training input comprising a sequence of values;

(b) for each class, obtaining a training output by shifting the input signal to advance the sequence; and (c) for each class, finding a finite-dimensional system to approximate the relation between the training input and output for the class.

22. A method as in claim 22 or 23, wherein said classes are diseases.

23. A method as in claim 22 or 23, wherein said classes are clinical outcomes for patients.

24. A method as in claim 22, wherein said order in step (b) is chosen to assist in distinguishing between the different classes.

25. A method for assigning a query gene expression profile to one of a plurality of classes, comprising:

(a) reading expression levels of selected genes in the profile that assist in distinguishing between the classes; (b) appending in a specified order the expression levels of the selected genes to form an input signal; and

(c) applying the input signal to a nonlinear system in order to obtain an output signal.

26. A method as in claim 27, wherein said order in step (b) was specified so as to assist in distinguishing between the classes.

27. A method as in claim 15 or 16, wherein said mathematical model was obtained using one of parallel cascade identification, fast orthogonal search, and iterative fast orthogonal search.

28. A method for constructing a class predictor of gene expression profiles, using at least one profile exemplar from each class to be distinguished, comprising:

(a) comparing the exemplars from the different classes to select a plurality of genes that assist in distinguishing between the classes;

(b) for each exemplar, appending expression levels of the selected genes from the exemplar to form a representative segment of the class of the exemplar;

(d) defining an input/output relation by creating a training output having values corresponding to the input values, where the output has at least one different value over each representative segment from a different class; and

(e) finding a finite-dimensional system to approximate the input/output relation.

29. A method as claimed in Claim 1 , wherein each output value of the system depends upon at most the corresponding input value, advanced or lagged input values, and lagged output values, and wherein the advanced and lagged values are contained within an interval of length not exceeding the length of each representative segment.

30. A method as claimed in Claim 2, wherein the interval has a length less than the length of each representative segment.

31. A method as claimed in Claim 2 or 3, wherein the finite-dimensional system has a memory length not exceeding the length of each representative segment.

32. A method as claimed in Claim 1 or 2, wherein the finite-dimensional system contains a plurality of cascades of dynamic linear and static nonlinear elements.

33. A method for assigning a query gene expression profile to one of several classes, comprising:

(a) reading expression levels of selected genes in the profile that most distinguish between the classes; (b) appending the expression levels of the selected genes to form an input signal; and (c) applying the input signal to a nonlinear system in order to obtain an output signal.

34. A method for constructing a class predictor of gene expression profiles, using a plurality of profile exemplars from each class to be distinguished, each profile having a set of expression levels, comprising:

(a) dividing the plurality of profile exemplars into subsets, each subset containing the expression levels of corresponding genes from the plurality of exemplars;

(b) obtaining at least one model for class prediction for each subset of expression levels; and - (c) providing a means for using the models to predict the class of a query sequence.

35. A method for predicting the class of a query gene expression profile, comprising:

(a) applying a segment representative of the query profile to a finite- dimensional system in order to obtain an output signal;

(b) applying the method described in Golub et al. (1999) to obtain "weighted votes" of a set of "informative genes" and a "prediction strength"; and

(c) using the output signal, and at least one of the weighted votes and the prediction strength, to obtain a prediction of the class of the query sequence.

36. A method as claimed in Claim 37, wherein the output signal is used to predict the class of the query profile when the prediction strength is below a predetermined threshold.

37. A method as claimed in Claim 37, wherein at least one of a mean, a variance, and a higher-order moment is calculated from the output signal, and used in combination with at least one of the weighted votes and the prediction strength to obtain a prediction of the class of the query sequence.

38. A method as claimed in Claim 37 or 39, wherein the finite-dimensional system contains a plurality of cascades of dynamic linear and static nonlinear elements, and herein the mean of the output signal is calculated, and for each class a ratio is calculated of the MSE of the output from the training output value for that class, relative to the MSE for training profiles of that class, and wherein the mean and MSE ratios are used with the weighted votes and the prediction strength to obtain a prediction of the class of the query sequence.