WO1999039174A2 - An automatic method of classifying molecules - Google Patents
An automatic method of classifying molecules Download PDFInfo
- Publication number
- WO1999039174A2 WO1999039174A2 PCT/IL1999/000057 IL9900057W WO9939174A2 WO 1999039174 A2 WO1999039174 A2 WO 1999039174A2 IL 9900057 W IL9900057 W IL 9900057W WO 9939174 A2 WO9939174 A2 WO 9939174A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- clusters
- molecules
- proteins
- cluster
- similarity
- Prior art date
Links
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1089—Design, preparation, screening or analysis of libraries using computer algorithms
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
Definitions
- the present invention relates to an automatic method and system for the identification of biological families of molecules, in particular proteins, and a method for creating an hierarchical organization within those families.
- homologous proteins almost always have the same fold (Pearson 1996). Homology is, by definition, a transitive relation: If A is homologous to B, and B is homologous to C, then A is homologous to C. This simple observation can be very effective in discovering homology. However, when applied simple-mindedly, this observation leads to many pitfalls.
- Multidomain proteins make the deduction of homology particularly difficult. If protein 1 contains domains A and B, protein 2 contains domains B and C, protein 3 contains domains C and D, then should proteins 1 and 3 be considered homologous? This simple example indicates the inadequacy of single-linkage clustering for the purpose of identifying protein families within the sequence space. Expert biologists can distinguish significant from insignificant similarities. However, the sheer size of current databases rules out an exhaustive manual computation of homologies. Our goal was to develop an automatic method for classification of protein sequences based on sequence similarity, through the detection of groups of homologous proteins (clusters), and high level structures (groups of related clusters) within the sequence space. Such organization would reveal relationships among protein families and yield deeper insights into the nature of newly discovered sequences.
- Global self organization of all known protein sequences reveals inherent bilogical signatures.
- the present invention provides an automatic method of classifying molecules having similar biologic function comprising the steps of: a) creating a hierarchical organization of said molecules in a database, wherein groups of clusters are identified using local consideration resulting in related clusters; b) determining the position of a selected molecule based on the hierarchical organization of step a, whereby selected molecules of similar biologic function are classified.
- the present invention further provides an automatic method of classifying molecules having similar biologic function comprising the steps of: a) creating a hierarchical organization of the molecules in a database comprising the steps of: i) calculating pairwise similarities between said molecules in said database by combining at least one standard measure of similarity, resulting in a first set of expectation values of similarity; ii) analyzing said first set of expectation values of similarity so as to obtain a second set of expectation values of similarity, wherein said molecules of the second set have a high degree of similarity; iii) merging said resulting molecules of step ii) so as to form clusters, wherein only molecules with an expectation value below a first restricted threshold are merged; iv) identifying groups of clusters using local consideration resulting in related clusters; v) determining the relationship between related clusters in said groups; vi) analyzing said groups of the related clusters of step v), thereby creating a hierarchical organization of said molecules; b) determining the position of a selected molecule based on
- Figures 1A-1C are graphic representations of the average distribution of e-values according to three main algorithms for sequence comparison, BLAST, FASTA and SW respectively, according to an embodiment of the invention;
- Figure 2 is a schematic tree of the hirarchical organization of small G-protein family;
- Figure 3 is a schematic tree of the hirarchical organization of ATP-binding transporter family;
- Figure 4 is a schematic tree of the super- family of motor proteins
- Figure 5 is a schematic tree of the hirarchical organization of proteins that are involved in the biosynthesis of complex sugars
- Figure 6 is a schematic alignment of two protein sequences sw:acsa.acexy and sw:ythLrhoso;
- Figure 7 is a schematic tree of the hirarchical organization of methylases and the methyltransferases
- Figure 8 is a schematic alignment of two protein sequences sw:hmt1.yeast and sw:y912.haein;
- Figure 9 is a table showing PROSITE families for which performance was below 50% true positives
- Figure 10 is a table showing correlation of releases according to an embodiment of the invention.
- Figure 11A is a table showing clusters possibly related to cluster 4 of the immunoglobulin V region, according to an embodiment of the invention
- Figure 11 B is a table showing clusters possibly related to cluster 5 of the immunoglobulins and major histocompatibility complex, according to an embodiment of the invention
- Figure 12 is a schematic illustration tracing the formation of clusters
- Figure 13 is a schematic illustartion showing the association of groups according to an embodiment of the invention
- Figure 14 is a schematic illustration of the clustering alsorithm according to an embodiment of the invention
- Figure 15 is a table showing the largest clusters at the lowest confidence level 10 " °;
- Figure 16 is a graphic representation of the correlation of BLAST e-values and SW e-values
- Figure 17 is a schematic representation of the hierarchial organization of the small
- Figure 18 is a schematic map of the immunoglobulin superfamily according to an ambodimant of the invention.
- Figure 19 is a table showing the distribution of cluster sizes at various confidence levels
- Figure 20 is a table showing performance evaluation
- Figure 21 is a table showing the Largest clusters for which there was no corresponding family in PROSITE 13 or Pfam 1.0;
- Figure 22 is a table showing clusters belonging to the immunoglobulin superfamily.
- the present invention overcomes the drawbacks of the prior art by using transitivity in a controlled way (restricted transitivity) enabling organization of all the proteins in the database in a hierarchical organization.
- the present invention applied the following methods for comparing sequences against the databases: the Smith Waterman (SW), as disclosed in Smith, T. F. & Waterman, M. S. (1981), Comparison of Biosequenoes, Adv. in Appl. Math. 2, 482-489, the FASTA, as disclosed in Lipman, D. J. & Pearson, W. R. (1985), Rapid and sensitive protein similarity, Science 227, 1435-1441 and the BLAST, as described in Altschul, S. F., Carrol, R. J. & Lipman, D. J. (1990), Basic local alignment search tool, J. Mol. Biol. 215, 403-410. These papers are incorporated by reference in their entirety herein. The results are analysis at different thresholds.
- the analysis starts at a very high and restricted threshold, where the proteins are organized in small clusters.
- the threshold is increased in a step wise manner the clusters merge to give larger clusters, subfamilies and finally families.
- a protein belongs to a certain cluster it influences and is influenced by the other members of the clusters.
- the question whether a certain cluster will merge with another cluster depends on the similarity of all of the protein that are in the cluster. There could be a situation in which cluster A contains a protein showing similarity to cluster B and therefore tends to merge with cluster B, but the other members of cluster A which do not show the same degree of similarity will prevent this merge.
- the analysis of the invention starts from a very conservative classification, based on highly significant similarities, that consists of many classes. Subsequently, classes are merged to account for less significant similarities. Merging is performed via a two phase algorithm. First, the method identifies groups of possibly related clusters using local considerations, as will be described below. Subsequently, a global test is applied to identify nuclei of strong relationships within these groups of clusters, and clusters are merged accordingly. This process is repeated at varying levels of statistical significance, obtaining a hierarchical organization of all proteins.
- the resulting classification splits the protein space into well defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were tested to assess the quality of the classification. When compared with the protein families in PROSITE and Pfam databases, the classification has detected between 64.8% and 88.5% of the proteins in these families, as well as many new clusters which do not match any protein family in these databases. The hierarchical organization reveals finer subfamilies that make up known families of proteins as well as many novel relations between protein families.
- the method of the present invention is carried out by software or software means (data) executable on computing (data processing) means, such as a computer (PC) or similar data processor, microprocessors, embedded processors, microcomputers, microcontrollers, etc.
- computing data processing means, such as a computer (PC) or similar data processor, microprocessors, embedded processors, microcomputers, microcontrollers, etc.
- the procedure starts from creating a neighbors list for each sequence in the swissprot database in each of the following methods: the Smith Waterman (SW), the FASTA and the BLAST. A numerical normalization is applied first to all methods, so they are all on comparable scales. Then, only statistical significant similarities are maintained in these lists. These lists induce agraph to obtain a hierarchical organization of all of the proteins that in the database.
- the vertices of this graph are the protein sequences. Edges between the vertices are weighted with weights that reflect the distance or dissimilarity between the corresponding sequences, i.e. high similarity translates to a small weight (or distance).
- the weight of the directed edge from A to B one compares A against all sequences in the swissprot database, and obtains the distribution of its score.
- the weight is taken as the expectation value as disclosed in Atschul, S. F., Boguski, M. S., Gish, W. G. & Wooton, J. C. (1994) Issues in searching molecular sequence databases. Nature Genetics 6, 119-129 of the similarity score between A and B, based on this distribution. This is a statistical estimate for the number of occurrences of the appropriate score at a random setup, assuming the existing amino acid composition. When the similarity score is statistically insignificant, the corresponding edge is discarded.
- an edge among sequence A and B indicates that the corresponding proteins are likely to be related.
- the weight of an edge is defined as the minimum associated to it by any of the three methods, to capture the apparently strongest relation.
- This graph encodes much of the fundamental properties of the sequence space. It reflects the idea that interesting homologies among proteins can be deduced by transitivity. The transitive closure of the similarity relation among the proteins shown by the graph, splits the space of all protein sequences into connected components or clusters. These are proper subsets of the whole database wherein every two members are either directly or transitively related. These sets are maximal in this respect and cannot be expanded. Thus they offer a self-organized classification of all protein sequences in the database. These connected components can be expected to correlate with known biological families. The results suggest that these connected components are indeed very informative and robust.
- this is a directed graph, so it is not necessarily symmetric. Specifically, it may (and does) happen that there is an edge from protein A to protein B, but none in the reverse direction. Furthermore, even if both edges exist, their weights may differ. Therefore, our notion of a component is that of a strong connected component.
- the partition into strongly connected components is thus more refined then the partition into connected components.
- This analysis can be performed at different thresholds, or confidence levels, to obtain an hierarchical organization. Several connected components of a given threshold may fuse together at a more permissive threshold. The analysis starts at the
- clusters that were found are meaningful. Some correspond to well known families, but many others correspond to less studied families. There are clusters that consist exclusively of unknown proteins or hypothetical proteins.
- Table 1 distribution of clusters by their size at each confidence level. Table 1 shows the distribution of clusters by their size at the different confidence levels. At each level, the universe of all proteins splits into connected components (clusters). These clusters become larger and coarser with the decrease of confidence levels. Consequently, the number of isolated proteins (clusters of size 1) decrease. As confidence level decrease to 10 " ° There is a sharp decline in the number of midsize clusters. Chance similarities tend to blur the picture, and cause an "avalanche", where (possibly unrelated) many families are joined to few giant clusters.
- Figs. 1A-1C These figures are based on the following main algorithms for measuring similarity between protein sequences : the BLAST, the FASTA and the SW respectively. These methods are in daily use by biologists, for comparing sequences against the databases. Though SW tends to give the best results, it is not uncommon that FASTA or BLAST are more informative as disclosed in Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. Protein Science 4, 1145-1160 therefore the results of the three systems are being incorporated in order to achieve maximum sensitivity. For example SW is being used with the BLOSUM50 scoring matrix while FASTA and BLAST are being used with the BLOSUM62 scoring matrix in order to achieve better identification of remote homologous.
- Another example is when searches are strongly biased because the amino acid composition of the query sequence differs markedly from the overall average composition.
- a case in point is the effects of low complexity segments within the sequence as disclosed in Altschul, S. F., Boguski, M. S., Gish, W. G.& Wooton, J. C. (1994). Issues in searching molecular sequence databases. Nature Genetics 6, 119-129. Therefore, the results of BLAST (which uses the SEG program) as disclosed in Wootton, J. C. & Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17, 149-163 in amino acid are being used for excluding low complexity segments from the query sequence.
- FASTA and SW are mostly due to the different scoring matrices that are being used and can be corrected by multiplying the original score by the relative entropy of the two matrices as described in Altschul, S. F. (1991 ) Amino acid substitution matrices from an information theoretic perspective. JMB 219, 555-565.
- SW and BLAST may be due to approximations in estimating the parameters ⁇ and K as described in S. Karlin & S. F. Adtsdiul. (1990).
- Figs. 1A - 1C show the average distribution of expectation values over the entire SWISSPROT database, for BLAST, FASTA and SW respectively.
- the graphs in Figs. 1A-1C naturally suggest a threshold for each method.
- the distribution drawn a log-log scale is nearly linear, at low expectation values, but starts a rapid increase at a certain value. This value is set to be the threshold.
- the thresholds for SW, FASTA and BLAST are set at 0.1 , 0.1 and 10 "3 respectively. An edge from vertex A to vertex B is maintained only if a significant score is obtained on comparing the corresponding proteins.
- SW or FASTA yield an expectation value ⁇ 0.1 or BLAST'S expectation value is ⁇ 10 "3 .
- a major difference between BLAST and SW/FASTA is that BLAST charges no gap penalties. Consequently, BLAST tends to overestimate the statistical significance of alignments.
- the method of the present invention counters this behavior of BLAST by selecting the edges asymmetrically as shown in figs. 1A-1C. While this property may help BLAST reveal significant similarities that the other methods miss (as described in Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. Protein Science 4, 145-1160) it could also bring to highly fragmentary alignments that cannot be considered biologically meaningful.
- the method of the present invention ignores those BLAST scores that come from a large number of HSPs (high scoring pairs), where the MSP (maximal segment pair) is insignificant. Finally, even if the comparisons between proteins A and B fail to satisfy the previous criteria, the edge from A to B is maintained when all three methods yield an expectation value ⁇ 1.
- Figs. 2 and 3 These figures are examples of hierarchical organization within known families. This organization is based on the information extracted while moving across the different levels of the tree which means scanning the hierarchy over all levels.
- Fig. 2 describes the small G-protein/Ras super family. This family of proteins is composed of the sub-families: ras, rab, ran, rho, ral, and smaller sub-families.
- Fig. 2 depicts the relations within this family, based on the hierarchical organization obtained by the method of the present invention. A total of 366 proteins, all belonging to the small G-protein super-family, are presented.
- Small clusters which correspond to subfamilies, are formed at the high levels of confidences, and fuse to larger clusters, when the threshold is raised.
- this family merges with the ADP ribosylation factors family and guanine nucleotide-binding proteins, all of which are GTP-binding proteins, to form one cluster.
- Fig. 3 describes the ATP-binding transporters family. Transporters are membranous elements which provide the mechanism by which components cross the lipid bilayer within and from the environment of cell compartments. The large variety of components, environmental conditions and organisms make this super-family very complex and rich. This diverged family is another example for which an hierarchical organization is proposed.
- the sub-classification distinguishes between amino-acids transporters, oligopeptide transporters, metal transporters, multi drug resistance proteins, and many more subgroups.
- 75 are hypothetical transporters which can be classified based on this organization, according to their position in the tree.
- Fig. 4 describes the hierarchy of the super-family of motor proteins.
- Each circle indicated by the numeral 100 represents a connected component at threshold 10 "35 .
- Circles' radii are proportionate to the component's size.
- the component's size appears next to the corresponding circle.
- the drawn edges appeared upon lowering the threshold to 10 "30 .
- the letters A-D indicate the order of transitivity indicated by the vertical line 101.
- Each cluster is referred to by its order of transitivity, A-D, and its position from left to right.
- Clusters which consist solely of hypothetical proteins are indicated by a second superimposed dashed circle 102 .
- Fig. 4 demonstrates how, in some cases, the connection between functionally related proteins is revealed only through the connection with hypothetical proteins.
- One such example is the connection between the myosins and the kinesins.
- FIG. 5 describes the hierarchy of proteins involved in the biosynthesis of complex sugars, where each circle represent a connected component at threshold 10 "10 , and the drawn edges appeared upon raising the threshold to 10 "5 . Indeed, a relation between those families is established thanks to the raising of the confidence level from 10 "10 to 10 ⁇ 5 .
- the connection is established via hypothetical proteins, based on weak alignments. However, the basic biological feature which characterizes all these proteins, makes the connections inevitable.
- Fig. 6 describes the alignment of the sequences sw:acsa-acexy and sw:yth1-rhosom, (which correspond to sets A1 and B2 respectively in Fig. 5).
- Fig. 6 shows 24% identity and this demonstrates the weak alignment achieved when comparing protein sequences, where there are similar biological features connecting the two sequences.
- Fig. 5 shows the connection of the sequences classified to sets A1 and B2.
- FIG. 7 describes the hierarchy of proteins involved in the methylase, methyltransferase family where each circle represents a connected component at threshold 10 "10 , and the drawn edges appeared upon raising the threshold to 10 "5 .
- This family is another example for the importance of hypothetical proteins as linkage proteins. Through such links a natural connection between related biological families is established, as demonstrated for methylasels and methyltransferases.
- the Y-axes of the graph represents 11 orders of transitivity (labeled A-K).
- A1 , B1 , B2,C1 At the bottom end (A1 , B1 , B2,C1) as well as at the top end (J2) methylases and methyltransieraser are common. Many hypothetical proteins are scattered within these clusters. Some clusters contain exclusively hypothetical proteins (E2, H1 , 12), The connection between the two ends of this graph is made through such clusters (H1.I2).
- Fig. 8 describes the alignment of sequences sw:hmt1-yeast and sw:y912-haein that correspond to sets G1 and H1 respectively, in Fig. 7.
- Fig. 8 shows 30% identity which demonstrates the weak alignment achieved when comparing protein sequences, where there are similar biological features connecting the two sequences.
- Fig. 7 shows the connection of the sequences classified to sets G1 and H The connections are based on very sparse pairwise alignments as compared with Fig. 8, and thus raise a reasonable doubt on the biological significance. Yet, proteins at the two ends of the graph of Fig. 7 exhibit close biological function, therefore verify the validity of these connections.
- HMM model derived from multiple alignment of related proteins, which is being improved iteratively by searching for more related sequences in the database. These are iteratively incorporated into the model, till convergence. In each iteration the resulting alignment is traced manually, to avoid misalignments.
- Protein based analysis can be divided into two categories.
- the first category includes methods that employ alternative representations of protein sequences, e.g. their di peptide composition (van Heel et al. 1991 , Ferran et al. 1994) or combination of compositional properties and other physical/chemical properties (Hobohm and Sander 1995).
- the alternative representations induce measures of similarity/dissimilarity between complete protein sequences. These measures can in turn, be used (van Heel et al. 1 , Ferran et al. 1994) to classify the sequences into (usually a fixed number of) clusters.
- the second category includes methods that draw directly on pairwise comparison, e.g. (Gonnet et al.
- an analysis which identifies groups of related proteins in databases of protein sequences is valuable. It may amplify the outcomes of an exhaustive pairwise comparison. When close hits are already grouped together based on their mutual similarity, this may highlight a similarity with a group which could otherwise be missed by a simple manual scanning. Moreover, if several groups are found related to the query sequence, this may indicate the existence of several distinct functional/structural domains. If all groups share the same region of similarity, this adds insights about the relations between the different groups. In some cases, this may suggest that the groups belong to the same family or superfamily. Some of this information, but not all of it, can also be inferred from comparisons with domain databases.
- domain-based databases usually offer a lot of detailed information about domains and the domain structure of proteins, through multiple alignments, and schematic representations of proteins. This information is of great value for biologists. Moreover, unlike domain based methods, protein-based methods can be applied to all protein sequences and are more easily automated.
- the present invention belongs to the second category. It draws on pairwise similarities and looks for strongly connected sets of proteins. It applies a moderate version of transitive closure, in an attempt to eliminate chance similarities and avoid indirect multiple-domain-based connections.
- clusters are merged to form bigger and more diverse clusters.
- the algorithm operates hierarchically, each step considering weaker connections in addition to the connections that are already accounted for.
- a statistical test is applied in order to identify and eliminate suspicious/problematic connections as well as possibly false connections between unrelated proteins.
- the output is thus an hierarchical organization of all protein sequences.
- the algorithm incorporates no further biological information and uses only the similarity scores that are provided by standard methods.
- the space of all protein sequences is represented as a directed graph, whose vertices are the protein sequences. Edges between the vertices are weighted with weights that reflect the dissimilarity between the corresponding sequences, i.e. high similarity translates to a small weight.
- To compute the weight of the directed edge from A to B one compares A against all sequences in the SWISSPROT database, and obtains a distribution of scores. The weight is taken as the expectation value (Altschul et al. 1994) of the similarity score between A and B, based on this distribution. This is an estimate for the number of occurrences that the appropriate score could have been obtained by chance, i.e.
- This graph has been constructed, using the common algorithms for protein sequence comparison; Smith Waterman dynamic programming method (SW) (Smith & Waterman 1981), FASTA (Lipman & Pearson 1985) and BLAST (Altschul et al. 1990).
- SW Smith Waterman dynamic programming method
- FASTA Altschul et al. 1990
- the SW algorithm was run with the BLOSUM62 matrix (Henikoff & Henikoff 1992) and gap penalties of -12,-2 using the Bioccelerator hardware (Compugen) or the 'ssearch' program which is part of the FASTA 2 package by Bill Pearson.
- FASTA was run using the 'fasta' program with the BLOSUM50 matrix (Henikoff & Henikoff 1992) and gap penalties -14,-2 (the default setting).
- the procedure starts by creating a list of neighbors for each sequence, based on all the three methods. In order to place all three methods on comparable numerical scales, a numerical normalization is applied first to all methods. Then, only statistically significant similarities are maintained in these lists. Finally, the weight of an edge is defined as the minimum associated to it by any of the three methods, to capture the apparently strongest relation.
- Fig. 16 is a graphic representation of the correlation of BLAST e-values and SW e-values.
- the fitness of the linear line is even better when the line is of the form a * x + b.
- Figs. 1A-1C shows the distribution of expectation values over the entire SWISSPROT database, for SW, FASTA, and BLAST, based on the neighbors lists of all protein sequences in the SWISSPROT database.
- the distribution may be thought as the average distribution of expectation values, for a 'typical' protein sequence as a query.
- the graphs in Figs. 1A -1C naturally suggest a threshold for each method.
- the distribution drawn in a log-log scale is nearly linear at low expectation values, but starts a rapid increase at a certain value.
- the steep slope indicates a rapid growth in the number of sequences that are unrelated to the query sequence (high expectation values).
- the distribution may differ from one sequence to another, and accordingly the threshold may change.
- the number of neighboring sequences of a single protein is usually not enough to deduce a reliable threshold. Only when the distributions are averaged then a general threshold is obtained, which can be used a s a guideline.
- the threshold was set at the value where the slope rapidly changes.
- the thresholds for SW, FASTA and BLAST are set at 0.1 , 0.1 and 10 "3 respectively.
- An edge from vertex A to vertex B is maintained only if a significant score is obtained on comparing the corresponding proteins. Namely, if either SW or FASTA yield an expectation value ⁇ 0.1 or BLAST'S expectation value is ⁇ 10 "3 .
- BLAST does not count/penalize for gaps
- This heuristic is a very good approximation to the SW algorithm, whose main advantage is in its speed. It is, however, inferior to the rigorous SW algorithm.
- the old version of BLAST occasionally detects similarities that are missed by SW (e.g. for the Glucagon precursor family, and the H'-transporting ATP synthase [Pearson, 1995]).
- BLAST defines the similarity based on one or more high-scoring segment, pairs (ungapped local alignments), and the significance is assessed by applying Poisson or sum statistics (Altschul et al. 1994).
- BLAST tends to overestimate the statistical significance of fragmented alignments, and this behavior of BLAST is countered by the above asymmetry in selecting the edges. While this property may help BLAST reveal significant similarities that the other methods miss (e.g. Pearson 1995), it is cautious to be aware of highly fragmentary alignments that cannot be considered biologically meaningful. Therefore, those BLAST scores that come from a large number of HSPs (high scoring pairs), when the MSP (maximal segment pair) is insignificant are ignored. Finally, even if the comparisons between proteins A and B fail to satisfy the previous criteria, the edge from A to ⁇ is maintained when all three methods yield an expectation value ⁇ 1.
- Multi domain proteins can create undesired connection among unrelated groups; ii) Overestimates of the statistical significance of similarity scores may bias decisions. Naturally, chance similarities become more abundant as significance levels decrease.
- transitivity should be applied restrictively.
- transitivity is to be viewed as a force that attracts sequences, then it should be countered by some
- the approach is to begin by eliminating all edges of weight below a certain, very high, significance threshold (i.e. low expectation value). This operation splits our graph to many small components of strong connectivity. In biological terms, we split the set of all proteins into numerous small groups of closely related proteins, which correspond to highly conserved sub families.
- the threshold is raised, in a stepwise manner, and more relaxed statistically significant similarities are taken into account. In so doing, several clusters of a given threshold may merge at a more permissive threshold. However, this process is closely monitored and a merge is allowed only when strong statistical evidence is found for a true connection among the proteins in the resulting set.
- the transitive closure of the similarity relation among proteins splits the space of all protein sequences into connected components or clusters. These are proper subsets of the whole database wherein every two members are either directly or transitively related. These sets are maximal in this respect and cannot be expanded. Thus they offer a self-organized classification of all protein sequences in the database.
- the threshold is set at the very stringent significance level of 10 "100 . Similarities which are reported as significant above the level of 10 ⁇ 100 are very conserved and stretch along at least 150 amino acids. Thus, neither chance similarities, nor connections based on a chain of distinct common domains in multi-domain proteins occur at this level . The resulting connected components can be safely expected to correlate with known highly conserved biological subfamilies.
- this is a directed graph, and hence is not necessarily symmetric. Specifically, it may happen that there is an edge from protein A to protein B, but none in the reverse direction (a directed graph is strongly connected if for every two vertices there is a directed path from x to y as well as from y to x.) Furthermore, even if both edges exist, their weights usually differ. Therefore, the notion of a component is that of a strongly connected component. The partition into strongly connected components is clearly more refined that the partition into connected components.
- the procedure is recursive. That is, given the classification at threshold T, a method for deriving the classification at the next more permissive level should be given, that is 10 5 - T.
- the start is from the basic classification at 10 "100 .
- Figure 14 describes the clustering algorithm.
- Phase 1) Identify pairs of clusters that are considered as candidates for merging. Decisions are made based on the geometric mean of the pairwise scores of the connections between the two clusters. If this mean exceeds a specific threshold then the cluster is accepted as a candidate, and enters a pool of candidates. Otherwise it is rejected (denoted in the figure by 'X').
- Phase II) pairwise clustering is applied to identify groups of clusters which are strongly connected. At each step the two closest groups are chosen from the pool and merged provided that the geometrical mean of all pairwise similarities pass the threshold. Otherwise they stay apart (denoted by dashed line).
- the algorithm runs in two phases. First groups (“pools") of clusters that are considered as candidates for merging are identified and marked (see Fig. 14). A local test is performed where each candidate cluster is tested with respect to the cluster which "dragged" it to the pool, to check their mutual similarity.
- the geometric mean of all pairwise scores of protein pairs one of which from P and the other from Q are calculated. Unrelated pairs are assigned the default (insignificant) e-value of 1.
- the geometric mean reflects the distribution of pairwise connections between the two clusters, such that random or unusual connections have only a little effect.
- VT more significant
- the level of confidence in the reliability of the connection clearly decreases as T increases.
- the Quality of the P - Q connection is defined as minus the log of the geometric mean. This quantity ranges between 0 and 100, and the higher it is, the more significant the connection.
- a variant of a pairwise clustering algorithm is carried out. This algorithm successively merges only pairs of clusters that pass the above test and are not suspected as representing chance or domain-based similarities.
- the two closest clusters are chosen (based on the quality of their connection) and merged if their similarity (as quantified above, and based on all pairwise similarities between the new formed clusters) is more significant than the threshold.
- the process stops and the final clusters are defined (Fig. 14) when the similarity of the next closest clusters do not pass the threshold.
- Figure 19 shows a table showing the distribution of cluster sizes at various confidence levels.
- the set of all proteins splits into clusters, which merge to form larger and coarser clusters as the confidence level decreases.
- the number of isolated proteins diminishes as well.
- the lowest significance level there are 10.602 clusters, of which 4435 contain at least 2 members. 1006 clusters have size 10 and above.
- the number of clusters (of size bigger than 1) at each level of confidence ranges between 4228 and 5543. How many clusters should there be ? A lower bound for this number is provided by the number of different folds, since it is generally expected of members of the same cluster to have similar folds. The current estimates place the number of folds between several hundreds and one thousand (Wang 1996, Chothia 1992). However, the same fold is usually adopted by few different superfamiiies which share little or no sequence similarity. Therefore, in a sequence based analysis it should probably be expected that these superfamiiies correspond to different clusters, possibly related/connected. Moreover, superfamiiies may consist of several families, sometimes with as little as few percentages of sequence identity.
- Clusters are ordered in decreasing order of size
- the order of transitivity within each cluster is defined as follows: take one of the proteins in the cluster as the cluster's seed, The seed's order of transitivity is 0. Its neighbors are of order 1. Additional proteins that are neighbors of 1st order proteins, are of order 2, etc. The definition does not depend on the choice of the seed protein, (iii) The family description states the main feature of the member proteins.
- the quality of the classification U is evaluated in terms of the reference classification A, by means of their mutual agreement .
- the two classification can be either "hard", i.e. each protein is classified to exactly one group, or "soft” where each protein can be classified to more then one group.
- Pfam and PROSITE are soft, while the present is a hard classification .
- Gracy and Argos have proposed a procedure for performance evaluation.
- Each class a E A is associated with the group b E B which maximizes the quantity tp-fp-f n.
- true positives (tp) are given by /anb/ - 1
- false positives (fp) are given by ib ⁇ al
- false negatives (fn) are given by la ⁇ bl.
- Quality is defined by the percentage of the true positives 100 - tp/tp+fp+fn.
- each group a E A is associated with one or more groups b from B, where each group b satisfies tp > fp .
- a group b E B is a relative of a group a E A if more than 50% of its members are also members of a (see figure Fig. 13).
- For each group a E A all its relatives b in B are identified. The union of all the relatives of a is denoted by b a .
- a protein is misclassified by classification B if it is a member of a missed by b a (false negative), or is a member of b a ,but not a member of a itself (false positive).
- the conjunction of b a and a defines the group of proteins which are correctly classified.
- Figure 13 schematically shows association of groups in classification B with groups in the reference classification A. Groups B I- B 4 are relatives of the ellipse (group of A), while groups B5 and B6 are not.
- the quality Qset of the classification for the group a is defined by the percentages of the true positives (proteins which were correctly classified) out of the total number of proteins in a and b a, i.e. 100.a_n_b a a U b a This measure accounts for both error types. This procedure is repeated for every group a E A, and the total percentage of true positives is given by the average overall groups a E A.
- Another quality index is also defined, that accounts for the number of clusters which are relatives of a, and decreases as this number increases. This is done by first subtracting the number of relatives of a from the number of true positives, then calculating the percentages out of the total number of proteins in a and b a , i.e. 100. la n ba - number of relatives of a + 1 a U b a This way a family of size N which is mapped to a single cluster of size N has quality of 1 (or 100%), while a family of size N which correspond to N singletons in the present classification has quality 0.
- the fourth index of quality is given by the quantity tp/(tp+fn), which is the portion of the reference family a which is in the set Rel a .
- This measure does not take into account the false positives. The reason it is used is that not all false positives are indeed false positives (as shown in the next sections). This way all false positives are assumed to be potential related sequences. Obviously, this is not true. However, it is useful as an upper bound on the quality of the classification, assuming that at least some false positives are essentially potential related sequences. It is denoted by Qupper-bound A consistency test was proposed by Krause and Vingorn (Krause & Vingorn 1998). Although their test can be used for rough self-validation, it is less useful for assessing the quality of a new classification, with respect to a reference classification.
- the present classification was compared with two domain based classifications: PROSITE and Pfam.
- the present classification contains 52,205 proteins of the SWISSPROT database, release 33, classified to 10,602 clusters (at the lowest level of the analysis), of which 1006 are of size 10 and above (see Fig 19).
- the PROSITE database release 13 (released/associated with the SWISSPROT 33 database) contains 24156 proteins, characterized by 1151 different signature patterns. It is very common in PROSITE that the same family is characterized by few signature patterns. Therefore, all patterns that are documented the same in PROSITE are considered to be the same family. The exceptions are those who never appear together in the same protein though documented the same. Also considered different (for the purpose/in the scope of assessment) are those groups that differ by more than 10 proteins.
- PROSITE e.g. INTEGRIN RETA in EGF, POU-1 and POU-2 in HOMEOBOX.
- 1151 signatures characterize 874 protein families and domains, of which 600 are of size 10 and above.
- the Pfam database, release 1.0 (associated with SWISSPROT. 33) contains 15604 proteins, classified to 175 families, of which 172 are of size 10 and above.
- the second measure Qset is found to be more appropriate. This means that no single cluster is associated with one domain family.
- the different clusters may correspond to different subfamilies, but the overall structure is still detectable through connections between clusters (see section 'possibly related clusters').
- An example is the PROSITE family AA TRNA TIGASE.I family.
- the above procedure may result in an over-strict measure, and a few factors that may suppress the performance need some careful inspection.
- One such factor is the amount of false positives.
- the supposedly false positives are actually true positives.
- short fragments which are undoubtfully identified as belonging to a specific family may be considered false positive simply since they are too short and do not share the domain which is common to all other members in the family.
- hypothetical proteins which share significant sequence similarity with a family are not necessarily false positives.
- Even proteins which are documented as belonging to a family may be counted as false positives simply since they do not have exactly the family signature pattern, but rather a slightly modified one.
- cluster 27 has 139 proteins out of which 132 has the actins and actins like signature.
- cluster 15 has 13 proteins which do not have the cytochrome p450 signature, 10 of which are variants of cytochrome p46O, 2 proteins are thromboxane-a synthase, and one is trans, -cinnaniate 4-monooxygenase., all are documented to be part of the cytochrome p45O family, and indeed, show a strong similarity with cytochromes p460. False positives of these types are very common in the present clusters. Should they be counted as false positives ? Obviously, some hits are real false positives, but there is no automatic way to detect which false positives are biologically meaningless and which are not. It can only be assumed that some false positives can be ignored.
- ran family which is a subfamily of the small G-proteins.
- the ran proteins are classified to the same cluster as the ras/ras-like/rab proteins (cluster 12).
- cluster 12 the ras/ras-like/rab proteins
- the ras/ras-like/rab proteins are not characterized by the same signature pattern (nor are they characterized by any other signature pattern in release 13.0 of PROSITE), then they are considered as false positives. While this may be correct for the ran subfamily, by definition, the real functional relationship with all the other proteins in this cluster may suggest that the definition of false positive is somewhat too strict.
- the results of the evaluation procedure with Pfam as the reference database lead to approximately the same conclusions: good performance for protein families, but short motifs are not detected well.
- a major aspect of the hierarchical organization is that clusters at a given threshold may merge at a more permissive threshold. This reflects the existence of subfamilies within a family, or families within a superfamily.
- Figure 12 is a schematic illustration tracing the formation of clusters.
- Cluster 170 the transport system permease proteins.
- the cluster splits into few subclusters.
- Each circle stands for a cluster at threshold 10 "10 .
- Circles' radii are proportionate to the clusters sizes (numbers indicate clusters' sizes).
- the drawn edges appeared upon changing the threshold from 10 ⁇ 10 to the more permissive 10 "6 .
- Edge widths are proportionate to the number of connections between the corresponding clusters.
- Cluster 170 at level 10 " ° with 46 proteins consists of transport system permease proteins. These proteins participate in multicomponent transport systems in bacteria. Specifically, they are the integral inner-membrane proteins which translocate the substrate across the membrane.
- the cluster decomposes into four subclusters at level 10 "10 , which form a clique as in Fig. 12. These smaller subclusters correspond to the lactose/maltose transport system lacG/malG, the lactose/maltose transport system lacF/malF. the phosphate transport system, and other transport systems (of sulfate, molybdenum, spermidine and putrescine).
- the subgroups of lacG/malG and lacF/malF form already at level 10 "25 .
- these proteins may be classified either to malG or to malF.
- Hierarchical organization within protein families and superfamiiies The hierarchical organization also suggests classification within known, families. This classification is suggested by scanning the hierarchy over all levels, as illustrated for the small G-protein/Ras superfamily.
- the ras gene is a member of a family that have been found in tumor virus genomes and are responsible for the viruses' carcinogenic effect. In most cases this viral oncogene is closely related to a cellular counterpart, called proto-oncogene. infection by a retrovirus that carries a mutant form of the ras gene (ras oncogene), or mutation, can cause cell transformation. Indeed, mutations in ras gene are linked to many human cancers.
- the cellular ras protein binds guanine nucleotide and exhibits a GTPase activity. It participates in the regulation of cellular metabolism, survival and differentiation.
- GTPase activity In the last decade many additional proteins that are related to ras were discovered, all of which share the guanine nucleotide binding site. They are referred to as the small-G-protein superfamily (Nuoffer & Balch 1994). This family of proteins has several subfamilies: ras, rab, ran, rho, ral, and smaller subfamilies. Like ras, these proteins participate in regulatory processes,such as vesicle trafficking (rab), and cytoskeleton organization (rho).
- Figure 17 is a schematic representation of the hierarchial organization of the small G-protein family. This family is composed of several subfamilies. A total of 229 proteins, combined in cluster 12 (level 10 " °), were grouped together into isolated sets at different levels of confidence, to form a natural subclassification of the family. This hierarchical organization is much enriched by combining possibly related clusters. The related clusters are connected by dashed lines.
- Fig. 17 the relations within this family are depicted based on the present hierarchical organization.
- AH were clustered into cluster 12 at the lowest level of significance 10 " °.
- the four main branches coincide with (I) rab subtypes (II) ras, ral and rap (III) rac, rho and cdc (cell division control proteins) (IV) ran.
- the linkage of rac to cdc and rho seems stronger than that between ran and rho or rab and rho.
- This proposed subdivision suggests a common root for all the subtypes, but splits them in a way that resembles the evolutionary tree of the small G-protein superfamily (Downward 1990).
- Cluster 29 consists of ADPribosylation factors family (APF) that are involved in vesicle budding, and of guanine nucleotide binding proteins from the sar subfamfly, whose members participate in a different type of vesicle budding. Most interestingly - G ⁇ proteins of heterotrimeric G proteins belong to the same cluster.
- ADPribosylation factors family ADPribosylation factors family
- Cluster 646 consists of the GTP-binding protein ERA and of thiophene/furan oxidation proteins (both being groups of GTP-binding proteins). This cluster and cluster 12 are related through the similarity of the thiophene/furan oxidation proteins and the ras subfamily.
- Cluster 461 GTP-binding proteins of the OBG family
- cluster 1400 hyperthetical small G-proteins
- the clustering algorithm automatically rejects many possible connections among clusters. This happens whenever the quality associated with this connection falls below a certain threshold (see Fig. 14). Many of these rejected connections are nevertheless meaningful and reflect genuine though distant homologies.
- the rejected mergers are referred to as possibly related clusters. In examining a given cluster, much insight can be gained by observing those clusters which are possibly related to it. Even though some of these connections are justifiably rejected, in particular at the lowest level of confidence (10 " °), many others do reflect structural/functional similarities, despite a weak sequence similarity.
- Based on the connections with possibly related clusters we can plot local maps (at this stage, mostly schematic) for the neighborhoods of protein families. These schematic maps can expose interesting relationships between protein families. Here we present a map for the immunoglobulins superfamily.
- cluster 4 immunoglobulin V region
- cluster 5 immunoglobulin V region
- Fig 11A shows clusters possibly related to cluster 4 (Level: 1e-0). Clusters are sorted by quality (i.e. minus the log of the geometric mean of similarity scores). Note that all clusters belong to the superfamily of immunoglobulins.
- Fig. 11A shows those clusters which are possibly related to cluster 4, ordered by their quality value. These clusters include proteins which are involved in aspects of recognition at the immune system via the variable regions.
- Fig. 11 B shows clusters possibly related to cluster 5 (Level: 1e-0). Only clusters with two members or more are shown, and are sorted by quality. Almost all clusters belong to the superfamily of immunoglobulins.
- Clusters 596 and 866 are probably unrelated (in italic).
- Clusters which are suspected to be unrelated appear in italic (one can validate the significance of
- the two sets of clusters are mostly disjoint, with the exception of cluster 1796.
- Fig. 18 is a schematic map of the immunoglobulin superfamily. Ail clusters related to clusters 4 and 5 are shown (these are referred to as directly related clusters), as well as other clusters which are indirectly related clusters (clusters that are possibly related to directly related clusters).
- each cluster is represented as a circle, whose radius is proportional to the cluster's size. Only clusters' numbers are given. See Fig. 22 for more information on these clusters.
- Clusters on the left contain an Ig V-region.
- the group on the right has an Ig C-region.
- the clusters in between share both regions (except for cluster 1468, that consists exclusively of V-region). There is also a direct connection between cluster 4 and cluster 5.
- connection is based on 226 pairwise similarities between cluster 4 and 5.
- all the similarities are due to a single protein in cluster 5, a T-cell receptor beta chain (PII364).
- PII364 T-cell receptor beta chain
- This protein has one V region aside of a C region, and the similarity with the 226 proteins in cluster 4 is limited to the V region.
- No other protein in cluster 5 has a V region. Note that despite these similarities, clusters 4 and 6 did not merge, and the connection was automatically rejected.
- Figure 22 is a table showing clusters belonging to the immunoglobulin superfamily. All clusters appear in the local map of the immunoglobulins (Fig. 18). This information depicts the 'geometry' of the protein space in the vicinity of the immunoglobulin superfamily as in Fig. 18. This map includes also indirectly related clusters, i.e. possibly related clusters of order 2 and above (related clusters of related clusters, etc). The map includes almost all protein families which belong or are related to the immunoglobulin superfamily (see Fig.
- cluster 373 isolated cluster of periplasmic pilus chaperones
- cluster 2172 isolated cluster of THY-1 membrane glycoproteins
- cluster 2363 B-lymphocyte antigen cd19.
- the present invention relates to the problem of identifying high order features and organization within the protein sequence space. It is an object of the invention to exhaustively "chart" proteins, preferably all proteins, and to automatically classify them into families. According to the invention, based on pairwise similarities, statistical regularities in the sequence space are sought, and, more concretely, clusters of protein sequences. According to the invention, in this high order organization most groups consist of sequences with a specific biological function or structure.
- a complete charting of the protein space is a daunting task and many difficulties are encountered.
- One preferably begins from well established statistical measures, in order to identify significant similarities.
- Great caution and biological expertise are needed to exclude connections which are unacceptable or misleading.
- the main difficulties stem from chance similarities among sequences and multi-domain-based connections.
- the sheer volume of data makes it necessary to develop automatic methods for such identification.
- the present invention addresses these major problems. According to the invention, one starts by creating, for each protein sequence, an exhaustive list of neighboring sequences. These lists take into account the scores of the three major methods for pairwise sequence comparisons. These three types of scores are jointly normalized and the lists are filtered. The link is maintained in the lists only when a significant relationship seems to exist. Then a two-phase clustering algorithm is applied to identify groups of related sequences.
- the method of the present invention employs an algorithm that can be described as a moderate version of the transitive closure algorithm. At each round of this process statistical information is gained on the relationships among current clusters. This information is then used to merge certain clusters, thus forming the next round of larger, coarser clusters.
- the algorithm starts from a very conservative classification, and is repeatedly applied, at varying levels of confidence, the input for each stage being the classification output at the previous stage.
- the classification consists of several thousand clusters, the largest of which contain several hundred members each. It is interesting to assess the effect of slowing down the transitive closure algorithm. Indeed, a straight forward application of the transitive closure (a.k.a. single-linkage clustering) algorithm leads to an avalanche as discussed above (details not shown). Already at confidence level 10 "20 , most of the space is made up of a small number of very large clusters. This avalanche is caused by chance similarities and chains of domain-based connections that cause unrelated families to merge into few giant clusters. A major ingredient of the new algorithm is the choice of rules for avoiding such undesirable incorrect connections and for preventing this collapse, and indeed most of them were eliminated. Some domain-based connections did escape the filters, and caused unrelated clusters to merge.
- the algorithm employed in the method of the present invention has turned out well defined groups which are strongly correlated with protein families and subfamilies when compared with the well accepted databases PROSITE and Pfam, the classification performed well for most of the families, though many of them were domain based rather than protein families. Many new clusters in the classification didn't have a match from either PROSITE nor Pfam.
- the hierarchical organization can indicate the existence of subfamilies within families, and the concept of possibly related clusters exposes distant relationships which reflect functional similarities. These relations offer a basis for sketching local maps near protein families. These connections can be of biological interest, and should be taken into account in the study of protein families. For example, an overall view of Fig. 17 or Fig. 18 implies that tracing the hierarchical organization of a cluster and/or its rejected merges can provide new information.
- Another aspect of the possibly related clusters is that this list can be viewed as a "soft clustering", where the same protein can participate in several different clusters, Proteins that are composed of several domains, some of which shared by proteins from different families, are naturally associated with more than one group. Therefore, though in one embodiment of the invention a multi-domain protein is classified to a single cluster, its multi trait feature is tractable if the relations with the other clusters are examined as well.
- Bairoch & Boeckman 1992 Bairoch, A Boeckman, B. (1992) The S ISS-PROT protein sequence data bank Nucl Acids Res 20, 2019-2022.
- Harpa ⁇ k Chothia 1994 Harpez, Y. k Chothia, C. (1994) Many of the immunoglobulin superfamily domains in cell adhesion molecules and surface receptors belong to a new structural set which ⁇ close to that containing variable domains. J. Mol. Biol. 238, 528-539.
- Needleman k Wunsch 1970 Needle an, S B. k Wunsch. C. D. (1970) A general method applicable to the search for similarities in the ammo acid sequence of two proteins. J. Mol. Biol 48, 443-453.
- Tatusov et al. 1997 Tatusov. R. L , Eugene, V. K, k David, J. L. (1997). A genomic perspective on protem families science 278, 631-637
- Watanabe k Otsu a 1995 Watanabe, H. k Otsuka, J. (1995). A comprehensive representation of extensive similarity linkage between large numbers of proteins Comp. App. Biosci. 11(2), 159-166.
Landscapes
- Chemical & Material Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Organic Chemistry (AREA)
- Health & Medical Sciences (AREA)
- Zoology (AREA)
- General Engineering & Computer Science (AREA)
- Wood Science & Technology (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biomedical Technology (AREA)
- Microbiology (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Biochemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Plant Pathology (AREA)
- Immunology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Saccharide Compounds (AREA)
Abstract
Description
Claims
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU22970/99A AU2297099A (en) | 1998-01-29 | 1999-01-29 | An automatic method of classifying molecules |
JP2000529585A JP2002502064A (en) | 1998-01-29 | 1999-01-29 | Automatic method of classifying molecules |
EP99902797A EP1053460A4 (en) | 1998-01-29 | 1999-01-29 | An automatic method of classifying molecules |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US7297798P | 1998-01-29 | 1998-01-29 | |
US60/072,977 | 1998-01-29 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO1999039174A2 true WO1999039174A2 (en) | 1999-08-05 |
WO1999039174A3 WO1999039174A3 (en) | 1999-10-07 |
Family
ID=22110954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IL1999/000057 WO1999039174A2 (en) | 1998-01-29 | 1999-01-29 | An automatic method of classifying molecules |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1053460A4 (en) |
JP (1) | JP2002502064A (en) |
AU (1) | AU2297099A (en) |
WO (1) | WO1999039174A2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003072701A1 (en) * | 2002-02-28 | 2003-09-04 | Istech Co., Ltd. | A system for analyzing dna-chips using gene ontology and a method thereof |
WO2014197997A1 (en) * | 2013-06-14 | 2014-12-18 | University Of Guelph | Systems, methods, and computer program products for merging a new nucleotide or amino acid sequence into operational taxonomic units |
CN112988698A (en) * | 2019-12-02 | 2021-06-18 | 阿里巴巴集团控股有限公司 | Data processing method and device |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5007803B2 (en) * | 2007-03-09 | 2012-08-22 | 独立行政法人農業生物資源研究所 | Gene clustering apparatus, gene clustering method and program |
JP5278388B2 (en) * | 2010-07-06 | 2013-09-04 | ソニー株式会社 | Visualization method, visualization apparatus, and information storage medium related to biological material information |
US20210012858A1 (en) * | 2018-03-16 | 2021-01-14 | Kotai Biotechnologies, Inc. | Effective clustering of immunological entities |
JP6500144B1 (en) * | 2018-03-28 | 2019-04-10 | Kotaiバイオテクノロジーズ株式会社 | Efficient clustering of immune entities |
-
1999
- 1999-01-29 EP EP99902797A patent/EP1053460A4/en not_active Withdrawn
- 1999-01-29 JP JP2000529585A patent/JP2002502064A/en not_active Withdrawn
- 1999-01-29 WO PCT/IL1999/000057 patent/WO1999039174A2/en not_active Application Discontinuation
- 1999-01-29 AU AU22970/99A patent/AU2297099A/en not_active Abandoned
Non-Patent Citations (4)
Title |
---|
ALTSCHUL S.F. et al., "Issues in Searching Molecular Sequence Databases", NATURE GENETICS, February 1994, Vol. 6, pages 119-129, XP002920457 * |
GOTOH O., "Optimal Alignment Between Groups of Sequences and its Application to Multiple Sequence Alignment", CABIOS, 1993, Vol. 9, No. 3, pages 361-370, XP002920458 * |
LIPMAN D.J. et al., "Rapid and Sensitive Protein Similarity Searches", SCIENCE, 22 March 1985, Vol. 227, pages 1435-1441, XP002920456 * |
See also references of EP1053460A2 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003072701A1 (en) * | 2002-02-28 | 2003-09-04 | Istech Co., Ltd. | A system for analyzing dna-chips using gene ontology and a method thereof |
WO2014197997A1 (en) * | 2013-06-14 | 2014-12-18 | University Of Guelph | Systems, methods, and computer program products for merging a new nucleotide or amino acid sequence into operational taxonomic units |
CN112988698A (en) * | 2019-12-02 | 2021-06-18 | 阿里巴巴集团控股有限公司 | Data processing method and device |
Also Published As
Publication number | Publication date |
---|---|
WO1999039174A3 (en) | 1999-10-07 |
JP2002502064A (en) | 2002-01-22 |
AU2297099A (en) | 1999-08-16 |
EP1053460A2 (en) | 2000-11-22 |
EP1053460A4 (en) | 2003-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yona et al. | ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space | |
US20200143944A1 (en) | Methods and compositions for classification of samples | |
US7729864B2 (en) | Computer systems and methods for identifying surrogate markers | |
US20060111849A1 (en) | Computer systems and methods that use clinical and expression quantitative trait loci to associate genes with traits | |
Dumontier et al. | Armadillo: domain boundary prediction by amino acid composition | |
Schneider et al. | A DNA structural alphabet provides new insight into DNA flexibility | |
Jiang et al. | An approach for prioritizing disease-related microRNAs based on genomic data integration | |
WO2013142939A1 (en) | Methods for predicting and classifying event outcomes | |
WO1999039174A2 (en) | An automatic method of classifying molecules | |
Raes et al. | Gene duplication, the evolution of novel gene functions, and detecting functional divergence of duplicates in silico | |
Kelow et al. | A penultimate classification of canonical antibody CDR conformations | |
WO2021050760A1 (en) | Systems and methods for pairwise inference of drug-gene interaction networks | |
Mao et al. | Evaluation of inter-laboratory and cross-platform concordance of DNA microarrays through discriminating genes and classifier transferability | |
WO2023277932A1 (en) | Detection of human leukocyte antigen loss of heterozygosity | |
Stefanini et al. | Yeast as a Model in High‐Throughput Screening of Small‐Molecule Libraries | |
Parks et al. | Microba’s community profiler enables precise measurement of the gut microbiome | |
Uslan et al. | The quantitative prediction of HLA-B* 2705 peptide binding affinities using support vector regression to gain insights into its role for the spondyloarthropathies | |
Thompson | Bioinformatic Analysis of Complex, High-Throughput Genomic and Epigenomic Data in the Context of CD4+ T-cell Differentiation and Diagnosis and Treatment of Transplant Rejection | |
Dunham | Modelling the structural, functional and phenotypic consequences of protein coding mutations | |
Pearngam | Improvement of selection criteria and prioritisation for neoantigen prediction | |
Song et al. | Applying multi-state modeling using AlphaFold2 for kinases and its application for ensemble screening | |
Baek et al. | Comparison of transcriptomic and phenomic profiles for the prediction of drug mechanism | |
Anjum et al. | Segmentation of genomic data through multivariate statistical approaches: comparative analysis | |
Chi | Statistical and Computational Methods in Epidemiological and Pharmacogenomic Studies: from Application to Method Development | |
Ramzan | Deciphering the genetic background of quantitative traits using machine learning and bioinformatics frameworks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
NENP | Non-entry into the national phase in: |
Ref country code: KR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1999902797 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 1999902797 Country of ref document: EP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 09601278 Country of ref document: US |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1999902797 Country of ref document: EP |