WO2000036489A9 - Pattern recognition oriented cluster analysis - Google Patents

Pattern recognition oriented cluster analysis

Info

Publication number
WO2000036489A9
Authority
WO
WIPO (PCT)
Prior art keywords
objects
group
compound
pair
compounds
Prior art date
Application number
PCT/US1999/030175
Other languages
French (fr)
Other versions
WO2000036489A2 (en)
WO2000036489A3 (en)
Inventor
Xiaofeng Yang
Molly B Schmid
Jianpeng Shi
Donald Beik
Original Assignee
Microcide Pharmaceuticals Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microcide Pharmaceuticals Inc
Priority to AU25902/00A (patent AU2590200A)
Publication of WO2000036489A2
Publication of WO2000036489A3
Publication of WO2000036489A9

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Definitions

  • the field of this invention relates to chemistry, biochemistry, statistical analysis and computer science.
  • it relates to methods and apparatus for discerning characteristic similarities among groups of objects; more particularly, similarities among chemical compounds based on biological activity.
  • Chemical space is a term often used to describe the universe of all possible chemical compounds.
  • the size of theoretical "chemical space” is enormous.
  • the number of potential compounds comprising chemical space is far larger than the number of compounds that can realistically be hoped to be synthesized.
  • Compounding the situation is the fact that small changes in molecules can often have profound effects on biological activity. Thus, finding that a particular molecule located at a particular point in chemical space is not biologically active does not necessarily mean that the entire region of chemical space around that molecule is likewise inactive.
  • Within the universe of chemical space lies the sub-universe of "protein space." Again, in theory, protein space is also very large. However, practically speaking, the size of protein space that is used by living organisms is relatively modest. Genomic studies suggest that, across the five kingdoms of living organisms (animals, plants, fungi, bacteria and algae), there are only between twenty and fifty protein "super families," i.e., families of proteins containing amino acid sequences with significant similarities to sequences in other proteins. The implication of this is that the regions of sequence similarity should have similar folding patterns and, therefore, are likely to be evolutionarily related.
  • COMPARE is a computer program developed for the National Cancer Institute. COMPARE correlates similarities among growth inhibition patterns of chemical compounds. Compounds are first tested against sixty human cancer cell lines in disease-specific panels. The growth inhibition patterns of the tested compounds are then subjected to COMPARE analysis, which generates predictions about similarities, usually with regard to mechanism of action or molecular target. See, e.g., K. D. Paull, et al., Journal of the National Cancer
  • In COMPARE, a "seed compound," i.e., a compound of known activity, is tested against the disease-specific panels, and a "fingerprint" of the compound's activity, known as a "mean graph," is generated. The compounds to be compared are then also tested against the disease-specific panels and their respective mean graphs generated. COMPARE then ranks the compounds according to their mechanism of action or molecular target by calculation of Pearson correlation coefficients. COMPARE treats only one compound/target protein pair at a time and only compares raw data, i.e., a value corresponding to the activity of one compound against a panel of biological targets, with similar raw data of another compound.
  • Cluster analysis involves procedures for taking a group of objects and “clustering" them in subgroups which differ from other subgroups in meaningful ways.
  • For chemical compounds, a "meaningful way" that subgroups can be formed from a large group of chemical compounds is by their performance properties such as, without limitation, mode of action against biological targets.
  • The table 50 in Figure 1 comprises three objects, A 10, B 20 and C 30. Each object has interacted with each of four operators, M1 5, M2 15, M3 25 and M4 35. In table 50, each object-operator pair has an associated raw data value 40, representing the interaction between each object and each operator.
  • By simply mapping each of the object-operator pair raw data values, as shown in Figure 2, the graph 60 for object A 10 appears similar to the graph 70 for object B 20, and dissimilar to the graph 80 for object C 30.
  • On this raw-data view, objects A 10 and B 20 appear similar, and object C 30 appears dissimilar.
  • However, object A 10 and object C 30 appear nearly identical when their object-operator pairs are graphed using similarity-scaled axes.
  • Graph A 90 and graph C 94 have a virtually identical shape in Figure 3, while graph B 92 is clearly dissimilar to both graphs A 90 and C 94.
  • Thus, without similarity scaling, a false similarity is determined between objects A 10 and B 20, while a true similarity between object A 10 and object C 30 is left undiscovered.
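The contrast between Figures 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values for A, B and C are invented (the contents of table 50 are not given here), Euclidean distance stands in for the unscaled comparison, and Pearson correlation stands in for the similarity-scaled comparison.

```python
import math

# Hypothetical raw values for objects A, B, C against operators M1..M4.
# These numbers are illustrative only; they are not taken from Figure 1.
A = [10.0, 20.0, 15.0, 30.0]
B = [12.0, 14.0, 13.0, 12.0]
C = [100.0, 200.0, 150.0, 300.0]  # same shape as A, ten times the magnitude


def euclidean(x, y):
    """Distance between two raw profiles; small means 'looks similar' on raw axes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation; compares the shape of two profiles, ignoring scale.

    Assumes each profile has non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Raw view: A sits closer to B than to C (the "false similarity").
print(euclidean(A, B) < euclidean(A, C))
# Scaled view: A correlates with C far better than with B (the "true similarity").
print(pearson(A, C) > pearson(A, B))
```

With these invented values, the raw comparison pairs A with B, while the correlation pairs A with C, which mirrors the situation described for Figures 2 and 3.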
  • the results are generally displayed as dendrograms, or tree graphs.
  • dendrograms are prepared by using auxiliary methods such as Ward's method and a Euclidean distance metric. See, for example, Bates, J. Cancer Res. Clin. Oncol., 1995, 121:495, 497, Figure 2.
  • Figure 4 is an example of a simple dendrogram 100.
  • the resultant analysis is limited by the fact that the information contained within the dendrogram 100 is not presented in a manner optimally conducive to direct, i.e., visual, human interpretation of the interrelationships among the objects.
  • One problem with this hierarchical method is when to stop, i.e., how to determine when clusters revealing optimal interrelationships between compounds have been generated.
  • with the agglomerative approach, all data points are eventually joined into one large cluster; with the divisive approach, all data points eventually end up as individual point clusters.
  • Another problem specific to the hierarchical method is that once an object is allocated to a particular cluster, that allocation, whether or not it is an optimal allocation, is irrevocable; that is, once an object joins a cluster it is never removed from that cluster and fused, or otherwise joined, with objects belonging to some other cluster.
  • the partitioning method of cluster analysis does not require that allocation of an object to a cluster be irrevocable. That is, objects may be reallocated if their initial assignments are found to be inaccurate, or otherwise unsatisfactory.
  • the partitioning method suffers from the general requirement that the number of final clusters be known and specified in advance; this limitation is a serious shortcoming if the goal is in fact to determine how many clusters exist in a particular group.
  • What is needed is a method that extracts from raw data the maximum amount of information available regarding the interrelationships among all objects in a particular group, including classification of the objects into correct subgroups, while minimizing the misclassification of objects into wrong subgroups. Such a method should operate quickly, accurately and dynamically, i.e., it should permit reallocation of objects from cluster to cluster, as the partitioning technique does but the hierarchical technique does not, without requiring foreknowledge of the final number of clusters, as the partitioning technique does but the hierarchical technique does not. It would be further advantageous if the method were fully user interactive, scalable, flexible and robust. The present invention provides such a method.
  • the present invention relates to a method for evaluating sets of random objects and clustering, or otherwise grouping, them into groups based on predefined characteristic(s) of the objects.
  • the invention relates to a method and apparatus for evaluating a group of chemical compounds and clustering them into sub-groups based on similarity of biological activity.
  • bacterial mutant strains are employed to evaluate a group of chemical compounds. The raw data obtained is assessed by the pattern recognition cluster analysis methods and apparatus of this invention to reveal compounds having similar biological activity.
  • a method for evaluating the similarity of biological activity in a group of chemical compounds using a panel of organisms having disparate gene expressions is yet another presently preferred embodiment of this invention.
  • the ability of the chemical compounds to affect the expression of the different products of the various genes is detected and provides a measure of the similarity of biological activity of the compounds within the group being tested.
  • by "group of compounds" is meant as few as two compounds to as many as 10^9 compounds.
  • by "similar biological activity" is meant compounds having a similar pattern of activity against a panel of biological targets, such as, for example and without limitation, a panel of proteins.
  • a similar pattern of activity may be manifested simply as a plus or minus; i.e., either a chemical has an effect against a protein in the panel or it does not.
  • two compounds display a "similar biological activity” if they are plus against the same proteins in a panel of proteins and minus against the same proteins in the panel.
  • a similar pattern of activity may also be a more complex similarity relating to such things as, without limitation, the biochemical target of the effect, the manifestation of the effect or the amount of the effect.
  • an effect may be manifested by, again without limitation, a change in cell phenotype or a change in the ability of a protein to perform its biological function.
  • "Bacterial mutant strain" or "mutant strain" refers to a strain of bacteria in which biochemical activity has been modified such that the bacteria exhibit a diminished or an enhanced level of activity with regard to a selected parameter when compared to normal bacteria of the same species. Examples of such parameters are, without limitation, the ability of bacteria to grow at different temperatures, at different pHs, in the presence of different nutrients, etc.
  • a change in the level of activity in the presence of a chemical being tested indicates that the chemical is affecting the biomolecule either directly or indirectly; i.e., by interacting with the biomolecule itself or by interacting with another molecule on which the biomolecule relies to perform its function.
  • by "gene expression" is meant a process by which a living organism manufactures chemical products under the direction of a gene.
  • Gene expression can either be a wild type, i.e., the manufacture of a chemical produced by the organism in its natural state, or it may be engineered.
  • by "engineered" is meant that the genome of the organism is altered such that a gene which expresses a non-natural (for that species) chemical is incorporated into the genome. Examples, without limitation, of genes which may be engineered into an organism's genome and which express chemicals that are readily detectable are the lux gene, which expresses the enzyme luciferase, and the cat gene, which expresses the enzyme chloramphenicol acetyl transferase.
  • gene expression also refers to an environment containing organisms which harbor the genes and which express selected chemicals. While growth inhibition of mutant strains and gene expression are presently preferred embodiments of this invention, it is understood that numerous other indications of similar biological activity including, but not limited to, other biochemical assays, other whole cell assays and the like are within the spirit and scope of this invention.
  • where two or more objects of a set of objects are grouped into one or more groups, in a presently preferred embodiment an object similarity score is generated for each pair of objects in the set of objects to be grouped, or clustered. Two or more objects are then assigned, or grouped, into one or more groups of objects.
  • the criteria for assignment of an object to a particular group are the object similarity scores generated for the pairs of objects to be clustered.
  • the objects of an established group are ordered.
  • the criteria for ordering the objects of a group are the object similarity scores for the pairs of objects of the group.
  • a group similarity score is generated for each pair of groups of objects.
  • the groups of objects are then ordered, based on the group similarity scores.
  • a pattern matrix is generated for a set of groups of objects.
  • the generated pattern matrix provides a visual representation of the grouping and similarity, or relative similarity, of the grouped objects represented in the respective matrix.
  • a general object of the invention is to provide a method and apparatus for optimally clustering, or grouping, a group of objects.
  • the grouping is based on the objects' similarity scores with each other.
  • the invention provides a method and apparatus for optimally clustering a group of compounds, based on the similarity of the compounds' interactions with various bacterial mutant strains or gene expressions.
  • a further general object of the invention is to provide a method and apparatus for displaying the results of a clustering of groups of objects.
  • a pattern matrix is generated which provides a visual representation of both the grouping and the similarity, or relative similarity, of the grouped objects represented in the respective matrix.
  • the methods and apparatus disclosed are in fact multi-dimensional. That is, clustering of objects can be performed with relation to more than one environment. This can be accomplished using a three-dimensional analysis, in which the pattern matrix generated will likewise be three-dimensional.
  • the X-axis of a plot, or pattern matrix, could be chemical compounds
  • the Y-axis could be chemical compounds
  • the Z-axis could be respective correlation values.
  • the clusters of objects could be visualized as peaks wherein the strength of correlation would be indicated by the height of the peak.
  • Table 1 depicts a presently preferred embodiment of groups of correlation ranges.
  • Table 2 depicts a presently preferred embodiment of a default correlation value range and a default lower correlation value for each of a default number of color groups.
  • Figure 1 depicts an exemplary object-operator table.
  • Figure 2 is a representative graph of the object-operator interrelationship from the table of Figure 1, without any correlation or similarity measurement scaling.
  • Figure 3 depicts representative graphs of the object-operator interrelationships from the table of Figure 1 with similarity measurement scaling.
  • Figure 4 depicts an exemplary dendrogram of a resultant cluster analysis.
  • Figure 5 depicts a presently preferred embodiment of a pattern recognition oriented cluster method.
  • Figure 6 depicts a presently preferred embodiment of a cluster process.
  • Figure 7 depicts a presently preferred embodiment exemplary pattern matrix output.
  • Figure 8 depicts a presently preferred embodiment pattern recognition oriented cluster method processing flow.
  • Figure 9 depicts a presently preferred embodiment input data processing flow.
  • Figures 10A and 10B depict a presently preferred embodiment input data processing flow.
  • Figure 11 depicts an exemplary temporary array of raw data for compound - mutant strain pairs.
  • Figure 12 depicts an exemplary DATA array of correlation values for n compounds.
  • Figure 13 depicts a presently preferred embodiment calculate correlation value processing flow.
  • Figure 14 depicts a presently preferred embodiment determine correlation distribution processing flow.
  • Figure 15 depicts a presently preferred embodiment set grouping parameters processing flow.
  • Figure 16 depicts a presently preferred embodiment group compounds processing flow.
  • Figures 17A and 17B depict exemplary correlation value tables.
  • Figure 18 depicts an exemplary correlation value table.
  • Figures 19A, 19B and 19C depict a presently preferred embodiment group compounds processing flow.
  • Figure 20 depicts a presently preferred embodiment determine overlapping groups processing flow.
  • Figures 21A and 21B depict a presently preferred embodiment combine groups processing flow.
  • Figure 22 depicts an exemplary array of correlation values for seven compounds.
  • Figures 23A and 23B depict a presently preferred embodiment optimally link compounds in group processing flow.
  • Figure 24 depicts an exemplary correlation value array for a group of twelve compounds.
  • Figure 25 depicts a presently preferred embodiment optimally link groups processing flow.
  • Figure 26 depicts an exemplary array of mean correlation distance values for five groups of compounds.
  • Figures 27A, 27B, 27C, 27D and 27E depict a presently preferred embodiment optimally link groups processing flow.
  • Figure 28 depicts a presently preferred embodiment exemplary pattern matrix output.
  • Figure 29 depicts a presently preferred embodiment generate pattern matrix processing flow.
  • Figure 30 depicts an alternative embodiment exemplary pattern matrix output.
  • variable names are provided for various entities used in the described processes. The variable names are thereafter used to describe the respective entity. It will be apparent, however, to one skilled in the art, that the invention may be practiced with other variable names and/or other complementary entities, without departing from the spirit and scope of the invention.
  • in a pattern recognition oriented cluster method 110, as shown in Figure 5, raw data is gathered 112 on a set of objects to be clustered.
  • the objects to be clustered are chemical compounds, and each raw data value, or compound value, is an interaction value of the respective compound in an environment.
  • an environment is a mutant strain and each raw data value of a compound is an interaction value between the compound and a mutant strain of a set of mutant strains.
  • a further presently preferred embodiment of this invention is that the environment is a gene expression.
  • a similarity measure, or object similarity score, 113 is then generated for each of the object pairs that can be constructed from the set of objects to be clustered.
  • a correlation value, or coefficient, 114 for a respective object pair is generated from the raw data points, or compound values, for each respective object of the object pair; i.e., from the objects' interaction values with a set of mutant strains.
  • the correlation value 114 of an object pair represents the similarity of the interactions of the objects of the pair with a set of mutant strains or gene expressions.
  • the correlation value 114 of one object pair is then compared to the correlation values 114 of every other object pair as a similarity value, or coefficient, 116 of the objects comprising the object pairs.
  • the higher the correlation value 114, which represents a similarity value 116, of an object pair the more similar the objects of the object pair are deemed to be.
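The pairwise scoring described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: Pearson correlation is assumed as the "correlation value" (this passage does not name a formula), and `build_data_array` and the sample profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two interaction profiles (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def build_data_array(profiles):
    """Return a symmetric n x n array; entry [i][j] scores the compound i / compound j pair."""
    n = len(profiles)
    data = [[1.0] * n for _ in range(n)]  # each compound correlates perfectly with itself
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(profiles[i], profiles[j])
            data[i][j] = data[j][i] = r
    return data


# Hypothetical interaction values: three compounds against four mutant strains.
profiles = [
    [0.9, 0.1, 0.8, 0.2],
    [0.8, 0.2, 0.9, 0.1],
    [0.1, 0.9, 0.2, 0.8],
]
DATA = build_data_array(profiles)
# DATA[0][1] is high (similar patterns); DATA[0][2] is strongly negative (opposite patterns).
```

Only the upper triangle is computed and then mirrored, since the score for a pair does not depend on the order of its members.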
  • the objects, e.g., compounds, of the set of objects are clustered, or grouped, 118, using a combination of clustering processes, or techniques, 117.
  • a method of dynamic linkage is used to create groups of similar objects.
  • the groups are then analyzed, to discard those that overlap with other, larger groups, comprising the same objects.
  • the groups are then merged, in order that one particular object is a member of only one group.
  • Each group is then optimized, so that the objects in the group are ordered according to their relative similarity to each other.
  • all of the groups are optimized, so that each group is ordered, in relation to every other group, according to the relative similarity of the objects comprising the respective groups.
  • a process for cluster display 119 is then performed.
  • a pattern matrix is generated and output 120, for displaying the resultant groupings of the set of objects to be clustered.
  • the pattern matrix provides a visual representation of the grouping, similarity, and relative similarity of the objects in a set of objects to be clustered.
  • the resultant pattern matrix comprises all the objects in all the optimized groups arranged along both an X-axis and a Y-axis. The X-axis and Y-axis are themselves comprised of the objects of the set of objects to be clustered.
  • different colors, or shades of color, corresponding to different respective correlation ranges are established in a group of color codes, to enhance the appearance of the resultant output matrix.
  • the corresponding correlation values, i.e., similarity measures, of each object pair, for the set of objects to be clustered, are mapped into a respective color code group, and a block of the respective color is output to the pattern matrix.
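The mapping from correlation values to color blocks can be sketched as a lookup against per-group lower bounds. The bounds and the one-character "colors" below are hypothetical stand-ins (the defaults of Table 2 are not reproduced in this text), and the function names are illustrative.

```python
# Hypothetical color groups, highest range first: (lower correlation bound, symbol).
# A real implementation would use colors; characters keep the sketch printable.
COLOR_GROUPS = [(0.9, "#"), (0.7, "+"), (0.5, ".")]
BACKGROUND = " "  # used for correlations below every lower bound


def color_for(r, groups=COLOR_GROUPS):
    """Return the symbol of the first color group whose lower bound r reaches."""
    for lower, symbol in groups:
        if r >= lower:
            return symbol
    return BACKGROUND


def render_pattern_matrix(order, data):
    """Lay the ordered objects along both axes and emit one block per object pair."""
    return ["".join(color_for(data[i][j]) for j in order) for i in order]


# Hypothetical correlation values for three already-ordered objects.
data = [
    [1.0, 0.95, 0.2],
    [0.95, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
for row in render_pattern_matrix([0, 1, 2], data):
    print(row)
```

Because the same ordering is used on both axes, groups of mutually similar objects show up as dense square blocks along the diagonal.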
  • a method for the dynamic linkage of compounds into groups 155 comprises producing all the initial groups of compounds that meet either default or user-defined criteria. Groups with shared compounds are then merged 160, so that any one object is in only one group; i.e., the objects in each of the established groups of objects are mutually exclusive.
  • a method of intra-group optimization 165 then optimizes the order of the objects within each group.
  • the ordering is accomplished using a nearest neighbor mapping process 170.
  • the two objects in a group, i.e., the first object and the second object, that are most similar in the group, i.e., whose correlation value, or similarity measure, is the highest, are located and linked together.
  • One of the two linked objects, e.g., the first object, is designated the head of the link of objects.
  • the other of the two linked objects, e.g., the second object, is designated the tail of the link of objects.
  • the object i.e., a third object, of the group, if there is one, that is most similar to either one of the first two objects, i.e., a nearest neighbor object, is located. If the third object is more similar to the first object, it is linked next to the first object; i.e., it becomes the new head of the link of objects for the group. In other words, if the correlation value for the first and third objects of the group is larger, or greater, than the correlation value for the second and third objects of the group, the third object is linked next to the first object. Otherwise, if the third object is more similar to the second object, it is linked next to the second object; i.e., it becomes the new tail of the link of objects for the group. In other words, if the correlation value for the second and third objects of the group is larger than the correlation value for the first and third objects of the group, the third object is linked next to the second object.
  • a fourth object of the group if there is one, is located that is most similar to either of the ends, i.e., the head or the tail object, of the existing link of objects of the group.
  • the fourth object is then linked to the end object in the group link that it is more similar to.
  • the intra-group optimization process 165 proceeds through all the objects of a respective group, until all the objects are linked for the group.
  • the resultant link of objects of a group consists of an ordering of the objects that reflects their general relative similarity to one another.
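The head/tail linking steps above can be sketched as follows. This is a minimal sketch under assumptions: `sim` is a symmetric nested list of correlation values indexed by object, `link_objects` is a hypothetical name, and ties are resolved toward the tail, which the text does not specify.

```python
from collections import deque
from itertools import combinations


def link_objects(members, sim):
    """Order the members of one group by repeatedly attaching a nearest neighbor to an end."""
    members = list(members)
    if len(members) < 2:
        return members
    # Seed the link with the most similar pair: first = head, second = tail.
    first, second = max(combinations(members, 2), key=lambda p: sim[p[0]][p[1]])
    chain = deque([first, second])
    remaining = [m for m in members if m not in (first, second)]
    while remaining:
        head, tail = chain[0], chain[-1]
        # Nearest neighbor: the unlinked object most similar to either end of the link.
        best = max(remaining, key=lambda m: max(sim[m][head], sim[m][tail]))
        if sim[best][head] > sim[best][tail]:
            chain.appendleft(best)  # becomes the new head
        else:
            chain.append(best)      # becomes the new tail (also on ties)
        remaining.remove(best)
    return list(chain)


# Hypothetical correlation values for a group of four objects.
SIM = [
    [1.0, 0.9, 0.5, 0.2],
    [0.9, 1.0, 0.8, 0.3],
    [0.5, 0.8, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
]
print(link_objects(range(4), SIM))
```

A double-ended queue fits the procedure naturally, since new objects are only ever attached at the head or the tail of the existing link.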
  • a method of inter-group optimization 175 optimizes the ordering of the groups of objects.
  • a group similarity score is generated for each pair of groups.
  • the group similarity score is a mean cumulative distance value, or average correlation value, for a pair of groups. The mean cumulative distance value for a group pair serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value, the more similar the objects in the two groups generally are.
  • the ordering of the groups is then accomplished using a nearest neighbor mapping process 180 based on the respective mean cumulative distance values.
  • the two groups, i.e., a first group and a second group, that are most similar, i.e., that have the largest mean cumulative distance value, are located and linked together.
  • One of the two linked groups, e.g., the first group, is designated the head of the link of groups.
  • the other of the two linked groups, e.g., the second group, is designated the tail of the link of groups.
  • the group, i.e., a third group, if there is one, that is most similar to either the first or second group, i.e., a nearest neighbor group, is located. If the third group and the first group are more similar, i.e., have the higher mean cumulative distance value, the third group is linked next to the first group; i.e., the third group becomes the new head of the link of groups. Otherwise, if the third group and the second group are more similar, i.e., have the higher mean cumulative distance value, the third group is linked next to the second group; i.e., the third group becomes the new tail of the link of groups.
  • a fourth group, if there is one, that is most similar to one of the end groups, i.e., the head or the tail group, of the existing link of groups is located.
  • the fourth group is then linked to the end group in the link of groups that it is most similar to, i.e., the end group with which it has the higher mean correlation distance value.
  • the inter-group optimization process 175 proceeds through all the groups, until all the groups are linked.
  • the resultant link of groups consists of an ordering of the groups that generally reflects their relative similarity to one another.
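The inter-group ordering can be sketched the same way as the intra-group linking, with a group-level score in place of the object-level correlation. The assumption here is that the group similarity score is the plain average of the correlation values over all cross-group object pairs; the names `group_score` and `link_groups` are hypothetical.

```python
from collections import deque
from itertools import combinations


def group_score(group_a, group_b, sim):
    """Average correlation over every (object in A, object in B) pair (assumed definition)."""
    pairs = [(a, b) for a in group_a for b in group_b]
    return sum(sim[a][b] for a, b in pairs) / len(pairs)


def link_groups(groups, sim):
    """Return group indices ordered by head/tail nearest-neighbor linking."""
    n = len(groups)
    if n < 2:
        return list(range(n))
    score = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score[i][j] = score[j][i] = group_score(groups[i], groups[j], sim)
    # Seed the link with the most similar pair of groups.
    first, second = max(combinations(range(n), 2), key=lambda p: score[p[0]][p[1]])
    chain = deque([first, second])
    remaining = [g for g in range(n) if g not in (first, second)]
    while remaining:
        head, tail = chain[0], chain[-1]
        best = max(remaining, key=lambda g: max(score[g][head], score[g][tail]))
        if score[best][head] > score[best][tail]:
            chain.appendleft(best)  # new head of the link of groups
        else:
            chain.append(best)      # new tail of the link of groups
        remaining.remove(best)
    return list(chain)


# Hypothetical correlation values for four objects split into three groups.
SIM = [
    [1.0, 0.9, 0.5, 0.2],
    [0.9, 1.0, 0.8, 0.3],
    [0.5, 0.8, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
]
groups = [[3], [0, 1], [2]]
print(link_groups(groups, SIM))
```

In this example the groups containing objects 3 and 2 score highest together and seed the link, and the remaining group attaches to whichever end it resembles more.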
  • a pattern recognition oriented cluster process 200 is executed by a computer, i.e., a computer or other processing device or entity.
  • the computer program comprising the pattern recognition oriented cluster process 200 is stored, or otherwise resides, on a data storage device, for example, but not limited to, Read Only Memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), floppy disk, flexible disk, magnetic tape, CD-ROMs, punchcards or papertape.
  • an input data process 210 is executed upon starting 205 the pattern recognition oriented cluster process 200.
  • the input data process 210 inputs the data relating to the set of objects to be clustered from an input file.
  • If the input data relating to the set of objects to be clustered is raw data, i.e., a correlation value, or similarity measure, has not been generated for the object pairs in the set of objects to be clustered, then the input data process 210 also generates the correlation value for each object pair. Upon inputting all relevant data, the pattern recognition oriented cluster process 200 then executes a determine correlation distribution process 215. Generally, the determine correlation distribution process 215 generates a distribution of the correlation values, i.e., similarity measures, for all object pairs in the set of objects to be clustered. The pattern recognition oriented cluster process 200 thereafter executes a set grouping parameters process 220.
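A correlation distribution of the kind described can be sketched as a count of object pairs per correlation range. The range edges below are hypothetical (the ranges of Table 1 are not reproduced in this text), and `correlation_distribution` is an illustrative name.

```python
from bisect import bisect_right

# Hypothetical range edges: [-1, 0), [0, 0.5), [0.5, 0.7), [0.7, 0.9), [0.9, 1.0].
EDGES = [-1.0, 0.0, 0.5, 0.7, 0.9, 1.0]


def correlation_distribution(data, edges=EDGES):
    """Count how many object pairs fall into each correlation range."""
    counts = [0] * (len(edges) - 1)
    n = len(data)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is counted once
            k = bisect_right(edges, data[i][j]) - 1
            k = min(max(k, 0), len(counts) - 1)  # keep r == 1.0 in the top range
            counts[k] += 1
    return counts


# Hypothetical correlation values for four objects.
data = [
    [1.0, 0.95, 0.75, 0.2],
    [0.95, 1.0, -0.3, 0.6],
    [0.75, -0.3, 1.0, 0.91],
    [0.2, 0.6, 0.91, 1.0],
]
print(correlation_distribution(data))
```

A user could inspect such counts before choosing grouping limits, for example to see how many pairs would survive a given correlation cutoff.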
  • the set grouping parameters process 220 allows the user of the pattern recognition oriented cluster process 200 to choose whether to use default, pre-established, grouping and pattern matrix generation parameters, or input their own.
  • the parameters that a user may specify for clustering i.e., cluster parameters, include, but are not limited to, an initial correlation group limit parameter, an add-in correlation group limit parameter, a minimum correlation group limit parameter and a minimum average correlation group limit parameter.
  • the parameters that a user may specify for pattern matrix generation include, but are not limited to, the number of color groups to be used in the output pattern matrix and the correlation ranges for each of the respective color groups.
  • the pattern recognition oriented cluster process 200 also executes a group compounds process 225.
  • the group compounds process 225 groups the objects in the set of objects to be clustered into one or more groups, based on default, or, alternatively, user-specified, cluster parameters. Once all the groups that may be established for the set of objects to be clustered are so established by the group compounds process 225, the pattern recognition oriented cluster process 200 executes a determine overlapping groups process 235. Generally, the determine overlapping groups process 235 determines if there are any established groups whose objects are all contained in at least one other group. If so, the determine overlapping groups process 235 marks the group whose objects are all contained in at least one other group as an overlapping group. In a presently preferred embodiment, the pattern recognition oriented cluster process 200 does no further processing of any and all the groups marked as overlapping.
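The overlap check can be sketched with set containment. Two points are assumptions rather than statements from the text: a group is marked only when another group strictly contains it, and when two groups hold exactly the same objects the later one is marked so that one copy survives. The name `mark_overlapping` is hypothetical.

```python
def mark_overlapping(groups):
    """Return one flag per group; True means every object of the group sits in another group."""
    sets = [set(g) for g in groups]
    flags = []
    for i, s in enumerate(sets):
        overlapped = any(
            s < t                   # strictly contained in a larger group
            or (s == t and j < i)   # exact duplicate of an earlier group (assumed rule)
            for j, t in enumerate(sets)
            if j != i
        )
        flags.append(overlapped)
    return flags


# Hypothetical groups of compound identifiers.
groups = [[1, 2, 3], [2, 3], [4, 5], [1, 2, 3]]
print(mark_overlapping(groups))
```

Groups flagged here would simply be skipped by the later steps, matching the statement that no further processing is done on groups marked as overlapping.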
  • the pattern recognition oriented cluster process 200 executes a combine groups process 240.
  • the combine groups process 240 combines one or more groups that have one or more objects in common.
  • the objects in each of the remaining groups are mutually exclusive, i.e., no one object is in more than one group.
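One simple way to reach mutually exclusive groups is to merge any groups that share an object, repeating until none do. The sketch below does exactly that; it is an assumption-level illustration, since the actual combine groups flow (Figures 21A and 21B) may apply further criteria that are not described in this passage. The name `combine_groups` is hypothetical.

```python
def combine_groups(groups):
    """Merge groups that share at least one object until all groups are disjoint."""
    merged = []
    for group in groups:
        current = set(group)
        untouched = []
        for existing in merged:
            if existing & current:
                current |= existing  # absorb every earlier group that shares an object
            else:
                untouched.append(existing)
        untouched.append(current)
        merged = untouched
    return merged


# Hypothetical groups: [2, 3] bridges the first two groups, so all three merge.
result = combine_groups([[1, 2], [3, 4], [2, 3], [5]])
print([sorted(g) for g in result])
```

Merging is transitive here: two groups with no object in common still end up together if a third group overlaps both.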
  • the pattern recognition oriented cluster process 200 also executes an optimally link compounds in group process 245. Generally, the optimally link compounds in group process 245 optimally orders the objects in each group, based on the respective object pairs' correlation values, or similarity measures. The pattern recognition oriented cluster process 200 thereafter executes an optimally link groups process 250. Generally, the optimally link groups process 250 optimally orders the groups, based on group similarity scores generated for each pair of groups. In a presently preferred embodiment, a group similarity score is a mean cumulative distance value for a pair of groups.
  • the pattern recognition oriented cluster process 200 also executes a generate pattern matrix process 255.
  • the generate pattern matrix process 255 generates and outputs to, for example, but not limited to, an output file or a computer terminal screen, a pattern matrix of the grouped objects of the set of objects to be clustered.
  • the resultant pattern matrix provides a visual representation of the grouping, similarity, and relative similarity of the objects in the set of objects to be clustered.
  • the objects in the set of objects to be clustered are chemical compounds. Each compound is interacted with, or otherwise in, one or more environments.
  • an environment is a mutant strain or a gene expression.
  • Each raw data value, or compound value, in a respective input file is an interaction value between the respective compound and a mutant strain or gene expression.
  • the object similarity score for each object, e.g., compound, pair is a value indicative of the similarity of the objects in the pair when interacting with the various mutant strains in the set of mutant strains or the various gene expressions in the set of gene expressions.
  • an input data process 210 is executed. Generally, the input data process 210 inputs the data relating to the set of compounds to be clustered from an input file.
  • the input data process 210 accepts 302 an input file containing either raw data or object similarity score data for a set of compounds to be clustered. If the input data comprises object similarity score values 304, the data is read in 308 from the input file. Each object similarity score value read in from the input file is then stored 310 in an array of object similarity score values, e.g., the DATA array. More specifically, each object similarity score for a compound I - compound J pair is stored in DATA[I][J]. When all the object similarity scores have been read in from the input file and stored in the array, the input data process 210 is ended 312.
  • the raw data for each compound-mutant strain pair is read into an array.
  • any raw data value for a compound I - mutant strain J pair that falls outside an upper or lower threshold value is set 316 to the respective threshold value.
  • the raw data value for any compound I - mutant strain J pair is set 316 to the high threshold value.
  • the raw data value for the compound I - mutant strain J pair is set 316 to the low threshold value.
  • an object similarity score for compound I is generated 318.
  • Each generated object similarity score for a compound I - compound J pair is stored 320 in an array of object similarity scores, e.g., DATA array. More specifically, each generated object similarity score for a compound I - compound J pair is stored in a respective DATA[I][J] entry.
  • an input data process 350 requests the user to input one of two data formats to a computer, i.e., a computer or other processing device or entity, executing the pattern recognition oriented cluster process 200.
  • the user is requested to input 355 a value of one ("1") if the input file to be used for the pattern recognition oriented cluster process 200 comprises raw data for a set of compounds to be clustered, and a set of mutant strains.
  • the user is requested to input 355 a value of two ("2") if the input file comprises object similarity scores for compound pairs in the respective set of compounds to be clustered.
  • the user is then requested to input 360 to the computer the input file name for either the raw data, or the object similarity score data.
  • the user is also requested to input 365 to the computer the total number of compounds, e.g., N, in the set of compounds to be clustered.
  • the total number of compounds in a set of compounds to be clustered should be in the range of two to five hundred.
  • the input data process 350 then checks 370 the user-indicated data format for the input file. If the data format is equal to two ("2") 372, meaning the input file contains object similarity score data, the input data process 350 reads in 374 from the input file the names of each of the compounds in the set of compounds to be clustered. The input data process 350 stores 374 each compound name in an appropriate entry in a table, e.g., NAME[ ].
  • the object similarity score for the first I th compound - first J th compound pair is stored in DATA[0][0].
  • the object similarity score for the third I th compound - second J th compound pair is stored in DATA[2][1].
  • the array of object similarity scores e.g., DATA array
  • DATA array is indexed from zero.
  • DATA array is indexed from one.
  • the input data process 350 checks 370 the user-indicated data format for the input file. If the data format is equal to one ("1") 384, meaning the input file comprises raw data for a set of compounds to be clustered and a set of mutant strains, the input data process 350 requests the user to input 386 to the computer, i.e., computer or other processing device or entity, the total number of mutant strains, e.g., STRN, used for generating the raw data. In a presently preferred embodiment, the total number of mutant strains should be in the range of one to fifty.
  • the user is also requested to input 388 to the computer a high limit raw data value, e.g., HIGHROUND.
  • the suggested value for HIGHROUND is one hundred.
  • the user is also requested to input 390 to the computer a low limit raw data value, e.g., LOWROUND.
  • the suggested value for LOWROUND is zero.
  • the high limit raw data value and the low limit raw data value are used to bound the input raw data for a set of compounds to be clustered, and a set of mutant strain pairs.
  • the input data process 350 then loops I 392 times, once for each of the compounds in the set of compounds, e.g., N compounds, to be clustered.
  • the input data process 350 for each compound I, i.e., I th compound, loops J 394 times, once for each entry in the input file for compound I.
  • the input data process 350 checks 396 if the input entry from the input file is the first entry for compound I. In a presently preferred embodiment, if it is 426 the first entry for compound I, it is the compound code identification for compound I.
  • the input data process 350 reads in 402 the compound code for compound I from the input file, and stores it in an appropriate entry in a table, e.g., CODE[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
  • the input data process 350 checks 398 if it is the second entry in the input file for compound I. In a presently preferred embodiment, if it is 430 the second entry for compound I, it is the compound name for compound I.
  • the input data process 350 reads in 404 the compound name for compound I from the input file, and stores it in an appropriate entry in a table, e.g., NAME[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
  • the input data process 350 checks 400 if it is the third entry in the input file for compound I. In a presently preferred embodiment, if it is 434 the third entry for compound I, it is the compound description identification for compound I.
  • the input data process 350 reads in 406 the compound description for compound I from the input file, and stores it in an appropriate entry in a table, e.g., DESC[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
  • if the current input entry for compound I is not 438 the third entry in the input file for compound I, then it is a raw data value, for example, but not limited to, an interaction value or an inhibitor value, for compound I and one of the set of mutant strains.
  • the input data process 350 reads in 408 the raw data value for the compound I - mutant strain K pair and stores it in an appropriate entry in a temporary array, e.g., V[I][K].
  • the mutant strain K value is related to the loop J 394 value, but it is not the same, as loop J 394 loops through more than the respective raw data values for a compound I.
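The per-compound entry loop described above (code, then name, then description, then raw data values) can be sketched as follows. The function name `read_compound`, the dictionary layout, and the zero-based offset `k = j - 3` are illustrative assumptions; the source does not give the file syntax, so the entries are assumed to be already split into a list.

```python
def read_compound(entries):
    """Split one compound's entries into code, name, description and raw values."""
    record = {"code": None, "name": None, "desc": None, "values": []}
    for j, entry in enumerate(entries):
        if j == 0:
            record["code"] = entry      # first entry: compound code
        elif j == 1:
            record["name"] = entry      # second entry: compound name
        elif j == 2:
            record["desc"] = entry      # third entry: compound description
        else:
            # Remaining entries are raw data values. With zero-based
            # counting (an assumption), the strain index is k = j - 3.
            record["values"].append(float(entry))
    return record


rec = read_compound(["C001", "cmpdA", "test compound", "12.5", "0", "101"])
```

The loop index runs over all entries, which is why the strain index is related to it but not equal to it.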
  • An exemplary array V 450 depicts raw data, for example, but not limited to, interaction values or inhibitor values, for n compounds 452 and p mutant strains 454.
  • exemplary array V 450 there are n times p (n x p) raw data values 456; i.e., one raw data value for each compound n - mutant strain p pair.
  • the input data process 350 checks 410 whether the input raw data value for compound I - mutant strain K, e.g., V[I][K], is greater than the high limit raw data value, e.g., HIGHROUND. If it is 412, the input raw data value is set 414 to HIGHROUND, e.g., V[I][K] is set equal to HIGHROUND. The input data process 350 then loops J 394, to process the next input entry for compound I.
  • the input data process 350 checks 418 to see if it is less than the low limit raw data value, e.g., LOWROUND. If it is 420, the input raw data value is set 422 to LOWROUND, e.g., V[I][K] is set equal to LOWROUND. The input data process 350 then loops J 394, to process the next input entry for compound I.
  • the input data process 350 loops J 394, to process the next input entry value for compound I.
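The bounding of each raw data value by HIGHROUND and LOWROUND amounts to a clamp. A minimal sketch, using the suggested values of one hundred and zero as defaults; the function name `clamp_raw_value` is hypothetical.

```python
def clamp_raw_value(value, low_round=0.0, high_round=100.0):
    """Bound a raw data value to the closed range [low_round, high_round]."""
    if value > high_round:
        return high_round   # above the high limit: set to HIGHROUND
    if value < low_round:
        return low_round    # below the low limit: set to LOWROUND
    return value            # already within bounds: unchanged


clamped = [clamp_raw_value(v) for v in (120.0, -5.0, 42.5)]
```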
  • the input data process 350 loops J 440 times, once for all the compounds up to and including the I th compound. For example, if compound I is the fifth compound to be processed by the input data process 350, then the input data process 350 loops J five times, through the first to fifth compounds. For each loop J 440, the input data process 350 generates an object similarity score for the compound I - compound J pair.
  • the object similarity score generated for a compound I - compound J compound pair is a correlation value.
  • the correlation value i.e., the similarity value or measure, for a compound I - compound I pair is one, as compound I is being correlated with itself.
  • a correlation value for a compound I - compound J compound pair is generated from at least one raw data value for compound I and at least one raw data value for compound J.
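  • A presently preferred embodiment of an equation for generating a correlation value for a compound I - compound J pair is shown in Equation 1.
  • $r = \frac{\sum (I - \bar{I})(J - \bar{J})}{\sqrt{\sum (I - \bar{I})^2 \, \sum (J - \bar{J})^2}}$ (Equation 1)
  • In Equation 1, $I$ and $J$ are the raw data values for compounds I and J and the respective mutant strains, $\bar{I}$ and $\bar{J}$ are the respective mean values for the compounds I and J, and each sum runs over the mutant strains.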
  • After each correlation value is generated 442 for a respective compound I - compound J pair, the input data process 350 loops J 440, to generate the correlation value for the next compound I - compound J pair. When all compounds J have been looped 440 through, the input data process 350 loops 392 to process the next compound I raw data from the input file. When all compounds I have been looped 392 through, i.e., their respective data is input and processed, the input data process 350 is ended 449.
  • the correlation value for a compound pair is stored in an array, e.g., DATA[ ][ ].
  • the correlation value for a compound I - compound J pair is stored in DATA[I][J].
  • the array of correlation values, e.g., DATA, is indexed from zero.
  • DATA[0][0] represents the correlation value for the first compound-first compound pair.
  • DATA[1][3] represents the correlation value for the second compound-fourth compound pair.
  • the DATA array is indexed from one.
  • An exemplary DATA array 460 of correlation values is shown in Figure 12, for n compounds.
  • DATA array 460 has n compound rows 462 and n compound columns 464.
  • because each correlation value 466 of DATA array 460 is a correlation value for a compound - compound pair, the DATA array 460 is indexed by compound for both its rows and its columns.
  • a DATA array 460 need only have valid entries in its lower half 468.
  • the top half 470 of a DATA array 460 is a mirror image of the correlation values in the lower half 468 of the DATA array 460.
  • the correlation values on the diagonal 472 of a DATA array 460 all have a value of one. This is because the correlation values on the diagonal 472 are the values for the identical compound pairs, i.e., compound I - compound I pairs.
  • the correlation values i.e., similarity values or measures or scores, are necessarily one for a compound I - compound I pair as compound I is being correlated with itself.
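Since only the lower half and the diagonal carry information, the DATA array can be held as a ragged lower triangle. The sketch below is one possible layout, not the one the source prescribes: `build_lower_triangle`, `lookup`, and the `score` callback are hypothetical names, and indexing is zero-based.

```python
def build_lower_triangle(n, score):
    """Row i holds the scores for pairs (i, 0) .. (i, i); the diagonal is one."""
    return [[1.0 if i == j else score(i, j) for j in range(i + 1)]
            for i in range(n)]


def lookup(data, i, j):
    """Read the score for pair (i, j), using symmetry for the upper half."""
    if i == j:
        return 1.0
    return data[i][j] if i > j else data[j][i]


# Toy scoring function, used only to fill the array for illustration.
data = build_lower_triangle(4, lambda i, j: (i + j) / 10)
```

Reading `lookup(data, 1, 3)` returns the same value as `lookup(data, 3, 1)`, which is the mirror-image property described above.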
  • an object similarity score is generated 442 for each compound pair in the set of compounds to be clustered if object similarity scores are not provided in the input file, or otherwise provided to the pattern recognition oriented cluster process 200.
  • an object similarity score is a correlation value
  • a correlation value for a compound pair is generated from at least one raw data, or compound, value for the first compound of the compound pair and at least one raw data, or compound, value for the second compound of the compound pair.
  • the correlation value for a compound pair is derived from all the raw data values for the respective compounds of the compound pair and the mutant strains from the set of mutant strains.
  • summation variables, e.g., SUM_AA, SUM_AB and SUM_BB, are initialized 482 to a value of zero.
  • the calculate correlation value process 480 generates a correlation value for a compound I - compound J pair.
  • the average raw data value, e.g., AVE_A, for compound I is generated 483.
  • all compound I - mutant strain raw data values, for all mutant strains, are added together, and the sum is divided by the total number of mutant strains to generate the average raw data value, e.g., AVE_A, for compound I.
  • the average raw data value, e.g., AVE_B, for compound J is also generated 484.
  • all compound J - mutant strain raw data values, for all mutant strains, are added together, and the sum is divided by the total number of mutant strains to generate the average raw data value, e.g., AVE_B, for compound J.
  • the calculate correlation value process 480 then loops K 485 times, once for each of the mutant strains in the set of mutant strains. For each K loop 485, the calculate correlation value process 480 sets 486 the entry in the temporary array of raw data for the compound I - mutant K pair, e.g., V[I][K], to its original raw data value minus the average raw data value for compound I, e.g., AVE_A, as shown in Equation 2.
  • V[I][K] = V[I][K] - AVE_A (Equation 2)
  • the calculate correlation value process 480 also sets 487 the entry in the temporary array of raw data for the compound J - mutant K pair, e.g., V[J][K], to its original raw data value minus the average raw data value for compound J, e.g., AVE_B, as shown in Equation 3.
  • V[J][K] = V[J][K] - AVE_B (Equation 3)
  • each raw data, or compound, value, for a respective compound is decreased, or otherwise altered, by the average raw data value for the compound.
  • the calculate correlation value process 480 also generates 488 a new value of SUM_AB, SUM_AA and SUM_BB. More specifically, for each loop K 485, the new value for the compound I - mutant K pair in the temporary array of raw data, e.g., V[I][K], and the new value for the compound J - mutant K pair in the temporary array of raw data, e.g., V[J][K], are multiplied together and added to a running sum, e.g., SUM_AB, as shown in Equation 4: SUM_AB = SUM_AB + V[I][K] * V[J][K].
  • the raw data value for each compound I - mutant K pair, altered by the average raw data value for compound I, is multiplied by the raw data value for the respective compound J - mutant K pair, altered by the average raw data value for compound J.
  • a running summation, e.g., SUM_AB, of these multiplications is also generated.
  • the new value for the compound I - mutant K pair in the temporary array of raw data, e.g., V[I][K], is squared, i.e., multiplied with itself, and added to a running sum, e.g., SUM_AA, as shown in Equation 5: SUM_AA = SUM_AA + V[I][K] * V[I][K].
  • the new value for the compound J - mutant K pair in the temporary array of raw data, e.g., V[J][K], is squared, i.e., multiplied with itself, and added to a running sum, e.g., SUM_BB, as shown in Equation 6: SUM_BB = SUM_BB + V[J][K] * V[J][K].
  • the raw data value for each compound J - mutant K pair, altered by the average raw data value for compound J, is multiplied by itself.
  • a running summation, e.g., SUM_BB, of these multiplications is also generated.
  • the correlation value for the compound I - compound J pair is generated 490.
  • the correlation value for a compound I - compound J pair is set to SUM_AB divided by the square root of the multiplication of SUM_AA and SUM_BB, as shown in Equation 7: DATA[I][J] = SUM_AB / sqrt(SUM_AA * SUM_BB).
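The calculate correlation value steps can be sketched as a single function. This is an illustration under stated assumptions: the name `correlation` is hypothetical, and, unlike the description, the mean-centered values are kept in local variables rather than written back into the temporary array V. The source does not say how a compound with constant raw data (zero denominator) is handled; here the division error simply propagates.

```python
import math


def correlation(values_a, values_b):
    """Correlation of two raw data profiles measured over the same strains."""
    n = len(values_a)
    ave_a = sum(values_a) / n          # AVE_A
    ave_b = sum(values_b) / n          # AVE_B
    sum_ab = sum_aa = sum_bb = 0.0     # running sums start at zero
    for a, b in zip(values_a, values_b):
        da = a - ave_a                 # value centered on the compound I mean
        db = b - ave_b                 # value centered on the compound J mean
        sum_ab += da * db              # SUM_AB
        sum_aa += da * da              # SUM_AA
        sum_bb += db * db              # SUM_BB
    # Raises ZeroDivisionError if either profile is constant.
    return sum_ab / math.sqrt(sum_aa * sum_bb)


r_same = correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Two profiles that rise together give a value near one, and mirrored profiles give a value near minus one.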
  • a determine correlation distribution process 215 is executed.
  • the determine correlation distribution process 215 generates a distribution of the correlation values, i.e., similarity measures, for all compound pairs in the set of compounds to be clustered.
  • the determine correlation distribution process 215 loops 502 through each unique correlation value in the array of correlation values, e.g., DATA array.
  • the unique correlation values for a set of compounds to be clustered are stored in the lower half, or bottom, 468 of the DATA array, below, and not including, the diagonal 472.
  • the determine correlation distribution process 215 loops 502 through the lower half 468 of the DATA array, i.e., through the unique correlation values.
  • the determine correlation distribution process 215 multiplies 504 each unique correlation value by ten ("10").
  • the determine correlation distribution process 215 determines 506 the group, of a number of groups of correlation ranges, that the respective unique correlation value, multiplied by a factor of ten, is within.
  • Table 1 is a presently preferred embodiment of the groups of correlation ranges.
  • Column A of Table 1 represents the presently preferred range of correlation values for each of the nineteen groups.
  • Column B of Table 1 represents the presently preferred range of correlation values for each of the nineteen groups multiplied by a factor of ten.
  • the determine correlation distribution process 215 keeps a running summation 508, e.g., COUNT[ ], of the number of unique correlation values from the DATA array that are within each of the nineteen groups of correlation ranges.
  • the determine correlation distribution process 215 also keeps a running summation 510, e.g., SUM, of all the unique correlation values in the DATA array.
  • the determine correlation distribution process 215 generates 512 the average correlation value from the running summation, e.g., SUM, for all the unique correlation values, i.e., for all unique compound I - compound J pairs, where I does not equal J.
  • the determine correlation distribution process 215 also determines 514 the percentage of unique correlation values in each of the nineteen groups of correlation ranges, from the various stored running summations, e.g., COUNT[ ], for all the unique correlation values, i.e., for all unique compound I - compound J pairs, where I does not equal J.
  • the determine correlation distribution process 215 determines 514 the percentage of unique correlation values in the DATA array that are in the 1 st group, e.g., COUNT[0], of correlation ranges, the percentage of unique correlation values in the DATA array that are in the 2 nd group of correlation ranges, e.g., COUNT[1], and so on.
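The distribution step can be sketched as a histogram over the entries below the diagonal. Table 1 is not reproduced in this text, so the bin boundaries below are purely an assumption: each value is multiplied by ten, truncated toward zero, limited to the range -9 to 9 (nineteen bins), and bin zero is taken to be the highest range. The function name `correlation_distribution` is hypothetical, and at least two compounds are assumed.

```python
def correlation_distribution(data):
    """Histogram, mean and percentages of the below-diagonal correlation values."""
    count = [0] * 19            # COUNT[ ]: one counter per assumed range
    total = 0.0                 # SUM of all unique correlation values
    n_pairs = 0
    for i in range(len(data)):
        for j in range(i):      # strictly below the diagonal
            c = data[i][j]
            scaled = max(-9, min(9, int(c * 10)))  # assumed binning rule
            count[9 - scaled] += 1                 # bin 0 = highest range
            total += c
            n_pairs += 1
    average = total / n_pairs
    percent = [100.0 * k / n_pairs for k in count]
    return count, average, percent


data = [[1.0], [0.7, 1.0], [0.5, 0.8, 1.0]]
count, average, percent = correlation_distribution(data)
```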
  • a set grouping parameters process 220 is executed.
  • the set grouping parameters process 220 allows the user of the pattern recognition oriented cluster process 200 to choose whether to use the default, pre-established parameters for grouping, i.e., group selection, and for pattern matrix generation, or to input their own.
  • a presently preferred embodiment of the set grouping parameters process 220 requests 566 the user to input to the computer, i.e., computer or other processing device or entity, whether they wish to change the default criteria for establishing groups of compounds.
  • the user is requested 566 to indicate whether or not they wish to choose their own group formation selection, or correlation selecting, or grouping, limits, or parameters.
  • the set grouping parameters process 220 expects a YES or NO response from the user. If the user responds with a NO 567, indicating that they wish to use the default correlation selecting limits, or grouping parameters, the set grouping parameters process 220 checks 565 whether there are any correlation values in the DATA array that have values that are greater than or equal to the default first group selection parameter value.
  • the set grouping parameters process 220 checks the number of correlation values in the first, second and third groups of correlation ranges, e.g., COUNT[0], COUNT[1] and COUNT[2], which were set in the determine correlation distribution process 215.
  • the first group selection parameter is an initial correlation group limit parameter.
  • the default initial correlation group limit parameter e.g., INICORR
  • the default initial correlation group limit parameter has a value of 0.7.
  • the set grouping parameters process 220 ends 570 the pattern recognition oriented cluster process 200. This is because there are no groups that can be formed with the default group selection parameters, or correlation selecting limits.
  • the set grouping parameters process 220 sets 571 each of the group selection parameters to a default value.
  • a first group selection parameter is used for selecting a first compound pair for a new group; it is a group core defining limit.
  • the first pair of compounds of a new group of compounds must have a correlation value greater than or equal to the first group selection parameter.
  • the first group selection parameter is an initial correlation group limit parameter, e.g., INICORR, whose default value is 0.7.
  • a second group selection parameter is used for selecting an add-in, or new, compound for a group.
  • the second group selection parameter is an add-in score, or limit or value, that defines the nearest neighbor compounds of the core group of compounds.
  • an add-in, or new, compound to a group of compounds must, with a compound member of the group, have a correlation value greater than or equal to the second group selection parameter.
  • the second group selection parameter is an add-in correlation group limit parameter, e.g., MAXCORR, whose default value is 0.6.
  • a third group selection parameter is used for defining the furthest neighbor compounds in a core group of compounds.
  • an add-in, or new, compound to a group of compounds must, with each compound member of the group, have a correlation value greater than or equal to the third group selection parameter.
  • the third group selection parameter is a minimum correlation group limit parameter, e.g., MINCORR, whose default value is 0.4.
  • a fourth group selection parameter is used to define the compactness of the established groups of compounds.
  • the fourth group selection parameter is a mean distance score, or limit, of a potential add-in compound and all the compounds already in the group.
  • the average correlation value for the pairs of compounds comprised of the potential add-in compound and each compound already in the group must be greater than or equal to the fourth group selection parameter.
  • the fourth group selection parameter is a minimum average correlation group limit parameter, e.g., AVECORR, whose default value is 0.5.
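The four group selection parameters and their default values can be collected in one small structure. A minimal sketch: the class name `GroupingParameters` and the helper `passes_neighbor_limits` are hypothetical, and the helper covers only the second and third parameters (nearest and furthest neighbor limits).

```python
from dataclasses import dataclass


@dataclass
class GroupingParameters:
    inicorr: float = 0.7   # first parameter: limit for a group's initial pair
    maxcorr: float = 0.6   # second parameter: add-in (nearest neighbor) limit
    mincorr: float = 0.4   # third parameter: furthest neighbor limit
    avecorr: float = 0.5   # fourth parameter: minimum average correlation


def passes_neighbor_limits(corrs_with_members, params):
    """Check a candidate's correlations with every current group member."""
    near = any(c >= params.maxcorr for c in corrs_with_members)
    far = all(c >= params.mincorr for c in corrs_with_members)
    return near and far


defaults = GroupingParameters()
ok = passes_neighbor_limits([0.5, 0.8], defaults)
```

A user-specified set of limits would be expressed as, for example, `GroupingParameters(inicorr=0.8)`, leaving the other defaults in place.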
  • the set grouping parameters process 220 requests the user to input 569 a value for each of four correlation group selection parameters.
  • the set grouping parameters process 220 requests the user to input an initial correlation value for selecting a first compound pair for a new group, e.g., INICORR.
  • the set grouping parameters process 220 also requests the user to input a correlation value for adding a new compound to an existing group, e.g., MAXCORR.
  • the set grouping parameters process 220 also requests the user to input an average correlation value, e.g., AVECORR, for all compounds in a group, whenever a new compound is a possible addition to the group.
  • the set grouping parameters process 220 also requests the user to input a minimum correlation value for any compound to be added to a group, e.g., MINCORR.
  • the set grouping parameters process 220 sets 572 the default number of output color groups, e.g., COLOR[ ], to six, for use by a generate pattern matrix process 255 of the pattern recognition oriented cluster process 200, for formatting and outputting a pattern matrix.
  • the set grouping parameters process 220 then requests 575 the user to indicate whether they wish to use their own number of output color groups.
  • the set grouping parameters process 220 sets 580 each of the six default color groups to the lower correlation value of their correlation value range. Specifically, in a presently preferred embodiment, the set grouping parameters process 220 sets each of the default color groups, e.g., COLOR[0] to COLOR[5], to a default lower correlation value.
  • the default correlation value range and the default lower correlation value for each of the six default color groups is shown in Table 2.
  • the user is requested 578 to input to the computer, i.e., computer or other processing device or entity, the number of color groups, e.g., M, that they want.
  • the maximum number of color groups that a user may designate is ten.
  • For each color group, e.g., COLOR[ ], the set grouping parameters process 220 then requests 579 the user to input to the computer the respective lower correlation value. In a presently preferred embodiment, the set grouping parameters process 220 suggests that the user set the final color group lower correlation value to -1.0. In a presently preferred embodiment, COLOR[ ] is indexed from zero. The set grouping parameters process 220, for each color group, e.g., X from one to the total number of color groups, e.g., M, sets COLOR[X-1] to the user inputted lower correlation value.
  • COLOR[ ] is indexed from one.
  • the set grouping parameters process 220 for each color group X from one to the total number of color groups M sets COLOR[X] to the user inputted lower correlation value.
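Because each COLOR[ ] entry holds the lower correlation value of its range, a correlation value can be mapped to a color group by scanning the lower bounds. The sketch below assumes the bounds are listed from highest to lowest and end at -1.0; the bounds shown are hypothetical placeholders and are not the Table 2 defaults, and `color_group` is a hypothetical name.

```python
MAX_COLOR_GROUPS = 10  # the stated maximum a user may designate


def color_group(corr, lower_bounds):
    """Return the zero-based index of the first range whose lower bound corr meets."""
    if len(lower_bounds) > MAX_COLOR_GROUPS:
        raise ValueError("at most ten color groups may be designated")
    for index, lower in enumerate(lower_bounds):
        if corr >= lower:
            return index
    return len(lower_bounds) - 1  # below every bound: use the final group


# Hypothetical lower bounds for six color groups (not taken from Table 2).
bounds = [0.8, 0.6, 0.4, 0.2, 0.0, -1.0]
indices = [color_group(c, bounds) for c in (0.85, 0.6, -0.5)]
```

Setting the final lower bound to -1.0 guarantees that every valid correlation value falls into some color group.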
  • the set grouping parameters process 220 is ended 581. Referring again to Figure 8, upon executing the set grouping parameters process 220 in the pattern recognition oriented cluster process 200, a group compounds process 225 is executed.
  • the group compounds process 225 groups the compounds in the set of compounds to be clustered into one or more groups, based on default, or alternatively, user-specified, group selection parameters, or correlation selecting limits or cluster parameters.
  • the group compounds process 225 first writes 542 the mirror image of the bottom half of the correlation value array, e.g., DATA, to the top half of the array.
  • the group compounds process 225 writes 542 all the values in the bottom half 651 of the DATA array to the respective entries in the top half 653 of the array.
  • the resultant DATA array e.g. DATA array 655 of Figure 17B
  • the correlation value in DATA[2][1] is written to DATA[1][2].
  • the correlation value in DATA[3][1] is written to DATA[1][3].
  • the correlation value in DATA[3][2] is written to DATA[2][3].
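The mirror-image write can be sketched as a double loop over the entries below the diagonal. This assumes DATA is held as a full square list of lists with zero-based indices; `mirror_lower_to_upper` is a hypothetical name.

```python
def mirror_lower_to_upper(data):
    """Copy each below-diagonal entry DATA[i][j] into DATA[j][i], in place."""
    n = len(data)
    for i in range(n):
        for j in range(i):
            data[j][i] = data[i][j]
    return data


data = [
    [1.0, 0.0, 0.0],
    [0.7, 1.0, 0.0],
    [0.5, 0.8, 1.0],
]
mirror_lower_to_upper(data)
```

After the call, every row of the array can be scanned in full when searching for add-in compounds.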
  • the group compounds process 225 loops 526 through all compound pairs in the DATA array.
  • the group compounds process 225 attempts to locate 528 an initial compound pair for a new group, e.g., G; i.e., the group compounds process 225 attempts to locate 528 a compound pair whose correlation value is greater than or equal to the initial correlation group limit parameter INICORR.
  • the presently preferred default value for INICORR is 0.7.
  • the first compound pair with a correlation value greater than or equal to the default value of INICORR is the pair of compounds one and three; i.e., DATA[1][3] is equal to 0.7.
  • DATA[1][1] while having a correlation value of 1.0, which is greater than INICORR equal to 0.7, is not the first compound pair as it is comprised of a single compound, rather than a compound pair.
  • the first two compounds of the new group G are compound one and compound three.
  • the group compounds process 225 upon locating 528 an initial pair of compounds for a new group G, loops 529 through all the compound pairs, now attempting to locate 530 a potential compound to add into group G. More specifically, in a presently preferred embodiment, the group compounds process 225 loops 529 through all the possible compound pairs looking for a potential add-in compound, i.e., a compound to be added to the group G, whose correlation value with a compound already in group G is greater than or equal to the add-in correlation group limit parameter MAXCORR.
  • the presently preferred default value for MAXCORR is 0.6.
  • the first compound that, when paired with compound one or compound three of the present group G, has a correlation value greater than or equal to the default value of MAXCORR is compound six; i.e., DATA[3][6] is equal to 0.8.
  • compound six is a potential add-in compound to the present group G.
  • DATA[1][1] and DATA[1][3] have a correlation value greater than or equal to MAXCORR; i.e., compound one - compound one has a correlation value of one, and the compound one - compound three pair has a correlation value of 0.7.
  • compounds one and three, corresponding to the respective correlation values DATA[1][1] and DATA[1][3] are already in group G.
  • the first correlation value e.g., DATA[3][1]
  • the first correlation value has a correlation value greater than MAXCORR.
  • compounds three and one, corresponding to the respective correlation value DATA[3][1] are already in group G.
  • the next entry in row 662 with a correlation value greater than MAXCORR is
  • Upon locating 530 a potential add-in compound for group G, the group compounds process 225 checks 531 whether the correlation values of the potential add-in compound paired with each compound already in group G are all greater than or equal to the minimum correlation group limit parameter MINCORR.
  • the presently preferred default value for MINCORR is 0.4.
  • the compounds currently in group G are one and three.
  • the potential add-in compound for group G is six.
  • the correlation value for the compound one - compound six pair e.g., DATA[1][6] is 0.5, which is greater than the default value MINCORR.
  • the correlation value for the compound three - compound six pair e.g., DATA[3][6] is 0.8, which is greater than the default value MINCORR.
  • the group compounds process 225 loops 529 to the next compound pair in search of a potential add-in compound for group G. In a presently preferred embodiment, if all the correlation values of the potential add-in compound paired with each compound already in group G are 533 greater than or equal to MINCORR, the group compounds process 225 checks 534 whether, assuming the potential add-in compound were included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter AVECORR.
  • the presently preferred default value for AVECORR is 0.5.
  • the compounds currently in group G are one and three, and the potential add-in compound for group G is six.
  • the group compounds process 225 checks the average correlation value of the compound pairs compound one - compound three, e.g., DATA[1][3], compound one - compound six, e.g., DATA[1][6] and compound three - compound six, e.g., DATA[3][6].
  • the correlation value for the compound one - compound three pair e.g., DATA[1][3] is 0.7
  • the correlation value for the compound one - compound six pair, e.g., DATA[1][6] is 0.5
  • the correlation value for the compound three - compound six pair, e.g., DATA[3][6] is 0.8.
  • the average correlation value of DATA[1][3], DATA[1][6] and DATA[3][6] is approximately 0.67, which is greater than the default value AVECORR.
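The average check in this example can be reproduced directly. The sketch stores the three example correlation values in a dictionary keyed by compound pair, which is an illustrative layout rather than the DATA array itself; `average_pair_correlation` is a hypothetical name.

```python
from itertools import combinations


def average_pair_correlation(corr, members):
    """Mean correlation over every pair of compounds in `members`."""
    pairs = list(combinations(sorted(members), 2))
    return sum(corr[pair] for pair in pairs) / len(pairs)


# Example values from the text: DATA[1][3], DATA[1][6] and DATA[3][6].
corr = {(1, 3): 0.7, (1, 6): 0.5, (3, 6): 0.8}
AVECORR = 0.5  # default minimum average correlation group limit

average = average_pair_correlation(corr, [1, 3, 6])
accepted = average >= AVECORR
```

The mean of 0.7, 0.5 and 0.8 is two thirds, so compound six passes the AVECORR test with the default limit.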
  • the group compounds process 225 loops 529 to the next compound pair, in search of a potential add-in compound for group G.
  • the group compounds process 225 checks 537 whether the potential add-in compound is already a member of group G.
  • the group compounds process 225 loops 529 to the next compound pair, in search of a potential add-in compound for group G. If, however, the potential add-in compound is not 539 a member of group G, the group compounds process 225 adds 540 the potential add-in compound to group G. Thus, in the example of DATA array 660 of Figure 18, compound six is added to group G, which is already comprised of compounds one and three. The group compounds process 225 then loops 529 to the next compound pair, in search of another potential add-in compound for group G.
  • once the group compounds process 225 has looped 529 through all the possible compound pairs for group G, it loops 526 to the next compound pair for creating a new group G of compounds.
  • once the group compounds process 225 has looped 526 through all compound pairs for initiating new groups of compounds, it ends 527.
  • a more detailed presently preferred embodiment of a group compounds process 650 first writes 648, or otherwise fills in, the mirror image of the bottom half of the correlation value array, e.g., DATA, to the top half of the array.
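The mirroring step can be sketched as follows. This assumes that the "bottom half" means the entries below the main diagonal of a square list-of-lists array; the function name and the sample values are hypothetical.

```python
def mirror_lower_to_upper(data):
    """Copy each entry below the diagonal to its mirror position above it."""
    n = len(data)
    for row in range(n):
        for col in range(row):
            data[col][row] = data[row][col]
    return data


# Hypothetical 3 x 3 array with only the bottom half filled in.
data = [
    [1.0, 0.0, 0.0],
    [0.7, 1.0, 0.0],
    [0.5, 0.8, 1.0],
]
mirror_lower_to_upper(data)
```

After the call, DATA[X][Y] and DATA[Y][X] hold the same value, so later steps can read a pair's correlation value in either orientation.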
  • the group compounds process 650 then loops 604 through all the compound pairs, attempting to locate 600 a compound X - compound Y pair whose correlation value, e.g., DATA[X][Y], is greater than or equal to INICORR, i.e., the initial correlation group limit parameter and the minimum value for selecting a first compound pair for a group.
  • once a compound X - compound Y pair whose correlation value is greater than or equal to INICORR is located 600, a new group G of compounds is initiated.
  • the group compounds process 650 then loops 601 through each correlation value in the array of correlation values, e.g., DATA. If all correlation values have been looped 601 through for the current group G, the group compounds process 650 continues its loop 604 through all the compound pairs, attempting to locate 600 a new compound X - compound Y pair for initiating a new group G. If all compound pairs have been looped 604 through for generating new groups of compounds, the group compounds process 650 ends 602. As previously described, the group compounds process 650 loops 601 through each correlation value in the DATA array.
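A minimal sketch of the search for initial compound pairs follows. It assumes zero-based array indices, visits each unordered pair once, and uses a hypothetical INICORR of 0.7, since no default for INICORR is given in this passage; the array values are also made up.

```python
def find_seed_pairs(data, inicorr):
    """Yield (x, y) index pairs, x < y, whose correlation is >= inicorr."""
    n = len(data)
    for x in range(n):
        for y in range(x + 1, n):
            if data[x][y] >= inicorr:
                yield x, y


# Hypothetical symmetric correlation array for four compounds.
data = [
    [1.0, 0.3, 0.8, 0.2],
    [0.3, 1.0, 0.4, 0.75],
    [0.8, 0.4, 1.0, 0.1],
    [0.2, 0.75, 0.1, 1.0],
]
INICORR = 0.7  # hypothetical value, chosen only for this illustration

seeds = list(find_seed_pairs(data, INICORR))
```

Here the pairs at indices (0, 2) and (1, 3), i.e., compound 1 - compound 3 and compound 2 - compound 4, would each initiate a new group G.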
  • for each correlation value in DATA, the group compounds process 650 checks 603 whether the row compound of the correlation value, e.g., DATA[row compound][column compound], is equal to compound X, which is the row compound of the initial compound pair of group G, and, thus, already a member of group G. If the row compound of the current correlation value being processed, i.e., the current correlation value, is equal to 619 compound X, the group compounds process 650 loops 601 to the next correlation value in the DATA array.
  • the group compounds process 650 loops 601 to the next correlation value in DATA array, i.e., DATA[1][3].
  • the group compounds process 650 does not check the correlation values DATA[A][B] where A is equal to X, the row compound of the initial compound pair comprising group G.
  • the group compounds process 650 does not check any correlation value in the compound 2 row of the DATA array; i.e., DATA[1][y].
  • the group compounds process 650 also checks 603 whether the column compound of the correlation value, e.g., DATA[row compound][column compound], is equal to compound Y, which is the column compound of the initial compound pair of group G, and, thus, already a member of group G.
  • the group compounds process 650 loops 601 to the next correlation value in the DATA array. For example, if the current correlation value row compound is 2 and column compound is 3, i.e., DATA[1][2], and compound 3 is the Y compound in group G, the group compounds process 650 loops 601 to the next correlation value in the DATA array, i.e., DATA[1][3].
  • the group compounds process 650 does not check any correlation value in the compound 3 column of the DATA array; i.e., DATA[x][2].
  • the group compounds process 650 checks 605 whether the row compound of the current correlation value is already a member of group G. Thus, the group compounds process 650 checks 605 whether the row compound of the current correlation value is a member of group G, but not the row compound X comprising the initial pair of compounds of group G. If there is only the initial pair of compounds in group G at this time, i.e., compounds X and Y, then the group compounds process 650 checks 605 whether the row compound of the current correlation value is equal to Y.
  • exemplary group G is comprised of initial compound pair compound 1 - compound 3, where compound 1 is the X compound and compound 3 is the Y compound, i.e., DATA[0][2] is the first correlation value found that is greater than or equal to INICORR for new group G.
  • the group compounds process 650 checks 605 whether the current correlation value is in the row of compound 3, i.e., DATA[2][y].
  • exemplary group G is comprised of initial compound pair compound 2 - compound 4, where compound 2 is the X compound and compound 4 is the Y compound, i.e., DATA[1][3] is the first correlation value that is greater than or equal to INICORR for new group G.
  • exemplary group G is also comprised of add-in compound 5.
  • the group compounds process 650 checks 605 whether the current correlation value is in the row of compound 4, i.e., DATA[3][y], or in the row of compound 5, i.e., DATA[4][y].
  • the group compounds process 650 checks 607 whether the current correlation value is greater than or equal to the group selection parameter for adding a compound to a group, e.g., MAXCORR.
  • the default value of MAXCORR is 0.6.
  • the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
  • the group compounds process 650 sets 616 a flag, e.g., FLAG, to indicate that the column compound of the current correlation value may possibly be added to the group G, e.g., FLAG is set to TRUE.
  • the group compounds process 650 then loops 608 through all the compounds already in group G.
  • the group compounds process 650 checks 609 whether the correlation value of the group G compound - current correlation value column compound pair, e.g., DATA[compound in group G][current correlation value column compound], is less than the minimum correlation group limit parameter, e.g., MINCORR.
  • the default value of MINCORR is 0.4.
  • exemplary group G comprises compounds 2, 5, and 6.
  • the current correlation value is the correlation value for the row compound 5 - column compound 8 compound pair.
  • the group compounds process 650 checks 609 whether any of the correlation values DATA[1][7], for respective compound 2 of group G - compound 8 (column compound of current correlation value) pair, DATA[4][7], for respective compound 5 of group G - compound 8 (column compound of current correlation value) pair, or DATA[5][7], for respective compound 6 of group G - compound 8 (column compound of current correlation value) pair, is less than MINCORR.
  • the group compounds process 650 sets 611 a flag, e.g., FLAG, to indicate that the column compound of the current correlation value is not to be added to group G, e.g., FLAG is set to FALSE.
  • the group compounds process 650 sets 611 FLAG to FALSE.
  • the group compounds process 650 checks 635 whether FLAG is set to TRUE. If no 636, the current correlation value column compound is not to be added to the current group G, and the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
  • the group compounds process 650 checks whether, if the current correlation value column compound was included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR.
  • a presently preferred embodiment default value for AVECORR is 0.5.
  • the group compounds process 650 assumes that the current correlation value column compound is a compound of group G and sums 643 the correlation values for all the compound pairs in group G. The group compounds process 650 then determines the average 638, e.g., AVG, correlation value for all the compound pairs in group G, by dividing the summation of the correlation values for all the compound pairs in group G by the number of compound pairs in group G.
  • the group compounds process 650 sums 643 the correlation values DATA[1][4], for group G compound 2 - group G compound 5 compound pair, DATA[1][5], for group G compound 2 - group G compound 6 compound pair, DATA[4][5], for group G compound 5 - group G compound 6 compound pair, DATA[1][7], for group G compound 2 - current correlation value column compound 8 compound pair, DATA[4][7], for group G compound 5 - current correlation value column compound 8 compound pair, and DATA[5][7], for group G compound 6 - current correlation value column compound 8 compound pair.
  • the group compounds process 650 determines the group G compound pair correlation value average 638, e.g., AVG, by dividing the summation of the correlation values for all the group G compound pairs by the number of compound pairs in group G; in our example, the summation of the correlation values of all the compound pairs in group G is divided by six (6).
  • the group compounds process 650 then checks 639 whether AVG is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR. If AVG is not 640 greater than or equal to AVECORR, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, AVG is 625 greater than or equal to AVECORR, the group compounds process 650 checks 629 whether the current correlation value column compound is already a member of group G.
  • the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, the current correlation value column compound is not 633 already a member of group G, the group compounds process 650 adds 646 the column compound for the current correlation value to group G. The group compounds process 650 then loops 601 to the next correlation value in the array DATA, to look for another potential add-in compound for group G.
  • group G currently comprises compounds 2, 5 and 6 and the current correlation value column compound is compound 8.
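The sequence of tests applied to a potential add-in compound (MAXCORR, then MINCORR against every group member, then AVECORR, then the membership check) can be sketched as below. The function name may_add, the zero-based index convention and all correlation values are assumptions for illustration; the three default thresholds are those stated in the text.

```python
from itertools import combinations

MAXCORR = 0.6  # default threshold for the current correlation value
MINCORR = 0.4  # default minimum for every member - candidate pair
AVECORR = 0.5  # default minimum average over all pairs in the group


def may_add(data, group, candidate, current_value):
    """Return True if `candidate` passes all four tests for `group`."""
    if current_value < MAXCORR:
        return False
    if any(data[member][candidate] < MINCORR for member in group):
        return False
    members = list(group) + [candidate]
    pairs = list(combinations(members, 2))
    average = sum(data[a][b] for a, b in pairs) / len(pairs)
    if average < AVECORR:
        return False
    return candidate not in group


# Hypothetical symmetric array; indices 1, 4, 5 are compounds 2, 5, 6
# and index 7 is compound 8, matching the example in the text.
n = 8
data = [[0.0] * n for _ in range(n)]
values = {(1, 4): 0.7, (1, 5): 0.6, (4, 5): 0.8,
          (1, 7): 0.5, (4, 7): 0.65, (5, 7): 0.45}
for (a, b), value in values.items():
    data[a][b] = data[b][a] = value

accepted = may_add(data, [1, 4, 5], 7, data[4][7])

# Lowering one member - candidate correlation below MINCORR rejects it.
data[5][7] = data[7][5] = 0.3
rejected = may_add(data, [1, 4, 5], 7, data[4][7])
```

The average in the accepted case is taken over six pairs, as in the text's example for a group of three compounds plus one candidate.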
  • the group compounds process 650 checks 605 whether the row compound of the current correlation value, i.e., DATA[row compound][column compound], is already a member of group G. If it is not 621, the group compounds process 650 checks 606 whether the column compound of the current correlation value is already a member of group G. Thus, the group compounds process 650 checks 606 whether the column compound of the current correlation value is already a member of group G, but not the column compound Y comprising the initial pair of compounds of group G. If there is only the initial pair of compounds in group G at this time, i.e., compounds X and Y, then the group compounds process 650 checks 606 whether the column compound of the current correlation value is equal to X.
  • exemplary group G is comprised of initial compound pair compound 1 - compound 3, where compound 1 is the X compound and compound 3 is the Y compound of group G.
  • the group compounds process 650 checks 606 whether the current correlation value is in the column of compound 1, i.e., DATA[x][0].
  • exemplary group G is comprised of initial compound pair compound 2 - compound 4, where compound 2 is the X compound and compound 4 is the Y compound of group G. Further, exemplary group G is also comprised of add-in compound 5.
  • the group compounds process 650 checks 606 whether the current correlation value is in the column of compound 2, i.e., DATA[x][1], or in the column of compound 5, i.e., DATA[x][4]. If the column compound for the current correlation value is not 623 equal to a compound already in group G, other than the Y compound of group G, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G.
  • the group compounds process 650 loops 601 to the next correlation value in the array DATA because compound 7, the current correlation value column compound, is not compound 2 or compound 5, the compounds in group G other than the Y compound.
  • the group compounds process 650 checks 607 whether the current correlation value is greater than or equal to the group selection parameter value for adding a new compound to a group, e.g., MAXCORR.
  • the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G. If the current correlation value is 615 greater than or equal to MAXCORR, the group compounds process 650 sets 617 a flag, e.g., FLAG, to indicate that the row compound of the current correlation value may possibly be added to the group G, e.g., FLAG is set to TRUE.
  • the group compounds process 650 then loops 608 through all the compounds already in group G. For each compound in group G, the group compounds process 650 checks 610 whether the correlation value of the current correlation value row compound - group G compound pair, e.g., DATA[current correlation value row compound][compound in group G], is less than the minimum correlation group limit parameter, e.g., MINCORR.
  • exemplary group G comprises compounds 2, 5, and 6.
  • the current correlation value is the correlation value for the row compound 1 - column compound 6 compound pair.
  • the group compounds process 650 checks 610 whether any of the correlation values DATA[0][1], for respective compound 1 (row compound of current correlation value) - compound 2 of group G pair, DATA[0][4], for respective compound 1 (row compound of current correlation value) - compound 5 of group G pair, or DATA[0][5], for respective compound 1 (row compound of current correlation value) - compound 6 of group G pair, is less than MINCORR.
  • the group compounds process 650 sets 612 a flag, e.g., FLAG, to indicate that the row compound of the current correlation value is not to be added to group G, e.g., FLAG is set to FALSE.
  • the group compounds process 650 sets 612 FLAG to FALSE.
  • the group compounds process 650 checks 635 whether FLAG is set to TRUE. If no 636, the current correlation value row compound is not to be added to the current group G, and the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
  • the group compounds process 650 checks whether, if the current correlation value row compound was included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR.
  • the group compounds process 650 assumes that the current correlation value row compound is a compound of group G and sums 644 the correlation values for all the compound pairs in group G. The group compounds process 650 then determines the average 638, e.g., AVG, correlation value for all the compound pairs in group G, by dividing the summation of the correlation values for all the compound pairs in group G by the number of compound pairs in group G.
  • group G is comprised of compounds 2, 5 and 6 and the current correlation value row compound is compound 1.
  • the group compounds process 650 sums 644 the correlation values DATA[1][4], for group G compound 2 - group G compound 5 compound pair, DATA[1][5], for group G compound 2 - group G compound 6 compound pair, DATA[4][5], for group G compound 5 - group G compound 6 compound pair, DATA[0][1], for current correlation value row compound 1 - group G compound 2 compound pair, DATA[0][4], for current correlation value row compound 1 - group G compound 5 compound pair, and DATA[0][5], for current correlation value row compound 1 - group G compound 6 compound pair.
  • the group compounds process 650 determines the group G compound pair correlation value average 638, e.g., AVG, by dividing the summation of the correlation values for all the group G compound pairs by the number of compound pairs in group G; in our example, the summation of the correlation values of all the compound pairs in group G is divided by six (6).
  • the group compounds process 650 then checks 639 whether AVG is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR. If AVG is not 640 greater than or equal to AVECORR, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, AVG is 626 greater than or equal to AVECORR, the group compounds process 650 checks 630 whether the current correlation value row compound is already a member of group G.
  • the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, the current correlation value row compound is not 637 already a member of group G, the group compounds process 650 adds 647 the row compound for the current correlation value to group G. The group compounds process 650 then loops 601 to the next correlation value in the array DATA, to look for another potential add-in compound for group G.
  • group G currently comprises compounds 2, 5 and 6, and the current correlation value row compound is compound 1.
  • the pattern recognition oriented cluster process 200 executes a determine overlapping groups process 235.
  • the determine overlapping groups process 235 determines if there are any established groups whose compounds are all contained in at least one other group. For each group X of compounds, the determine overlapping groups process 235 checks whether the compounds in group X are subsumed within any other group of compounds.
  • the determine overlapping groups process 235 loops X 750 times, once for each of the groups of compounds established in the group compounds process 225 of Figure 16, and/or the group compounds process 650 of Figures 19A-19C. When all the groups X of compounds are looped through, the determine overlapping groups process 235 ends 751.
  • the determine overlapping groups process 235 loops Y 752 times, once for each of the groups of compounds. When all the groups Y of compounds are looped 752 through, the determine overlapping groups process 235 loops 750 to the next group X of compounds.
  • the determine overlapping groups process 235 checks 753 whether group X is group Y. If so 754, the determine overlapping groups process 235 loops 752 to the next group Y. For each group Y, the determine overlapping groups process 235 also checks 753 whether group Y is already labeled an overlapping group, i.e., group Y is no longer to be processed. If group Y is 754 already labeled an overlapping group, the determine overlapping groups process 235 loops 752 to the next group Y.
  • each of the compounds in group Y is checked, or compared, 760 to each of the compounds in group X.
  • the determine overlapping groups process 235 next checks 756 whether all of the compounds in group Y are also in group X. If they are 758, then group Y is marked, or labeled, 759 as an overlapping group, i.e., it is no longer to be used. The determine overlapping groups process 235 then loops 752 to the next group Y. If, however, even one compound in group Y is not 757 in group X, group Y is not marked as an overlapping group at this time, and the determine overlapping groups process 235 loops 752 to the next group Y.
  • if a group X comprises compounds 1, 4, 7 and 8 and a group Y comprises compounds 1, 4 and 7, group Y is marked as an overlapping group, as its compounds are completely subsumed within group X.
  • group Y is not marked as an overlapping group, as it comprises a compound, compound 9, that is not also in group X.
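The subsumption check can be sketched with set comparisons. The function name and the third example group (which contains compound 9) are hypothetical. The sketch also skips any group X that has already been marked, a guard that is not described in the text and is added here only so that two identical groups do not mark each other.

```python
def mark_subsumed_groups(groups):
    """Return one flag per group; True means 'marked as overlapping'."""
    sets = [set(group) for group in groups]
    overlapping = [False] * len(sets)
    for x, group_x in enumerate(sets):
        if overlapping[x]:
            continue  # assumption: an already-marked group is not used as X
        for y, group_y in enumerate(sets):
            if x == y or overlapping[y]:
                continue
            if group_y <= group_x:  # every compound of Y is also in X
                overlapping[y] = True
    return overlapping


groups = [
    [1, 4, 7, 8],  # group X from the example
    [1, 4, 7],     # subsumed within the first group
    [1, 4, 9],     # hypothetical group: compound 9 is not in the first group
]
flags = mark_subsumed_groups(groups)
```

Only the second group is flagged; the third keeps its status because one of its compounds is missing from every other group.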
  • the pattern recognition oriented cluster process 200 upon executing the determine overlapping groups process 235, executes a combine groups process 240.
  • the combine groups process 240 combines one or more groups of compounds that have one or more compounds in common. For each group H of compounds, the combine groups process 240 checks whether one or more compounds in group H are also a member of another group K. If yes, the combine groups process 240 combines the group H and group K into one new group.
  • the combine groups process 240 loops H 850 times, once for each of the groups of compounds established in the group compounds process 225 of Figure 16, or the group compounds process 650 of
  • the combine groups process 240 checks 852 for each group H whether the group is marked as an overlapping group. As previously discussed, in a presently preferred embodiment, groups are marked, or labeled, as overlapping in the determine overlapping groups process 235. If the current group H is 853 marked as an overlapping group, i.e., it is not to be processed anymore, then the combine groups process 240 loops 850 to the next group H.
  • the combine groups process 240 loops K 855 times, once for each group that has not already been processed as an H group. For example, if there are ten groups of compounds and current group H is the second group, then K is eight, and the combine groups process 240 loops 855 eight times, through the third through tenth groups.
  • the combine groups process 240 checks 856 whether the group is marked as an overlapping group. If it is 858, the combine groups process 240 loops 855 to the next group K.
  • the combine groups process 240 loops J 859 times, once for each compound in group K.
  • the combine groups process 240 for each compound J in group K, loops I 860 times, once for each compound in the initial group H.
  • the combine groups process 240 loops 859 to the next J compound in group K.
  • the combine groups process 240 checks every compound in the H group against every compound in the K group.
  • the combine groups process 240 keeps a running summation, or total, 861 of correlation values for each group H, compound I - group K, compound J pair. For example if the current group H is comprised of compounds 2, 5 and 7 and the current group K is comprised of compounds 4 and 6, then the combine groups process sums the correlation values DATA[1][3], for the group H compound 2 - group K compound 4 compound pair, DATA[4][3], for the group H compound 5 - group K compound 4 compound pair, DATA[6][3], for the group H compound 7 - group K compound 4 compound pair, DATA[1][5], for the group H compound 2 - group K compound 6 compound pair, DATA[4][5], for the group H compound 5 - group K compound 6 compound pair, and DATA[6][5], for the group H compound 7 - group K compound 6 compound pair.
  • the combine groups process 240 also checks 862 if compound I of group H is the same as compound J of group K. If yes 863, the combine groups process 240 flags 865 compound I of group H as also a member of group K. In a presently preferred embodiment, the combine groups process 240 sets a flag 865, e.g. FLAG, to indicate that group H and group K overlap, e.g., FLAG is set equal to TRUE. The combine groups process 240 then loops 860 to the next compound I in group H. If compound I is not the same 864 as compound J of group K, the combine groups process 240 simply loops 860 to the next I compound in group H.
  • when all compounds I for group H have been looped 860 through for all 859 compounds J of group K, the combine groups process 240 generates a group similarity score for the group H - group K group pair.
  • a group similarity score for a group H - group K group pair is generated from at least the correlation value for a group H compound - group K compound pair.
  • the correlation value for a compound from group H and a compound from group K comprises the group similarity score for the group H - group K group pair.
  • a group similarity score is a mean cumulative distance value, or average correlation value, for a pair of groups.
  • the combine groups process 240 generates 866 the mean cumulative distance value for all group H - group K compound pairs, and stores it in an array, e.g., AVG.
  • the AVG[H][K] value is generated from the running summation 861 of correlation values for each group H, compound I - group K, compound J pair; it is the average of all group H, compound I - group K, compound J correlation values.
  • the mean cumulative distance value for a pair of groups serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value for a pair of groups, the more similar the objects in the two groups generally are.
  • group H is the second group and it is comprised of three compounds - compounds 2, 5 and 7.
  • Group K is the fourth group and it is comprised of two compounds - compounds 4 and 6.
  • AVG[1][3] is the average correlation value for all the compound pairs in [group H][group K].
  • the value of AVG[1][3] is shown in Equation 8.
  • AVG[1][3] = (DATA[1][3] + DATA[1][5] + DATA[4][3] + DATA[4][5] + DATA[6][3] + DATA[6][5])/6 (Equation 8)
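The group similarity score for a group H - group K pair, i.e., the average of all cross-group correlation values, can be sketched as follows. The function name and the correlation values are assumptions; the index sets correspond to compounds 2, 5 and 7 for group H and compounds 4 and 6 for group K, using zero-based indices as in the text.

```python
def group_similarity(data, group_h, group_k):
    """Average correlation over every (compound in H, compound in K) pair."""
    total = 0.0
    for i in group_h:
        for j in group_k:
            total += data[i][j]
    return total / (len(group_h) * len(group_k))


# Hypothetical symmetric array for seven compounds.
n = 7
data = [[0.0] * n for _ in range(n)]
values = {(1, 3): 0.6, (1, 5): 0.5, (4, 3): 0.7,
          (4, 5): 0.8, (6, 3): 0.4, (6, 5): 0.6}
for (a, b), value in values.items():
    data[a][b] = data[b][a] = value

score = group_similarity(data, [1, 4, 6], [3, 5])
```

With three compounds in group H and two in group K, the sum of six correlation values is divided by six, matching the divisor in the example.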
  • the combine groups process 240 checks 867 whether FLAG is set TRUE, i.e., whether there is at least one compound in group H that is also in group K. If no 871, the combine groups process 240 loops 855 to the next group K. If, however, FLAG is set TRUE 872, indicating there is at least one compound in group H that is also in group K, the combine groups process 240 writes, or otherwise stores or assigns, all of the compounds of group H that are not already a member of group K to group K. In a presently preferred embodiment, the combine groups process 240 loops I 868 times, once for each of the compounds in group H. For each compound I, the combine groups process 240 checks 869 whether it has been flagged as being in group K.
  • the combine groups process 240 loops 868 to the next compound I in group H. If compound I in group H is not 874 flagged as being in group K, compound I is written, or stored, 870 to group K. The combine groups process 240 then loops 868 to the next compound I.
  • the combine groups process 240 marks, or labels, 876 group H as an overlapping group, i.e., it is no longer to be used.
  • the combine groups process 240 then loops 850 to the next group H, and begins the process anew.
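The merge step can be sketched as below. The function name and the example groups are hypothetical, and the sketch assumes that once group H has been written into a group K and marked as overlapping, processing moves straight on to the next group H.

```python
def combine_groups(groups, overlapping):
    """Merge each group H into the first later group K that shares a compound."""
    groups = [list(group) for group in groups]
    overlapping = list(overlapping)
    for h in range(len(groups)):
        if overlapping[h]:
            continue
        for k in range(h + 1, len(groups)):
            if overlapping[k]:
                continue
            if set(groups[h]) & set(groups[k]):
                # Write the compounds of H that K does not yet contain.
                for compound in groups[h]:
                    if compound not in groups[k]:
                        groups[k].append(compound)
                overlapping[h] = True  # H is no longer to be used
                break
    return groups, overlapping


merged, flags = combine_groups(
    [[2, 5, 7], [4, 6], [5, 9]],  # hypothetical groups
    [False, False, False],
)
```

In this illustration the first and third groups share compound 5, so compounds 2 and 7 are written to the third group and the first group is flagged.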
  • an optimally link compounds in group process 245 is executed.
  • the optimally link compounds in group process 245 optimally orders the compounds in each established group of compounds, based on the respective compound pairs' object similarity score.
  • the optimally link compounds in group process 245 optimally orders the compounds in each group of compounds based on the respective compound pairs' correlation value.
  • the optimally link compounds in group process 245 ends 901.
  • the optimally link compounds in group process 245 checks 902 for each group H whether the group is marked as an overlapping group, i.e., the group is not to be used.
  • groups are marked as overlapping groups in the determine overlapping groups process 235, as described with reference to Figure 20, and/or in the combine groups process 240, as described with reference to Figures 21A-21B.
  • the optimally link compounds in group process 245 loops 900 to the next group H. If, however, the current group H is not 904 marked as an overlapping group, the optimally link compounds in group process 245 loops through all the compounds in group H and locates the two unique compounds with the largest correlation value.
  • exemplary group H comprises compounds C 926, G 927, J 928 and K 929.
  • in exemplary correlation value array DATA 925, the largest correlation value for a unique compound pair for the compounds of exemplary group H is the correlation value 930 for compounds G 927 and J 928, which is equal to 0.9; i.e., DATA[6][9] is equal to 0.9.
  • the other correlation values for the other unique compound pairs in exemplary group H are less than 0.9; i.e., DATA[2][6], for the compound C 926 - compound G 927 compound pair, is 0.6; DATA[2][9], for the compound C 926 - compound J 928 compound pair, is equal to 0.5; DATA[2][10], for the compound C 926 - compound K 929 compound pair, is equal to 0.5; DATA[6][10], for the compound G 927 - compound K 929 compound pair, is equal to 0.6; and, DATA[9][10], for the compound J 928 - compound K 929 compound pair, is equal to 0.8.
  • the optimally link compounds in group process 245 sets 905 a first variable, e.g., MAX1, to the compound row of the largest correlation value.
  • MAX1 is set to six, which is the row compound index of the DATA array 925 corresponding to compound G 927.
  • the optimally link compounds in group process 245 also sets 905 a second variable, e.g., MAX2, to the compound column of the largest correlation value.
  • MAX2 is set to nine, which is the column compound index of the DATA array 925 corresponding to compound J 928.
  • the optimally link compounds in group process 245 also flags 906 the compound equal to MAX1 and compound equal to MAX2 as already linked for group H.
  • compound G 927 and compound J 928 of group H are flagged as linked for group H.
  • the optimally link compounds in group process 245 also sets 906 the MAX1 compound as the current head of the link of compounds for group H, and the MAX2 compound as the current tail of the link of compounds for group H.
  • compound G 927 is set as the current head of the link
  • compound J 928 is set as the current tail of the link.
  • the optimally link compounds in group process 245 then checks 907 whether all the compounds in the current group H are flagged as linked for group H. If yes 908, the optimally link compounds in group process 245 loops 900 to the next group H.
  • the optimally link compounds in group process 245 loops I 910, once for each compound in group H that is not already linked, and locates the largest correlation value of the MAX1 - non-linked compounds in group H pairs.
  • the largest correlation value for a MAX1 - non-linked compound in group H pair is set to a variable, e.g., MAXCORR.
  • as the optimally link compounds in group process 245 loops 910 through all compounds in group H that are not already linked, it also locates the largest correlation value of the MAX2 - non-linked compounds in group H pairs.
  • the largest correlation value for a MAX2 - non-linked compound in group H pair is set to a variable, e.g., MINCORR.
  • MAX1 is set to the DATA array index for compound G 927 and MAX2 is set to the DATA array index for compound J 928.
  • Compounds C 926 and K 929 of group H are not linked for the group yet.
  • the optimally link compounds in group process 245 finds the largest correlation value between the MAX1 - compound C 926 and MAX1 - compound K 929 compound pairs, and stores it in MAXCORR.
  • the optimally link compounds in group process 245 checks the correlation value DATA[6][2], i.e., the correlation value for the MAX1 compound G 927 - compound C 926 pair, and the correlation value DATA[6][10], i.e., the correlation value for MAX1 compound G 927 - compound K 929 pair.
  • the optimally link compounds in group process 245 sets MAXCORR to the correlation value for the first MAX1 - non-linked compound pair.
  • DATA[6][2] i.e., 0.6
  • MAXCORR is set to DATA[6][2]
  • the non-linked compound associated with MAXCORR is compound C 926.
  • the correlation value DATA[6][6], i.e., the correlation value for the MAX1 compound G 927 - MAX1 compound G 927 pair, and the correlation value DATA[6][9], i.e., the correlation value for the MAX1 compound G 927 - MAX2 compound J 928 pair, are not checked as both compounds G 927 and J 928 are already linked for group H.
  • MAX1 is set to the DATA array index for compound G 927 and MAX2 is set to the DATA array index for compound J 928.
  • Compounds C 926 and K 929 of group H are not linked for the group yet.
  • the optimally link compounds in group process 245 also finds the largest correlation value between the MAX2 - compound C 926 and MAX2 - compound K 929 compound pairs, and stores it in MINCORR.
  • the optimally link compounds in group process 245 checks the correlation value DATA[9][2], i.e., the correlation value for the MAX2 compound J 928 - compound C 926 pair, and the correlation value DATA[9][10], i.e., the correlation value for the MAX2 compound J 928 - compound K 929 pair.
  • DATA[9][10] i.e., 0.8
  • DATA[9][2] i.e., 0.5
  • MINCORR is set to DATA[9][10] and the non-linked compound associated with MINCORR is compound K 929.
  • the correlation value DATA[9][6], i.e., the correlation value for the MAX2 compound J 928 - MAX1 compound G 927 pair, and the correlation value DATA[9][9], i.e., the correlation value for the MAX2 compound J 928 - MAX2 compound J 928 pair, are not checked as both compounds G 927 and J 928 are already linked for group H.
  • the optimally link compounds in group process 245 locates 910 the maximum correlation value, e.g., MAXCORR, for the MAX1 compound - non-linked compound pairs.
  • the optimally link compounds in group process 245 also locates 910 the maximum correlation value, e.g., MINCORR, for the MAX2 compound - non-linked compound pairs. Once both MAXCORR and MINCORR are located for the respective MAX1 and MAX2 compound rows of the correlation value array, e.g., DATA, the optimally link compounds in group process 245 checks 911 whether MAXCORR is greater than MINCORR. If yes 914, the non-linked compound in group H associated with MAXCORR has a stronger correlation, or similarity, with the current head of the link of group H compounds than the non-linked compound in group H associated with MINCORR has with the current tail of the link of group H compounds.
  • the optimally link compounds in group H process 245 links 912 the non-linked compound associated with MAXCORR as the new head of the link of group H compounds.
  • the optimally link compounds in group process 245 also flags 916 the new link head compound as linked for group H.
  • the variable MAX1 is also set 917 to the DATA array index for the new link head compound.
  • the optimally link compounds in group process 245 then checks 907 whether all the compounds in group H are now flagged as linked for group H.
  • the non-linked compound corresponding to the correlation value MAXCORR is compound C 926. If MAXCORR had been greater than MINCORR, compound C 926 would be linked to the preceding link head for group H, compound G 927. Compound C 926 would then be the new link head for group H and MAX1 would be set to the DATA array index for compound C 926. Further, compound C 926 would be flagged as linked for group H. However, in exemplary DATA array 925 of Figure 24, MAXCORR, i.e., 0.6, is not greater than MINCORR, i.e., 0.8.
  • the optimally link compounds in group H process 245 links 913 the non-linked compound associated with MINCORR as the new tail of the link of group H compounds.
  • the optimally link compounds in group process 245 also flags 918 the new link tail compound as linked for group H.
  • the variable MAX2 is also set 919 to the DATA array index for the new link tail compound.
  • the optimally link compounds in group process 245 then checks 907 whether all the compounds in group H are now flagged as linked for group H.
  • the non-linked compound corresponding to the correlation value MINCORR is compound K 929.
  • MAXCORR is not greater than MINCORR
  • compound K 929 is linked to the current link tail for group H, compound J 928.
  • Compound K 929 is the new link tail for group H and MAX2 is set to the DATA array index for compound K 929. Further, compound K 929 is flagged as linked for group H.
  • the link for exemplary group H is shown in Equation 9.
  • compound G - compound J - compound K (Equation 9). In Equation 9, compound G 927 is the head of the link and compound K 929 is the tail of the link for group H.
  • the optimally link compounds in group process 245 locates 910 the maximum correlation value, e.g., MAXCORR, for the MAX1 compound - non-linked compounds in group H pairs.
  • the optimally link compounds in group process 245 also locates 910 the maximum correlation value, e.g., MINCORR, for the MAX2 compound - non-linked compounds in group H pairs.
  • the current MAX1 compound is compound G 927 and the current MAX2 compound is compound K 929.
  • the only non-linked compound remaining in group H is compound C 926.
  • the optimally link compounds in group process 245 checks 910 the correlation value DATA[6][2], i.e., the correlation value for the MAX1 compound G 927 - compound C 926 pair.
  • As the correlation value DATA[6][2] is 0.6, and there are no other MAX1 compound - non-linked compound correlation values to check because no other compounds in group H remain to be linked, MAXCORR is set to 0.6.
  • the optimally link compounds in group process 245 also checks 910 the correlation value DATA[10][2], i.e., the correlation value for the MAX2 compound K 929 - compound C 926 pair. As the correlation value DATA[10][2] is 0.5, and there are no other MAX2 compound - non-linked compound correlation values to check because no other compounds in group H remain to be linked, MINCORR is set to 0.5.
  • the optimally link compounds in group H process 245 checks 911 whether MAXCORR is greater than MINCORR.
  • MAXCORR i.e., 0.6
  • MINCORR i.e., 0.5
  • the optimally link compounds in group process 245 links 912 the non-linked compound corresponding to MAXCORR, i.e., compound C 926, to the preceding link head for group H, compound G 927.
  • Compound C 926 is the new link head for group H and compound C 926 is flagged 916 as linked for group H.
  • MAX1 is set 917 to the DATA array index for compound C 926.
  • compound C - compound G - compound J - compound K (Equation 10). In Equation 10, compound C 926 is the head of the link and compound K 929 is the tail of the link for group H.
  • the optimally link compounds in group process 245 loops 900 to the next group H. As previously described, when all groups H have been looped 900 through, the optimally link compounds in group process 245 is ended 901.
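The head/tail linking of compounds within one group, as walked through above, can be sketched in Python. This is a minimal illustration, not the patented implementation: the function name `link_compounds` and the list-based link are hypothetical, and the values DATA[6][9] = 0.9 and DATA[6][10] = 0.4 are assumed for illustration because the text above does not state them. The remaining values (0.6, 0.8, 0.5, 0.5) are taken from the example.

```python
def link_compounds(group, data):
    """Greedily order the compounds of one group into a head-to-tail link."""
    # Seed the link with the most strongly correlated pair in the group
    # (MAX1 becomes the head, MAX2 the tail).
    pairs = [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]
    max1, max2 = max(pairs, key=lambda p: data[p[0]][p[1]])
    link = [max1, max2]  # link[0] is the head, link[-1] is the tail
    remaining = [c for c in group if c not in (max1, max2)]

    while remaining:
        # MAXCORR: best correlation between the head and a non-linked compound.
        head_best = max(remaining, key=lambda c: data[link[0]][c])
        maxcorr = data[link[0]][head_best]
        # MINCORR: best correlation between the tail and a non-linked compound.
        tail_best = max(remaining, key=lambda c: data[link[-1]][c])
        mincorr = data[link[-1]][tail_best]

        if maxcorr > mincorr:
            link.insert(0, head_best)   # new head of the link
            remaining.remove(head_best)
        else:
            link.append(tail_best)      # new tail of the link
            remaining.remove(tail_best)
    return link


# Example group H: compounds C (index 2), G (6), J (9) and K (10).
SIZE = 11
DATA = [[0.0] * SIZE for _ in range(SIZE)]
example_values = {
    (6, 9): 0.9,    # assumed: the largest pair in the group (not given in the text)
    (6, 2): 0.6,
    (6, 10): 0.4,   # assumed: only known to be smaller than DATA[6][2]
    (9, 10): 0.8,
    (9, 2): 0.5,
    (10, 2): 0.5,
}
for (i, j), value in example_values.items():
    DATA[i][j] = DATA[j][i] = value

order = link_compounds([2, 6, 9, 10], DATA)
print(order)  # [2, 6, 9, 10], i.e. C - G - J - K
```

Under these assumed values, the sketch reproduces the order of the example: compound K is appended as the tail first, then compound C becomes the new head.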
  • an optimally link groups process 250 is executed.
  • the optimally link groups process 250 optimally orders the groups of compounds, based on the group similarity scores.
  • the optimally link groups process 250 optimally orders the groups of compounds based on the mean cumulative distance values of the respective pairs of groups.
  • an average correlation value, or mean cumulative distance value, e.g., AVG[group1][group2], is generated for each group pair, i.e., each two groups of the groups formed of the objects to be clustered, in the combine groups process 240.
  • the mean cumulative distance value for a pair of groups serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value for a pair of groups, the more similar the objects in the two groups generally are.
  • the combine groups process 240 only generates mean cumulative distance values for the top half 1052 of the array of average correlation values for all group pairs.
  • the optimally link groups process 250 first writes, or otherwise copies or stores, 1041 the mirror image of the top half of the average correlation value array, e.g., AVG, to the bottom half of the array.
  • the optimally link groups process 250 writes 1041 all the values in the top half 1052 of the AVG array 1050 to the respective entries in the bottom half 1051 of the AVG array 1050.
  • the average correlation value in AVG[0][1] is written to AVG[1][0].
  • the average correlation value in AVG[0][2] is written to AVG[2][0];
  • the average correlation value in AVG[0][3] is written to AVG[3][0]; and so on.
  • the diagonal 1053 of the AVE array 1050 is generally not relevant, as it represents the average correlation value for one group, and not for a pair of groups.
  • the combine groups process 240 after generating the top half of the average correlation value array, e.g., AVG, writes, or otherwise copies, the mirror image of the top half of the AVG array, to the bottom half of the AVG array.
  • the combine groups process 240 after generating a mean cumulative distance value for a pair of groups, writes the value to both the top half and the bottom half of the AVG array.
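Copying the top half of the average correlation value array to the bottom half amounts to making the array symmetric. A minimal Python sketch follows; the function name `mirror_upper_to_lower` and the sample values are illustrative assumptions, not taken from the source.

```python
def mirror_upper_to_lower(avg):
    """Copy each entry above the diagonal to its mirror position below it."""
    n = len(avg)
    for i in range(n):
        for j in range(i + 1, n):
            avg[j][i] = avg[i][j]  # e.g. AVG[0][1] is written to AVG[1][0]
    return avg


# Hypothetical 3-group array with only the top half filled in.
avg = [
    [0.0, 0.7, 0.4],
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0],
]
mirror_upper_to_lower(avg)
```

The diagonal is left untouched, consistent with the statement that it is generally not relevant.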
  • After the optimally link groups process 250 copies, or otherwise writes or stores, 1041 the top half of the array of average correlation values to the bottom half, it locates 1025, among all groups that have not been previously labeled as overlapping, the two groups, e.g., group1 and group2, with the largest mean cumulative distance value.
  • overlapping groups are identified and labeled in the determine overlapping groups process 235, as described with reference to Figure 20, and/or in the combine groups process 240, as described with reference to Figures 21A-21B. Groups marked as overlapping are not to be further processed.
  • the optimally link groups process 250 sets 1026 the head of the link of groups to group1 and the tail of the link of groups to group2.
  • the optimally link groups process 250 also flags 1026 group1 and group2 as linked groups.
  • the optimally link groups process 250 then loops 1027 through all groups that are not already linked and are not flagged as overlapping. For each loop 1027, the optimally link groups process 250 locates 1028 the non-linked group I that has the largest mean cumulative distance value with the current head of the link group, or simply head group. In a presently preferred embodiment, the optimally link groups process 250 sets a variable, e.g., MAXCORR, to this largest mean cumulative distance value, e.g., AVE[head][I].
  • For each loop 1027, the optimally link groups process 250 also locates 1029 the non-linked group J that has the largest mean cumulative distance value with the current tail of the link group, or simply tail group. In a presently preferred embodiment, the optimally link groups process 250 sets a variable, e.g., MINCORR, to this largest mean cumulative distance value, e.g., AVE[tail][J].
  • the optimally link groups process 250 after setting MAXCORR and MINCORR, checks 1030 whether MAXCORR is greater than MINCORR. If yes 1033, the group I, that with the head group has the largest average correlation value, is more similar to the head group than the group J, that with the tail group has the largest average correlation value, is similar to the tail group.
  • the optimally link groups process 250 sets 1031 the head of the link of groups to group I, i.e., the current head group is set equal to group I. Group I is the new current head group, and it is linked to the previous head group, group1.
  • the optimally link groups process 250 also flags 1031 group I as a linked group.
  • If the optimally link groups process 250 checks 1030 whether MAXCORR is greater than MINCORR and it is not 1034, the group J, that with the tail group has the largest average correlation value, is more similar to the tail group than the group I, that with the head group has the largest average correlation value, is similar to the head group.
  • the optimally link groups process 250 sets 1032 the tail of the link of groups to group J, i.e., the current tail group is set equal to group J.
  • Group J is the new current tail group, and the previous tail group, group2, is linked to group J.
  • the optimally link groups process 250 also flags 1032 group J as a linked group.
  • the optimally link groups process 250 then checks 1035 whether all non-overlapping groups are linked. If yes 1036, the optimally link groups process 250 optimally links 1040 the compounds in all the linked groups, as further described below. The optimally link groups process 250 is then ended 1038. If, however, there are more groups to link 1037, the optimally link groups process 250 loops 1027 again through all the non-overlapping groups that are not already linked.
  • Exemplary AVE array 1050 of average correlation values, or mean cumulative distance values, for groups, as shown in Figure 26, stores the mean cumulative distance values for five groups of compounds.
  • the bottom half 1051 of the AVE array 1050 is the mirror image of the top half 1052 of the array 1050.
  • the diagonal 1053 of the AVE array 1050 is generally not relevant, as it represents the average correlation value for one group, and not a group pair. In the present example, none of the five groups represented in the AVE array 1050 are flagged as overlapping, i.e., non-usable, groups at this time.
  • Group A - group D have the largest mean cumulative distance value in AVE array 1050, i.e., AVE[0][3] is 0.9.
  • the optimally link groups process 250 sets
  • group A is the head of the link and group D is the tail of the link.
  • Group A and group D are also flagged 1026 as linked.
  • the optimally link groups process 250 then loops 1027 through all non-overlapping groups that are not already linked, i.e., groups B, C and E.
  • the optimally link groups process 250 locates 1028 the non-linked group that with the current head group has the largest mean cumulative distance value.
  • the non-linked group B with the head group A has the largest mean cumulative distance value, i.e., AVE[0][1] equals 0.7.
  • the optimally link groups process 250 sets 1028 the variable MAXCORR equal to AVE[0][1], i.e., 0.7.
  • the optimally link groups process 250 also locates 1029 the group that with the current tail group has the largest mean cumulative distance value.
  • the non-linked group C with the tail group D has the largest mean cumulative distance value, i.e., AVE[3][2] equals 0.6.
  • the optimally link groups process 250 sets 1029 the variable MINCORR equal to AVE[3][2], i.e., 0.6.
  • the optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is 1033, as MAXCORR is now equal to 0.7 and MINCORR is equal to 0.6, so the optimally link groups process 250 links 1031 the non-linked group B associated with MAXCORR as the new head group.
  • the original head group, i.e., group A, is linked to the new head group B. At this time, the link of groups is as shown in Equation 12.
  • group B - group A - group D (Equation 12). In Equation 12, group B is the head of the link and group D remains the tail of the link.
  • Group B the new head group, is also flagged 1031 as linked.
  • the optimally link groups process 250 then checks 1035 whether all groups have been linked. They have not 1037; groups C and E remain to be linked.
  • the optimally link groups process 250 once again loops 1027 through all non-overlapping, non-linked groups, i.e., groups C and E.
  • the optimally link groups process 250 locates 1028 the group that with the current head group, i.e., group B, has the largest mean cumulative distance value.
  • the non-linked group C with the head group B has the largest mean cumulative distance value, i.e., AVE[1][2] equals 0.8.
  • the optimally link groups process 250 sets 1028 the variable MAXCORR to 0.8.
  • the optimally link groups process 250 also locates 1029 the group that with the current tail group, i.e., group D, has the largest mean cumulative distance value.
  • the non-linked group C with the tail group D has the largest mean cumulative distance value, i.e., AVE[3][2] equals 0.6.
  • the optimally link groups process 250 sets 1029 the variable MINCORR to 0.6.
  • the optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is 1033, as MAXCORR is now equal to 0.8 and MINCORR is equal to 0.6, so the optimally link groups process 250 links 1031 the non-linked group C associated with MAXCORR as the new head group.
  • the original head group, i.e., group B, is linked to the new head group C. At this time, the link of groups is as shown in Equation 13.
  • group C - group B - group A - group D (Equation 13). In Equation 13, group C is the head of the link and group D remains the tail of the link.
  • Group C is also flagged 1031 as linked.
  • the optimally link groups process 250 then checks 1035 whether all groups have been linked. They have not 1037; group E still remains to be linked. Thus, the optimally link groups process 250 once again loops 1027 through all non-linked, non-overlapping groups, i.e., group E.
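  • variable MAXCORR is set to the average correlation value for the head group C - group E pair, i.e., AVE[2][4], which equals 0.5.
  • variable MINCORR is set to the average correlation value for the tail group D - group E pair, i.e., AVE[3][4], which equals 0.5.
  • the optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is not 1034, as MAXCORR and MINCORR are both equal to 0.5, so the optimally link groups process 250 links 1032 the non-linked group E associated with MINCORR as the new tail group.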
  • Group E is also flagged 1032 as linked.
  • the optimally link groups process 250 then checks 1035 whether all groups have been linked. They are 1036, so the optimally link groups process 250 optimally links 1040 the compounds in all the linked groups, as further described below. The optimally link groups process 250 then ends 1038.
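The group-linking walk-through for the five example groups can be traced with a short Python sketch. It is a hedged illustration only: `link_groups` and its `overlapping` parameter are hypothetical names, and the entries AVE[0][2], AVE[0][4], AVE[1][3] and AVE[1][4] are assumed values chosen to be consistent with the example, since the text does not state them. The other six values come from the example.

```python
def link_groups(ave, overlapping=frozenset()):
    """Order non-overlapping groups into a head-to-tail link of groups."""
    usable = [g for g in range(len(ave)) if g not in overlapping]
    # Seed with the pair of groups having the largest value in the array.
    pairs = [(a, b) for i, a in enumerate(usable) for b in usable[i + 1:]]
    head, tail = max(pairs, key=lambda p: ave[p[0]][p[1]])
    link = [head, tail]
    unlinked = [g for g in usable if g not in link]

    while unlinked:
        group_i = max(unlinked, key=lambda g: ave[link[0]][g])   # best for head
        group_j = max(unlinked, key=lambda g: ave[link[-1]][g])  # best for tail
        maxcorr = ave[link[0]][group_i]
        mincorr = ave[link[-1]][group_j]
        if maxcorr > mincorr:
            link.insert(0, group_i)   # group I becomes the new head group
            unlinked.remove(group_i)
        else:
            link.append(group_j)      # group J becomes the new tail group
            unlinked.remove(group_j)
    return link


NAMES = "ABCDE"
upper = {
    (0, 1): 0.7, (0, 3): 0.9, (1, 2): 0.8,
    (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.5,
    # Assumed values, not given in the text:
    (0, 2): 0.4, (0, 4): 0.3, (1, 3): 0.2, (1, 4): 0.1,
}
AVE = [[0.0] * 5 for _ in range(5)]
for (i, j), value in upper.items():
    AVE[i][j] = AVE[j][i] = value  # bottom half mirrors the top half

group_order = link_groups(AVE)
print(" - ".join(NAMES[g] for g in group_order))  # C - B - A - D - E
```

Because the final comparison is a tie (0.5 against 0.5) and the test is strictly "greater than", group E is attached at the tail rather than the head.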
  • an optimally link groups process 2020 first writes 1079, or otherwise copies, the mirror image of the top half of the average correlation, or mean cumulative distance, value array, e.g., AVG, to the bottom half of the array.
  • the optimally link groups process 2020 then loops 1080 through all non-overlapping compound groups, i.e., all groups that are not marked as overlapping, and locates the two groups with the largest average correlation value, or mean cumulative distance value.
  • the optimally link groups process 2020 sets 1080 a variable, e.g., TOK1, to group I and sets 1080 a second variable, e.g., TOK2, to group J.
  • the TOK1 group is also set 1080 as the original head of the link of groups and the TOK2 group is set 1080 as the original tail of the link of groups.
  • the optimally link groups process 2020 also flags 1081 the TOK1 group and the TOK2 group as linked.
  • the optimally link groups process 2020 then loops 1082 through all non-overlapping, non-linked groups I.
  • the optimally link groups process 2020 checks 1083 whether all groups have been linked. If no 1086, the optimally link groups process 2020 locates 1085 the TOK1 head group - non-linked group I pair with the largest mean cumulative distance value, e.g., the largest value of AVE[TOK1][I].
  • the optimally link groups process 2020 sets 1085 a variable, e.g., MAXCORR, to the largest value of AVE[TOK1][I].
  • the optimally link groups process 2020 uses the first I group as the group corresponding to MAXCORR. For example, if the mean cumulative distance value for the TOK1 - group four pair is 0.9 and the mean cumulative distance value for the TOK1 - group seven pair is also 0.9, and 0.9 is the largest mean cumulative distance value for all TOK1 group - non-linked group I pairs, then the optimally link groups process 2020 sets MAXCORR equal to 0.9 and associates the non-linked fourth group with MAXCORR.
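The tie-breaking rule described above (the first group reaching the largest value is kept) falls out naturally from a scan that only replaces the current best on a strictly greater value. The sketch below assumes a hypothetical helper name, `first_best_partner`, and uses a small dictionary in place of the AVE array.

```python
def first_best_partner(ave, anchor, candidates):
    """Return (group, value) for the first candidate with the largest value."""
    best_group, best_value = None, float("-inf")
    for group in candidates:
        value = ave[anchor][group]
        if value > best_value:  # strict comparison: later ties do not replace
            best_group, best_value = group, value
    return best_group, best_value


# Hypothetical row for the TOK1 head group (index 0): groups 4 and 7 tie at 0.9.
ave_row = {0: {4: 0.9, 5: 0.3, 7: 0.9}}
best = first_best_partner(ave_row, 0, [4, 5, 7])
print(best)  # (4, 0.9)
```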
  • the optimally link groups process 2020 also locates 1087 the TOK2 tail group - non-linked group I pair with the largest mean cumulative distance value, e.g., the largest value of AVE[TOK2][I].
  • the optimally link groups process 2020 sets 1087 a variable, e.g., MINCORR, to the largest value of AVE[TOK2][I].
  • the optimally link groups process 2020 uses, or otherwise associates, the first I group as the group corresponding to MINCORR. Once the optimally link groups process 2020 locates a current MAXCORR and a current MINCORR, it checks 1090 whether MAXCORR is greater than MINCORR. If yes 1088, the optimally link groups process 2020 sets 1091 the non-linked group I associated with MAXCORR as the new head group. The previous head group, TOK1, is linked to the new head group I. The new head group I is also flagged 1093 as linked. The variable TOK1 is also set 1095 to the new head group I. The optimally link groups process 2020 then loops 1082 once again through all the non-overlapping, non-linked groups.
  • the optimally link groups process 2020 sets 1092 the non-linked group I associated with MINCORR as the new tail group.
  • the previous tail group, TOK2, is linked to the new tail group I.
  • the new tail group I is also flagged 1094 as linked.
  • the variable TOK2 is also set 1096 to the new tail group I.
  • the optimally link groups process 2020 then loops 1082 once again through all the non-overlapping, non-linked groups.
  • the optimally link groups process 2020 loops H 1097 times, once for each group in the link of groups.
  • the optimally link groups process 2020 loops 1097 from the head group to the tail group of the link of groups. Once all the groups H have been looped 1097 through, the optimally link groups process 2020 is ended 2021.
  • the optimally link groups process 2020 first checks 1098 whether the current group H is the head group of the link of groups. If it is 2000, the optimally link groups process 2020 sets 2001 a first variable, e.g., A, to the correlation value of the first compound in the head group and the first compound in the second group in the link of groups. With DATA as the array of correlation values for all compound pairs to be clustered, A is set to DATA[1st compound in Head group][1st compound in 2nd group]. The optimally link groups process 2020 also sets 2001 a second variable, e.g., B, to the correlation value of the first compound in the head group and the last compound in the second group in the link of groups. Thus, B is set to DATA[1st compound in Head group][Last compound in 2nd group].
  • the optimally link groups process 2020 also sets 2001 a third variable, e.g., C, to the correlation value of the last compound in the head group and the first compound in the second group in the link of groups.
  • C is set to DATA[Last compound in Head group][1st compound in 2nd group].
  • the optimally link groups process 2020 also sets 2001 a fourth variable, e.g., D, to the correlation value of the last compound in the head group and the last compound in the second group in the link of groups.
  • D is set to DATA[Last compound in Head group][Last compound in 2nd group].
  • the optimally link groups process 2020 then checks 2002 whether C or D is greater than or equal to A and B. Thus, the optimally link groups process 2020 checks 2002 whether either of the correlation values of the head group tail compound - second linked group head and tail compound pairs is greater than or equal to both the correlation values of the head group head compound - second group head and tail compound pairs. If the head group tail compound generates the same or larger correlation value 2003, the compounds in the head group are stored in a table, or list, e.g., LIST, of compounds, in head to tail order. This is because the head group tail compound is more similar to, and, thus, should be linked closest to, the compounds in the second group.
  • Otherwise, the compounds in the head group are stored in LIST in tail to head order. This is because the head group head compound is more similar to, and, thus, should be linked closest to, the compounds in the second group.
  • the optimally link groups process 2020 loops 2005 through all the compounds in the head group, storing 2007 the compounds in LIST, in head to tail order. Once all the compounds in the head group are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
  • the optimally link groups process 2020 loops 2006 through all the compounds in the head group, storing 2008 the compounds in LIST, in tail to head order. Once all the compounds in the head group are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
  • the optimally link groups process 2020 checks 2009 whether the correlation value of the last compound in LIST, i.e., the last compound in the previous group H stored in
  • the optimally link groups process 2020 checks whether DATA[Last compound in LIST][Head compound in current group H] is greater than or equal to DATA[Last compound in LIST][Tail compound in current group H].
  • LIST has the same or larger correlation value 2011 , the compounds in group H are stored in LIST in head to tail order. This is because the head compound in group H is more similar to, and, thus, should be linked closest to, the last compound stored in
  • the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
  • the optimally link groups process 2020 loops 2012 through all the compounds in current group H, storing 2014 the compounds in LIST in tail to head order. Once all the compounds in the current group H are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
  • the optimally link groups process 2020 ends 2021.
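The orientation step that writes the compounds of each linked group into LIST can be sketched as follows. This is an illustrative reading of the description, not the patented code: the name `flatten_link`, the 4x4 sample array and the handling of a link holding a single group are assumptions. The test "C or D is greater than or equal to A and B" is written here as `max(c, d) >= max(a, b)`, which is logically equivalent.

```python
def flatten_link(link, data):
    """Store the compounds of each linked group in one ordered list (LIST).

    `link` is the ordered link of groups; each group is a list of compound
    indices already ordered from its head compound to its tail compound.
    """
    result = []
    for position, group in enumerate(link):
        head, tail = group[0], group[-1]
        if position == 0:
            if len(link) == 1:
                # Assumption: with no second group, keep head to tail order.
                result.extend(group)
                continue
            second = link[1]
            a = data[head][second[0]]    # head compound vs. 2nd group head
            b = data[head][second[-1]]   # head compound vs. 2nd group tail
            c = data[tail][second[0]]    # tail compound vs. 2nd group head
            d = data[tail][second[-1]]   # tail compound vs. 2nd group tail
            # Tail compound at least as similar to the 2nd group: head to tail.
            ordered = group if max(c, d) >= max(a, b) else group[::-1]
        else:
            last = result[-1]  # last compound already stored in LIST
            # The current group's head goes next if it is at least as similar
            # to the last stored compound as the current group's tail is.
            ordered = group if data[last][head] >= data[last][tail] else group[::-1]
        result.extend(ordered)
    return result


# Hypothetical symmetric correlation array for four compounds.
sample = [
    [1.0, 0.5, 0.9, 0.1],
    [0.5, 1.0, 0.2, 0.3],
    [0.9, 0.2, 1.0, 0.6],
    [0.1, 0.3, 0.6, 1.0],
]
flat = flatten_link([[0, 1], [2, 3]], sample)
print(flat)  # [1, 0, 2, 3]
```

In this sample, compound 0 correlates most strongly with the second group, so the head group is stored tail to head, placing compound 0 next to compound 2.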
  • the pattern recognition oriented cluster process 200 executes a generate pattern matrix process 255.
  • the generate pattern matrix process 255 generates and outputs, e.g., to an output file or a computer terminal screen, a pattern matrix of the grouped compounds of the set of compounds to be clustered.
  • the generate pattern matrix process 255 creates a two-dimensional color-shaded cluster graph representative of the correlation values of the clustered, i.e., grouped, compounds.
  • In a generate pattern matrix process 255, as shown in Figure 29, the titles "class", "compound" and "Code" are printed 2025 in a first output file text row, or line.
  • the generate pattern matrix process 255 then loops X 2026 times, once for each of the set of compounds to be clustered.
  • the compounds X are looped through from the first compound stored in the table LIST to the last compound stored in the table LIST.
  • compounds are stored in the table LIST in optimal order in the optimally link groups process 2020.
  • the generate pattern matrix process 255 prints 2027 the respective compound code value in a unique column in the first output file text row, after the "Code" title.
  • the generate pattern matrix process 255 also prints 2028 the respective compound class description, compound name and compound code value for compound X, under the respective titles.
  • One compound class description, name and code value is printed per output file text row, or line, beginning with a second output file text line.
  • the respective compound class description for compound X was previously stored in table DESC[X] when the input file for the set of compounds to be clustered was processed by the input data process 350.
  • the respective compound name for compound X was previously stored in table NAME[X] when the input file for the set of compounds to be clustered was processed by the input data process 350.
  • the respective compound code for compound X was previously stored in table CODE[X] when the input file for the set of compounds to be clustered was processed by the input data process 350.
  • the generate pattern matrix process 255 only outputs the title "NAME" to an output file, in a first text line.
  • the generate pattern matrix process 255 for each compound in the set of compounds to be clustered, prints only the respective compound name to the output file, one name per output file text line, or row, beginning with a second output file text line. Each compound name is correspondingly printed in a respective column in a first output text line, or row, of the output file.
  • the generate pattern matrix process 255 prints the titles "mode", "structure", "compound" and "code" in a first text line, or row, in an output file.
  • the generate pattern matrix process 255 prints the respective compound code values for each compound in the set of compounds to be clustered, one per column, in the first text line in the output file.
  • the generate pattern matrix process 255 also prints the respective compound mode, if there is one, compound structure, if there is one, and compound name for each compound in the set of compounds to be clustered, under the appropriate titles.
  • The mode, structure and name information for one compound is printed per text line, or row, of the output file, beginning with a second text line.
  • Exemplary cluster graph 2055, shown in Figure 30, is an example of a cluster graph generated in this alternative embodiment.
  • the generate pattern matrix process 255 loops I times 2029, once for each compound in the set of compounds to be clustered.
  • the generate pattern matrix process 255 then loops J times 2030, once for each compound in the set of compounds to be clustered.
  • the generate pattern matrix process 255 loops J 2030 times equal to the number of compounds in the set of compounds to be clustered.
  • the generate pattern matrix process 255 determines 2031 which color group, e.g., COLOR[X], the correlation value for the compound I - compound J pair, e.g., DATA[I][J], is within.
  • COLOR[X] and their respective correlation value ranges were previously established during the set grouping parameters process 220.
  • a presently preferred embodiment default number of color groups is six and a presently preferred embodiment default correlation value range for each color group is shown in Table 2.
  • Once the generate pattern matrix process 255 determines which COLOR[X] group the DATA[I][J] correlation value is within, it prints 2032 to an output file a block of color corresponding to the respective COLOR[X] group.
  • the generate pattern matrix process 255 also prints 2032 to an output file, in the respective block of color, the value of COLOR[X], i.e., the lower value of the range of correlation values for respective COLOR[X].
  • the block of color corresponding to COLOR[X] and the respective value of COLOR[X] are printed in the row compound I - column compound J of the output file.
  • the generate pattern matrix process 255 then loops X 2034 times, once for each color group COLOR. For each COLOR[X] group, the generate pattern matrix process 255 prints 2036 to the output file a block of color corresponding to the respective COLOR[X] group. In a presently preferred embodiment, the generate pattern matrix process 255 also prints 2036 to the output file, in the respective block of color, the value of COLOR[X], i.e., the lower value of the range of correlation values for respective COLOR[X]. The block of color and respective value of COLOR[X] are printed under the title "COLOR_CODE".
  • the generate pattern matrix process 255 also prints 2037 to the output file the correlation range corresponding to the respective COLOR[X] group.
  • the correlation range for the COLOR[X] group is printed under the title "CORRELATION".
  • the generate pattern matrix process 255 generates 2038 a carriage return before looping 2034 to the next COLOR[X] group.
  • the generate pattern matrix process 255 is ended 2035.
  • the pattern recognition oriented cluster process 200 is also ended 260.
  • the methods and apparatus of the present invention provide versatile tools for evaluating sets of random objects and clustering them into groups based on predefined characteristic(s) of the objects.
  • the random objects are chemical compounds and the predefined characteristic is similarity of biological activity.
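
The pattern matrix steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the lower bounds in `COLOR` are hypothetical (Table 2 is not reproduced in this passage), text characters stand in for blocks of color, the upper limit of the top group is assumed to be 1.0, and the function names `color_group` and `pattern_matrix` are invented for the example.

```python
from bisect import bisect_right

# Hypothetical lower bounds for six color groups, in ascending order.
# The actual default ranges are given in Table 2, which is not shown here.
COLOR = [-1.0, 0.0, 0.2, 0.4, 0.6, 0.8]
SHADE = [" ", ".", ":", "+", "#", "@"]  # stand-ins for blocks of color


def color_group(value, lower_bounds=COLOR):
    """Return the index X of the color group whose range contains value."""
    x = bisect_right(lower_bounds, value) - 1
    return max(x, 0)  # values below the lowest bound fall into group 0


def pattern_matrix(data, lower_bounds=COLOR, shades=SHADE):
    """Render one text row per compound I, one block per compound J, then a legend."""
    lines = []
    for row in data:  # loop over compound I
        # loop over compound J and emit the block for DATA[I][J]
        lines.append("".join(shades[color_group(v, lower_bounds)] for v in row))
    lines.append("COLOR_CODE  CORRELATION")
    for x, low in enumerate(lower_bounds):  # loop over the color groups
        # Assumption: the top group extends up to a correlation of 1.0.
        high = lower_bounds[x + 1] if x + 1 < len(lower_bounds) else 1.0
        lines.append(f"{shades[x]} {low:+.1f}    {low:+.1f} to {high:+.1f}")
    return "\n".join(lines)
```

Calling `pattern_matrix(data)` on a square array of correlation values returns the matrix rows followed by the legend; writing the returned string to a file would correspond to the output file mentioned above.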

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to methods and apparatus for evaluating sets of random objects and clustering them into groups based on predefined characteristic(s) of the objects. A group of objects is evaluated against a set of environments. The resulting data is assessed by pattern recognition cluster analysis methods and apparatus to reveal objects with similar activity with the set of environments. The invention also relates to the above methods in which the group of objects is a group of chemical compounds and the set of environments is a set of bacterial mutant strains or a set of gene expressions.

Description

PATTERN RECOGNITION ORIENTED CLUSTER ANALYSIS
Related Applications
This application is related to and claims priority from provisional application serial 60/112,786 filed December 18, 1998 and incorporated by reference, including all drawings and figures, as if fully set forth herein.
Field of Invention
The field of this invention relates to chemistry, biochemistry, statistical analysis and computer science. In particular, it relates to methods and apparatus for discerning characteristic similarities among groups of objects; more particularly, similarities among chemical compounds based on biological activity.
Background of the Invention
The following is offered as background information only and is not intended nor admitted to be prior art to the present invention. The production of chemical diversity, i.e., the production of molecules having different chemical structures, is of great interest to the pharmaceutical industry because it offers the potential for discovery of new classes of therapeutic agents, i.e., classes of compounds useful in the treatment of diseases and disorders of higher organisms such as man and other mammals. Several recent advances have enabled the expanded production of chemical diversity. Such advances include solid state combinatorial chemistry methods and new methods of obtaining large numbers of diverse natural products.
"Chemical space" is a term often used to describe the universe of all possible chemical compounds. The size of theoretical "chemical space" is enormous. As a consequence, the number of potential compounds comprising chemical space is far larger than the number of compounds that can realistically be hoped to be synthesized. Furthermore, it is generally unknown how much of chemical space is valuable; i.e., in pharmacological terms, how much of it contains compounds with useful biological activity. Compounding the situation is the fact that small changes in molecules can often have profound effects on biological activity. Thus, finding that a particular molecule located at a particular point in chemical space is not biologically active does not necessarily mean that the entire region of chemical space around that molecule is likewise inactive.
It is, however, in a global sense, reasonable to assume that some areas of chemical space are richer in biological activity than others. Ideally then, researchers that work to produce chemical diversity generally strive to map regions of chemical space and not simply to test isolated compounds, the idea being to identify those regions of chemical space that are relatively rich in compounds with biological activity.
Within the universe of chemical space lies the sub-universe of "protein space." Again, in theory, protein space is also very large. However, practically speaking, the size of protein space that is used by living organisms is relatively modest. Genomic studies suggest that, across the five kingdoms of living organisms (animals, plants, fungi, bacteria and algae), there are only between twenty and fifty protein "super families," i.e., families of proteins containing amino acid sequences with significant similarities to sequences in other proteins. The implication of this is that the regions of sequence similarity should have similar folding patterns and, therefore, are likely to be evolutionarily related.
The similarities of amino acid sequences among the proteins in all the biological kingdoms render it likely that compounds that affect the activity of proteins in one kingdom will also affect the activity of proteins in the other kingdoms. For example, compounds that affect the activity of bacterial proteins may be expected to be generally bioactive, i.e., active against proteins in the plant kingdom, the fungi kingdom, the algae kingdom and even the animal kingdom, which includes man. Thus, without limitation, the process of bacterial bio-profiling, i.e., testing compounds against a panel of bacteria containing different proteins, may reveal compounds which can be expected to have general activity against proteins of living organisms. By testing a large number of chemical species of both similar and divergent molecular structure, regions of chemical space rich in bioactivity, and therefore rich in potential pharmaceutical utility, may be found. The compounds in the bioactivity-rich regions can then be subjected to closer scrutiny to arrive at compounds having optimal activity against specific biological targets such as, without limitation, proteins having particular functions in an organism.
Frequently, researchers producing chemical diversity will do so by creating "libraries" of compounds; i.e., large numbers of compounds of related structure, often based on a "lead" compound known to have biological activity. In this manner, the researchers can potentially arrive at a great deal of information; however, that information is restricted to a very small region of chemical space. In order to explore large regions of chemical space, it is generally necessary to test libraries of compounds; i.e., compounds with no obvious, perhaps not even intuitively, discernable structural relationship. The problem then becomes, as biological activity among the diverse compounds is revealed, how to determine which of the compounds define regions of chemical space rich in bioactivity. There is also the further problem of how to determine what the defining parameters of a bioactivity-rich region of chemical space are.
Methods for arriving at the raw data; i.e., the bioactivity of individual compounds in a large group of compounds against a large group of proteins, are available. See, for instance, U. S. Pat. App. Serial No. 08/377,329 which is incorporated herein by reference as if fully set forth herein. The problem is how to develop useful information from the raw data, e.g., how to determine which of a group of compounds of disparate chemical structure have similar biological activity and, further, how those compounds of similar biological activity relate to one another.
One procedure currently in use to derive useful information from compound vs. biological activity raw data is COMPARE, a computer program developed for the National Cancer Institute. COMPARE correlates similarities among growth inhibition patterns of chemical compounds. Compounds are first tested against sixty human cancer cell lines in disease-specific panels. The growth inhibition patterns of the tested compounds are then subjected to COMPARE analysis, which generates predictions about similarities, usually with regard to mechanism of action or molecular target. See, e.g., K. D. Paull, et al., Journal of the National Cancer Institute, 1989, 81(14):1088-1092; J. N. Weinstein, et al., Science, 1992, 258:447-451; S. E. Bates, et al., J. Cancer Res. Clin. Oncol., 1995, 121:495-500. To use COMPARE, a "seed compound"; i.e., a compound of known activity, is tested against the disease-specific panels, and a "fingerprint" of the compound's activity, known as a "mean graph," is generated. The compounds to be compared are then also tested against the disease-specific panels and their respective mean graphs generated. COMPARE then ranks the compounds according to their mechanism of action or molecular target by calculation of Pearson correlation coefficients. COMPARE treats only one compound/target protein pair at a time and only compares raw data, i.e., a value corresponding to the activity of one compound against a panel of biological targets, with similar raw data of another compound.
Another approach to evaluating groups of objects is to analyze the entire group of objects simultaneously rather than one pair at a time. This approach is called "cluster analysis." Cluster analysis involves procedures for taking a group of objects and "clustering" them in subgroups which differ from other subgroups in meaningful ways. With regard to chemical compounds, a "meaningful way" that subgroups can be formed from a large group of chemical compounds is by their performance properties such as, without limitation, mode of action against biological targets.
A method for mapping coherent patterns of data among groups of compounds, i.e., for cluster analysis, is described in J. N. Weinstein, et al., Science. 1997, 275:343-349. The method is embodied in a computer program called DISCOVERY which performs the analysis of the raw data of compound vs. biological target and displays the results in "cluster correlation" maps.
There are, however, several problems with conventional methods that use Euclidean distance as a similarity measure. One of these is false similarity recognition. An example of this is shown in Figures 1-3. The table 50 in Figure 1 is comprised of three objects, A 10, B 20 and C 30. Each object has interacted with each of four operators, M1 5, M2 15, M3 25 and M4 35. In table 50, each object-operator pair has an associated raw data value 40, representing the interaction between each object and each operator. By simply mapping each of the object-operator pair raw data values, as shown in Figure 2, the graph 60 for object A 10 appears similar to the graph 70 for object B 20, and dissimilar to the graph 80 for object C 30. Thus, one would normally conclude that objects A 10 and B 20 are similar, and object C 30 is dissimilar. However, referring to Figure 3, object A 10 and object C 30 appear nearly identical when their object-operator pairs are graphed using similarity scaled axes. Graph A 90 and graph C 94 have a virtually identical shape in Figure 3, while graph B 92 is clearly dissimilar to both graphs A 90 and C 94. Thus, using conventional systems, a false similarity is determined between objects A 10 and B 20, while a true similarity between object A 10 and object C 30 is left undiscovered.
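The false similarity problem described above can be illustrated numerically. The sketch below uses hypothetical raw data values for objects A, B and C (the actual values of Figure 1 are not reproduced in this passage) and assumes a Pearson correlation as the shape-based similarity measure. Under those assumptions, Euclidean distance places A closest to B, while the correlation identifies A and C as having the same pattern.

```python
from math import dist, sqrt

# Hypothetical raw data for three objects against operators M1..M4.
A = [10.0, 12.0, 10.0, 12.0]
B = [11.0, 11.0, 12.0, 10.0]      # close to A in magnitude, different pattern
C = [100.0, 120.0, 100.0, 120.0]  # far from A in magnitude, same pattern


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / (sx * sy)


# Euclidean distance: smaller means "more similar" under the conventional metric.
euclid_ab, euclid_ac = dist(A, B), dist(A, C)
# Correlation: larger means the two profiles have a more similar shape.
corr_ab, corr_ac = pearson(A, B), pearson(A, C)
```

Here `euclid_ab` is much smaller than `euclid_ac`, yet `corr_ac` is far higher than `corr_ab`, mirroring the contrast between Figure 2 and Figure 3.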
Furthermore, with most current systems for analyzing compounds, the results are generally displayed as dendrograms, or tree graphs. For instance, when compound interrelationships are desired from COMPARE generated data, dendrograms are prepared by using auxiliary methods such as Ward's method and a Euclidean distance metric. See, for example, Bates, J. Cancer Res. Clin. Oncol., 1995, 121:495, 497, Figure 2. Figure 4 is an example of a simple dendrogram 100. As can be seen, the resultant analysis is limited by the fact that the information contained within the dendrogram 100 is not presented in a manner optimally conducive to direct, i.e., visual, human interpretation of the interrelationships among the objects. For example, while it is clear that objects A 102 and B 104, as displayed in dendrogram 100, are similar, there is no presentation of the relative similarity of object A 102 to objects G 106 or H 108. Further, even though objects A 102 and B 104 and objects G 106 and H 108 are each presented as similar in dendrogram 100, there is no indication of whether objects A 102 and B 104 are more similar than, less similar than, or as similar as objects G 106 and H 108.
Finally, in conventional cluster analysis, such as that employed in DISCOVERY, an initial subjective decision must be made as to how clusters will be formed and each means of forming clusters has its limitations. Two popular clustering techniques, or methods, the "hierarchical" and the "partitioning" techniques, both have inherent shortcomings. In the hierarchical method, all data points are initially considered to be single point clusters or, alternatively, the entire set of data is considered one large cluster. One of two things then happens. By a variety of algorithmic manipulations, point clusters are successively joined together to form larger and larger clusters, the agglomerative technique, or the initial large cluster containing all data points is successively split into smaller and smaller clusters, the divisive technique. One problem with this hierarchical method is when to stop, i.e., how to determine when clusters revealing optimal interrelationships between compounds have been generated. With the agglomerative approach, eventually all data points are joined into one large cluster; with the divisive approach all data points eventually end up as individual point clusters. Another problem specific to the hierarchical method is that once an object is allocated to a particular cluster, that allocation, whether or not it is an optimal allocation, is irrevocable; that is, once an object joins a cluster it is never removed from that cluster and fused, or otherwise joined, with objects belonging to some other cluster.
The partitioning method of cluster analysis does not require that allocation of an object to a cluster be irrevocable. That is, objects may be reallocated if their initial assignments are found to be inaccurate, or otherwise unsatisfactory. However, the partitioning method suffers from the general requirement that the number of final clusters be known and specified in advance; this limitation is a serious shortcoming if the goal is in fact to determine how many clusters exist in a particular group.
Thus, a method is needed which can quickly and accurately extract from raw data the maximum amount of information available regarding the interrelationships among all objects in a particular group, including classification of the objects into correct subgroups, while at the same time minimizing the misclassification of objects into wrong subgroups. The method should operate in a dynamic manner, i.e., a manner which permits reallocation of objects from cluster to cluster, as permitted by the partitioning technique but not the hierarchical technique. It should also not require foreknowledge of the final number of clusters, a limitation present in the partitioning technique but not in the hierarchical technique. It would be further advantageous if the method is fully user interactive, scalable, flexible and robust. The present invention provides such a method.
Summary of the Invention
In one aspect, the present invention relates to a method for evaluating sets of random objects and clustering, or otherwise grouping, them into groups based on predefined characteristic(s) of the objects. In a presently preferred embodiment, the invention relates to a method and apparatus for evaluating a group of chemical compounds and clustering them into sub-groups based on similarity of biological activity. In a further presently preferred embodiment, bacterial mutant strains are employed to evaluate a group of chemical compounds. The raw data obtained is assessed by the pattern recognition cluster analysis methods and apparatus of this invention to reveal compounds having similar biological activity. A method for evaluating the similarity of biological activity in a group of chemical compounds using a panel of organisms having disparate gene expressions is yet another presently preferred embodiment of this invention. The ability of the chemical compounds to affect the expression of the different products of the various genes is detected and provides a measure of the similarity of biological activity of the compounds within the group being tested.
By a "group" of compounds is meant as few as two compounds to as many as 10^9 compounds.
By "similar biological activity" is meant compounds having a similar pattern of activity against a panel of biological targets, such as, for example and without limitation, a panel of proteins. A similar pattern of activity may be manifested simply as a plus or minus; i.e., either a chemical has an effect against a protein in the panel or it does not. Thus, using the simple plus/minus approach, two compounds display a "similar biological activity" if they are plus against the same proteins in a panel of proteins and minus against the same proteins in the panel. A similar pattern of activity may also be a more complex similarity relating to such things as, without limitation, the biochemical target of the effect, the manifestation of the effect or the amount of the effect. For example, an effect may be manifested by, again without limitation, a change in cell phenotype or a change in the ability of a protein to perform its biological function.
"Bacterial mutant strain" or "mutant strain" refers to a strain of bacteria in which biochemical activity has been modified such that the bacteria exhibit a diminished or an enhanced level of activity with regard to a selected parameter when compared to normal bacteria of the same species. Examples of such parameters are, without limitation, the ability of bacteria to grow at different temperatures, at different pHs, in the presence of different nutrients, etc. When the expression of diminished or enhanced activity is linked to a particular biomolecule, a change in the level of activity in the presence of a chemical being tested indicates that the chemical is affecting the biomolecule either directly or indirectly; i.e., by interacting with the biomolecule itself or by interacting with another molecule on which the biomolecule relies to perform its function.
By "gene expression" is meant a process by which a living organism manufactures chemical products under the direction of a gene. Gene expression can either be a wild type, i.e., the manufacture of a chemical produced by the organism in its natural state, or it may be engineered. By "engineered" is meant that the genome of the organism is altered such that a gene which expresses a non-natural (for that species) chemical is incorporated into the genome. Examples, without limitation, of genes which may be engineered into an organism's genome and which express chemicals that are readily detectable are the lux gene, which expresses the enzyme luciferase, and the cat gene, which expresses the enzyme chloramphenicol acetyl transferase. As used herein, "gene expression" also refers to an environment containing organisms which harbor the genes and which express selected chemicals.
While growth inhibition of mutant strains and gene expression are presently preferred embodiments of this invention, it is understood that numerous other indications of similar biological activity including, but not limited to, other biochemical assays, other whole cell assays and the like are within the spirit and scope of this invention.
In a presently preferred embodiment, two or more objects of a set of objects are grouped into one or more groups. In a presently preferred embodiment, an object similarity score is generated for each pair of objects in the set of objects to be grouped, or clustered. Two or more objects are then assigned, or grouped, into one or more groups of objects. The criteria for assignment of an object to a particular group are the object similarity scores generated for the pairs of objects to be clustered. In a presently preferred embodiment, the objects of an established group are ordered. The criteria for ordering the objects of a group are the object similarity scores for the pairs of objects of the group.
In a presently preferred embodiment, a group similarity score is generated for each pair of groups of objects. In a presently preferred embodiment, the groups of objects are then ordered, based on the group similarity scores.
In another presently preferred embodiment, a pattern matrix is generated for a set of groups of objects. In a presently preferred embodiment, the generated pattern matrix provides a visual representation of the grouping and similarity, or relative similarity, of the grouped objects represented in the respective matrix.
Thus, a general object of the invention is to provide a method and apparatus for optimally clustering, or grouping, a group of objects. In a presently preferred embodiment, the grouping is based on the objects' similarity scores with each other. In a presently preferred embodiment, the invention provides a method and apparatus for optimally clustering a group of compounds, based on the similarity of the compounds' interactions with various bacterial mutant strains or gene expressions. A further general object of the invention is to provide a method and apparatus for displaying the results of a clustering of groups of objects. In a presently preferred embodiment, a pattern matrix is generated which provides a visual representation of both the grouping and the similarity, or relative similarity, of the grouped objects represented in the respective matrix.
While the present invention is described herein in the context of a two-dimensional analysis, i.e., one dimension being chemical compounds and the other being mutant strains, it is understood that the methods and apparatus disclosed are in fact multi-dimensional. That is, clustering of objects can be performed with relation to more than one environment. This can be accomplished using a three-dimensional analysis, in which the pattern matrix generated will likewise be three-dimensional. For example, without limitation, the X-axis of a plot, or pattern matrix, could be chemical compounds, the Y-axis could be chemical compounds and the Z-axis could be respective correlation values. In this example, the clusters of objects could be visualized as peaks wherein the strength of correlation would be indicated by the height of the peak. The methods and apparatus are not limited to three dimensions either, and clusters could be mathematically formed with regard to as many simultaneous environments as desired. However, a pattern matrix may, of course, only visualize three separate parameters at a time although two or more parameters may be simultaneously displayed on a same axis.
Other and further objects, features, aspects and advantages of the invention are disclosed and will become better understood in the following figures and detailed description of the invention.
Brief Description of the Tables
Table 1 depicts a presently preferred embodiment of groups of correlation ranges.
Table 2 depicts a presently preferred embodiment of a default correlation value range and a default lower correlation value for each of a default number of color groups.
Brief Description of the Figures
Figure 1 depicts an exemplary object-operator table.
Figure 2 is a representative graph of the object-operator interrelationship from the table of Figure 1, without any correlation or similarity measurement scaling.
Figure 3 depicts representative graphs of the object-operator interrelationships from the table of Figure 1 with similarity measurement scaling.
Figure 4 depicts an exemplary dendrogram of a resultant cluster analysis.
Figure 5 depicts a presently preferred embodiment of a pattern recognition oriented cluster method.
Figure 6 depicts a presently preferred embodiment of a cluster process.
Figure 7 depicts a presently preferred embodiment exemplary pattern matrix output.
Figure 8 depicts a presently preferred embodiment pattern recognition oriented cluster method processing flow.
Figure 9 depicts a presently preferred embodiment input data processing flow.
Figures 10A and 10B depict a presently preferred embodiment input data processing flow.
Figure 11 depicts an exemplary temporary array of raw data for compound - mutant strain pairs.
Figure 12 depicts an exemplary DATA array of correlation values for n compounds.
Figure 13 depicts a presently preferred embodiment calculate correlation value processing flow.
Figure 14 depicts a presently preferred embodiment determine correlation distribution processing flow.
Figure 15 depicts a presently preferred embodiment set grouping parameters processing flow.
Figure 16 depicts a presently preferred embodiment group compounds processing flow.
Figures 17A and 17B depict exemplary correlation value tables.
Figure 18 depicts an exemplary correlation value table.
Figures 19A, 19B and 19C depict a presently preferred embodiment group compounds processing flow.
Figure 20 depicts a presently preferred embodiment determine overlapping groups processing flow.
Figures 21A and 21B depict a presently preferred embodiment combine groups processing flow.
Figure 22 depicts an exemplary array of correlation values for seven compounds.
Figures 23A and 23B depict a presently preferred embodiment optimally link compounds in group processing flow.
Figure 24 depicts an exemplary correlation value array for a group of twelve compounds.
Figure 25 depicts a presently preferred embodiment optimally link groups processing flow.
Figure 26 depicts an exemplary array of mean correlation distance values for five groups of compounds.
Figures 27A, 27B, 27C, 27D and 27E depict a presently preferred embodiment optimally link groups processing flow.
Figure 28 depicts a presently preferred embodiment exemplary pattern matrix output.
Figure 29 depicts a presently preferred embodiment generate pattern matrix processing flow.
Figure 30 depicts an alternative embodiment exemplary pattern matrix output.
Detailed Description of the Invention
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art, that the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the invention. Thus, it is understood that, while the presently preferred embodiment of this invention relates to the analysis of similarities among chemical compounds based on biological activity, applying the method to analyze the similarity of any set of objects based on any selected criterion or criteria is within the scope and spirit of this invention.
Also, in the following description, for purposes of description and explanation, exemplary and/or presently preferred embodiment variable names are provided for various entities used in the described processes. The variable names are thereafter used to describe the respective entity. It will be apparent, however, to one skilled in the art, that the invention may be practiced with other variable names and/or other complementary entities, without departing from the spirit and scope of the invention.
Thus, in a presently preferred embodiment of a pattern recognition oriented cluster method 110, as shown in Figure 5, raw data is gathered 112 on a set of objects to be clustered. In a presently preferred embodiment, the objects to be clustered are chemical compounds, and each raw data value, or compound value, is an interaction value of the respective compound in an environment. In a presently preferred embodiment, an environment is a mutant strain and each raw data value of a compound is an interaction value between the compound and a mutant strain of a set of mutant strains. In a further presently preferred embodiment of this invention, the environment is a gene expression.
A similarity measure, or object similarity score, 113 is then generated for each of the object pairs that can be constructed from the set of objects to be clustered. In a presently preferred embodiment, a correlation value, or coefficient, 114 for a respective object pair is generated from the raw data points, or compound values, for each respective object of the object pair; i.e., from the objects' interaction values with a set of mutant strains. In a presently preferred embodiment, the correlation value 114 of an object pair represents the similarity of the interactions of the objects of the pair with a set of mutant strains or gene expressions. The correlation value 114 of one object pair is then compared to the correlation values 114 of every other object pair as a similarity value, or coefficient, 116 of the objects comprising the object pairs. In general, the higher the correlation value 114, which represents a similarity value 116, of an object pair, the more similar the objects of the object pair are deemed to be.
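The pairwise scoring step described above can be sketched as follows. This is a hedged illustration only: the passage does not state the formula used for the correlation value, so a Pearson correlation coefficient is assumed here, and the function names `pearson` and `correlation_array` are hypothetical. The input `raw` holds, for each compound, its interaction values against the same ordered set of mutant strains.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (assumed similarity measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / (sx * sy)


def correlation_array(raw):
    """Build a symmetric array of correlation values, one per compound pair."""
    n = len(raw)
    # Assumption: each compound is given a correlation of 1.0 with itself.
    data = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):  # every distinct object pair once
        data[i][j] = data[j][i] = pearson(raw[i], raw[j])
    return data
```

The resulting array plays the role of the DATA array of correlation values: a higher entry for a pair means the two compounds are deemed more similar.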
Once the correlation values are established, the objects, e.g., compounds, of the set of objects are clustered, or grouped, 118, using a combination of clustering processes, or techniques, 117. First, closely related objects, e.g., compounds, are clustered into respective groups. In a presently preferred embodiment, a method of dynamic linkage is used to create groups of similar objects. The groups are then analyzed, to discard those that overlap with other, larger groups, comprising the same objects. The groups are then merged, in order that one particular object is a member of only one group. Each group is then optimized, so that the objects in the group are ordered according to their relative similarity to each other. Finally, all of the groups are optimized, so that each group is ordered, in relation to every other group, according to the relative similarity of the objects comprising the respective groups.
A process for cluster display 119 is then performed. In a presently preferred embodiment, a pattern matrix is generated and output 120, for displaying the resultant groupings of the set of objects to be clustered. In a presently preferred embodiment, the pattern matrix provides a visual representation of the grouping, similarity, and relative similarity of the objects in a set of objects to be clustered. In a presently preferred embodiment, the resultant pattern matrix comprises all the objects in all the optimized groups arranged along both an X-axis and a Y-axis. The X-axis and Y-axis are themselves comprised of the objects of the set of objects to be clustered. In a presently preferred embodiment, different colors, or shades of color, corresponding to different respective correlation ranges, are established in a group of color codes, to enhance the appearance of the resultant output matrix. The corresponding correlation value, i.e., similarity measure, of each object pair, for the set of objects to be clustered, is mapped into a respective color code group, and a block of the respective color is output to the pattern matrix. This provides user-friendly visual inspection and identification of closely correlated objects, as well as the established groups of objects. An exemplary pattern matrix 185, in black and white, referring to Figure 7, highlights the visual inspection and identification features of a presently preferred output pattern matrix.
In a presently preferred embodiment of a cluster process 150, as shown in Figure 6, a method for the dynamic linkage of compounds into groups 155 comprises producing all the initial groups of compounds that meet either default, or user-defined, criteria. Groups with shared compounds are then merged 160, in order that any one object is in only one group; i.e., the objects in each of the established groups of objects are mutually exclusive.
A method of intra-group optimization 165 then optimizes the order of the objects within each group. In a presently preferred embodiment, the ordering is accomplished using a nearest neighbor mapping process 170. In a presently preferred embodiment, the two objects in a group, i.e., the first object and the second object, that are most similar in the group, i.e., their correlation value, or similarity measure, is the highest, are located and linked together. One of the two linked objects, e.g., the first object, is designated the head of the link of objects, and the other of the two linked objects, e.g., the second object, is designated the tail of the link of objects.
Then the object, i.e., a third object, of the group, if there is one, that is most similar to either one of the first two objects, i.e., a nearest neighbor object, is located. If the third object is more similar to the first object, it is linked next to the first object; i.e., it becomes the new head of the link of objects for the group. In other words, if the correlation value for the first and third objects of the group is larger, or greater, than the correlation value for the second and third objects of the group, the third object is linked next to the first object. Otherwise, if the third object is more similar to the second object, it is linked next to the second object; i.e., it becomes the new tail of the link of objects for the group. In other words, if the correlation value for the second and third objects of the group is larger than the correlation value for the first and third objects of the group, the third object is linked next to the second object.
Next, a fourth object of the group, if there is one, is located that is most similar to either of the ends, i.e., the head or the tail object, of the existing link of objects of the group. The fourth object is then linked to the end object in the group link that it is more similar to. The intra-group optimization process 165 proceeds through all the objects of a respective group, until all the objects are linked for the group. The resultant link of objects of a group consists of an ordering of the objects that reflects their general relative similarity to one another.
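The nearest neighbor mapping described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `link_nearest_neighbor`, the use of a callable `sim(a, b)` returning a correlation value, the requirement that members be distinct, and the rule that ties go to the head of the link are all illustrative choices that the description does not specify.

```python
from collections import deque


def link_nearest_neighbor(members, sim):
    """Order members into a link by repeatedly attaching the unlinked
    member that is most similar to either end (head or tail)."""
    members = list(members)
    if len(members) < 2:
        return members
    # Seed the link with the most similar pair: head and tail.
    pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    head, tail = max(pairs, key=lambda p: sim(p[0], p[1]))
    chain = deque([head, tail])
    remaining = [m for m in members if m not in (head, tail)]
    while remaining:
        # Nearest neighbor of either end of the existing link.
        best = max(remaining,
                   key=lambda m: max(sim(m, chain[0]), sim(m, chain[-1])))
        if sim(best, chain[0]) >= sim(best, chain[-1]):
            chain.appendleft(best)   # becomes the new head (ties: assumption)
        else:
            chain.append(best)       # becomes the new tail
        remaining.remove(best)
    return list(chain)


# Illustrative correlation values only.
corr = {frozenset("AB"): 0.9, frozenset("AC"): 0.2, frozenset("AD"): 0.7,
        frozenset("BC"): 0.8, frozenset("BD"): 0.1, frozenset("CD"): 0.3}
sim = lambda a, b: corr[frozenset((a, b))]
order = link_nearest_neighbor("ABCD", sim)
```

In this example A and B seed the link, C attaches to the tail next to B, and D attaches to the head next to A.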
Following the intra-group optimization process 165 execution, a method of inter-group optimization 175 optimizes the ordering of the groups of objects. In a presently preferred embodiment, a group similarity score is generated for each pair of groups. In a presently preferred embodiment, the group similarity score is a mean cumulative distance value, or average correlation value, for a pair of groups. The mean cumulative distance value for a group pair serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value, the more similar the objects in the two groups generally are.
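One plausible reading of the group similarity score can be sketched in Python. The text describes the score only as a mean cumulative distance value, or average correlation value, for a pair of groups; averaging over every cross-group object pair, as done below, is an assumption, as are the function name `group_similarity` and the sample data.

```python
def group_similarity(group_a, group_b, sim):
    """Average correlation value over all cross-group object pairs
    (assumed definition of the mean cumulative distance value)."""
    scores = [sim(a, b) for a in group_a for b in group_b]
    return sum(scores) / len(scores)


# Illustrative correlation values only.
corr = {frozenset("AC"): 0.2, frozenset("AD"): 0.7,
        frozenset("BC"): 0.8, frozenset("BD"): 0.1}
sim = lambda a, b: corr[frozenset((a, b))]
score = group_similarity(["A", "B"], ["C", "D"], sim)
```

Under this reading, a higher score indicates that the objects of the two groups are generally more similar to one another.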
The ordering of the groups is then accomplished using a nearest neighbor mapping process 180 based on the respective mean cumulative distance values. In a presently preferred embodiment of an inter-group optimization process 175, the two groups, i.e., a first group and a second group, that are most similar, i.e., that have the largest mean cumulative distance value, are located and linked together. One of the two linked groups, e.g., the first group, is designated the head of the link of groups, and the other of the two linked groups, e.g., the second group, is designated the tail of the link of groups.
Then the group, i.e., a third group, if there is one, that is most similar to either the first or second group, i.e., a nearest neighbor group, is located. If the third group and the first group are more similar, i.e., have the higher mean cumulative distance value, the third group is linked next to the first group; i.e., the third group becomes the new head of the link of groups. Otherwise, if the third group and the second group are more similar, i.e., have the higher mean cumulative distance value, the third group is linked next to the second group; i.e., the third group becomes the new tail of the link of groups.
Next, a fourth group, if there is one, that is most similar to one of the end groups, i.e., the head or the tail group, of the existing link of groups is located. The fourth group is then linked to the end group in the link of groups that it is most similar to, i.e., the end group with which it has the higher mean cumulative distance value. The inter-group optimization process 175 proceeds through all the groups, until all the groups are linked. The resultant link of groups consists of an ordering of the groups that generally reflects their relative similarity to one another.
In a presently preferred embodiment, a pattern recognition oriented cluster process 200 is executed by a computer, i.e., a computer or other processing device or entity. In a presently preferred embodiment, the computer program comprising the pattern recognition oriented cluster process 200 is stored, or otherwise resides, on a data storage device, for example, but not limited to, Read Only Memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), floppy disk, flexible disk, magnetic tape, CD-ROMs, punchcards or papertape.
As shown in Figure 8, upon starting 205 the pattern recognition oriented cluster process 200, an input data process 210 is executed. Generally, the input data process 210 inputs the data relating to the set of objects to be clustered from an input file. If the input data relating to the set of objects to be clustered is raw data, i.e., a correlation value, or similarity measure, has not been generated for the object pairs in the set of objects to be clustered, then the input data process 210 also generates the correlation value for each object pair. Upon inputting all relevant data, the pattern recognition oriented cluster process 200 then executes a determine correlation distribution process 215.
Generally, the determine correlation distribution process 215 generates a distribution of the correlation values, i.e., similarity measures, for all object pairs in the set of objects to be clustered. The pattern recognition oriented cluster process 200 thereafter executes a set grouping parameters process 220. Generally, the set grouping parameters process 220 allows the user of the pattern recognition oriented cluster process 200 to choose whether to use default, pre-established, grouping and pattern matrix generation parameters, or input their own. The parameters that a user may specify for clustering, i.e., cluster parameters, include, but are not limited to, an initial correlation group limit parameter, an add-in correlation group limit parameter, a minimum correlation group limit parameter and a minimum average correlation group limit parameter. The parameters that a user may specify for pattern matrix generation include, but are not limited to, the number of color groups to be used in the output pattern matrix and the correlation ranges for each of the respective color groups. The pattern recognition oriented cluster process 200 also executes a group compounds process 225. Generally, the group compounds process 225 groups the objects in the set of objects to be clustered into one or more groups, based on default, or, alternatively, user-specified, cluster parameters. Once all the groups that may be established for the set of objects to be clustered are so established by the group compounds process 225, the pattern recognition oriented cluster process 200 executes a determine overlapping groups process 235. Generally, the determine overlapping groups process 235 determines if there are any established groups whose objects are all contained in at least one other group. If so, the determine overlapping groups process 235 marks the group whose objects are all contained in at least one other group as an overlapping group. 
In a presently preferred embodiment, the pattern recognition oriented cluster process 200 does no further processing of any and all the groups marked as overlapping.
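The marking of overlapping groups can be sketched in Python as below. The function name `mark_overlapping` and the treatment of two identical groups (only the later one is marked, so that one copy survives) are assumptions; the description does not state how identical groups are handled.

```python
def mark_overlapping(groups):
    """Return the indices of groups whose objects are all contained
    in at least one other group."""
    sets = [set(g) for g in groups]
    overlapping = set()
    for i, group in enumerate(sets):
        for j, other in enumerate(sets):
            if i == j:
                continue
            # Proper subsets are always marked. Of two identical groups,
            # only the later one is marked (a tie-breaking assumption).
            if group < other or (group == other and j < i):
                overlapping.add(i)
                break
    return overlapping


groups = [["A", "B", "C"], ["A", "B"], ["D", "E"], ["A", "B", "C"]]
marked = mark_overlapping(groups)
```

Groups whose indices are returned would simply be skipped by the later processing steps.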
Once all the groups that may be established for the set of objects to be clustered are so established by the group compounds process 225, the pattern recognition oriented cluster process 200 executes a combine groups process 240. Generally, the combine groups process 240 combines one or more groups that have one or more objects in common. Upon executing the combine groups process 240, the objects in each of the remaining groups are mutually exclusive, i.e., no one object is in more than one group.
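A minimal Python sketch of combining groups that share objects follows. It merges transitively, so that any chain of groups linked by shared objects collapses into one group; this is an assumption about the combine groups process, whose detailed merge criteria are not given in this passage. The function name `combine_groups` is likewise hypothetical.

```python
def combine_groups(groups):
    """Merge groups sharing at least one object until all remaining
    groups are mutually exclusive."""
    merged = []
    for group in groups:
        current = set(group)
        kept = []
        for existing in merged:
            if existing & current:
                current |= existing      # absorb a group with shared objects
            else:
                kept.append(existing)    # untouched group stays as-is
        kept.append(current)
        merged = kept
    return merged


combined = combine_groups([{"A", "B"}, {"C", "D"}, {"B", "C"}, {"E"}])
```

Here the group containing B and C bridges the first two groups, so the result contains one group of A, B, C and D and one group of E.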
The pattern recognition oriented cluster process 200 also executes an optimally link compounds in group process 245. Generally, the optimally link compounds in group process 245 optimally orders the objects in each group, based on the respective object pairs' correlation values, or similarity measures. The pattern recognition oriented cluster process 200 thereafter executes an optimally link groups process 250. Generally, the optimally link groups process 250 optimally orders the groups, based on group similarity scores generated for each pair of groups. In a presently preferred embodiment, a group similarity score is a mean cumulative distance value for a pair of groups.
The pattern recognition oriented cluster process 200 also executes a generate pattern matrix process 255. Generally, the generate pattern matrix process 255 generates and outputs to, for example, but not limited to, an output file or a computer terminal screen, a pattern matrix of the grouped objects of the set of objects to be clustered. In a presently preferred embodiment, the resultant pattern matrix provides a visual representation of the grouping, similarity, and relative similarity of the objects in the set of objects to be clustered. Following the execution of the generate pattern matrix process 255, the pattern recognition oriented cluster process 200 is ended 260.
As previously discussed, in a presently preferred embodiment, the objects in the set of objects to be clustered are chemical compounds. Each compound is interacted with, or otherwise in, one or more environments. In a presently preferred embodiment, an environment is a mutant strain or a gene expression. Each raw data value, or compound value, in a respective input file is an interaction value between the respective compound and a mutant strain or gene expression. In a presently preferred embodiment, the object similarity score for each object, e.g., compound, pair is a value indicative of the similarity of the objects in the pair when interacting with the various mutant strains in the set of mutant strains or the various gene expressions in the set of gene expressions.
For ease of description, the invention will be described herein for use in clustering a set of compounds, based on the compounds' interaction with each of a set of mutant strains. However, interaction of compounds with each of a set of gene expressions is also contemplated by the invention herein. Furthermore, one of ordinary skill in the art would understand that the invention can be used with a variety of other objects and/or other derived raw data points, without departing from the spirit and scope of the invention.
As previously discussed with regard to Figure 8, after the pattern recognition oriented cluster processing is initiated, or started, 205, an input data process 210 is executed. Generally, the input data process 210 inputs the data relating to the set of compounds to be clustered from an input file. As shown in Figure 9, the input data process 210 allows for either 302 the input file to have raw data, or object similarity score data, for a set of compounds to be clustered. If the input data comprises object similarity score values 304, the data is read in 308 from the input file. Each object similarity score value read in from the respective input file is then stored 310 in an array of object similarity score values, e.g., DATA array. More specifically, each object similarity score for a compound I - compound J compound pair is stored in a respective DATA[I][J] entry. When all the object similarity scores are read in from the input file and stored in a respective array, the input data process 210 is ended 312.
If, on the other hand, the data in the input file is raw data 306, then the raw data for each compound-mutant strain pair is read into an array. In a presently preferred embodiment, any raw data value for a compound I - mutant strain J pair that is outside an upper or lower threshold value is set 316 to the respective threshold value. Thus, if the raw data value for any compound I - mutant strain J pair is above a specified high threshold value, the raw data value for the compound I - mutant strain J pair is set 316 to the high threshold value. Similarly, if a raw data value for any compound I - mutant strain J pair is below a specified low threshold value, the raw data value for the compound I - mutant strain J pair is set 316 to the low threshold value.
Once the raw data for at least two compounds is read in from the input file, an object similarity score for compound I, the latest compound to have its respective raw data read in from the input file, and every other compound J that has already had its data read in from the input file is generated 318. Each generated object similarity score for a compound I - compound J pair is stored 320 in an array of object similarity scores, e.g., DATA array. More specifically, each generated object similarity score for a compound I - compound J pair is stored in a respective DATA[I][J] entry.
When all the raw data for all of the compounds in the set of compounds to be clustered is read in from the input file and the object similarity scores for all the respective compound pairs are generated, the input data process 210 is ended 312.
Referring to Figures 10A and 10B, a more detailed presently preferred embodiment of an input data process 350 requests the user to input one of two data formats to a computer, i.e., a computer or other processing device or entity, executing the pattern recognition oriented cluster process 200. In a presently preferred embodiment, the user is requested to input 355 a value of one ("1") if the input file to be used for the pattern recognition oriented cluster process 200 comprises raw data for a set of compounds to be clustered, and a set of mutant strains. The user is requested to input 355 a value of two ("2") if the input file comprises object similarity scores for compound pairs in the respective set of compounds to be clustered. The user is then requested to input 360 to the computer the input file name for either the raw data, or the object similarity score data. The user is also requested to input 365 to the computer the total number of compounds, e.g., N, in the set of compounds to be clustered. In a presently preferred embodiment, the total number of compounds in a set of compounds to be clustered should be in the range of two to five hundred.
The input data process 350 then checks 370 the user-indicated data format for the input file. If the data format is equal to two ("2") 372, meaning the input file contains object similarity score data, the input data process 350 reads in 374 from the input file the names of each of the compounds in the set of compounds to be clustered. The input data process 350 stores 374 each compound name in an appropriate entry in a table, e.g., NAME[ ].
The input data process 350 then loops I 376 times, for as many compounds N as there are in the set of compounds to be clustered. For each compound I, i.e., Ith compound, the input data process 350 loops J 378 times, through all compounds up to and including the Ith compound. For example, if the input data process is on the I equal to third (I=3rd) compound, then it loops J equal to the first through third (J=1st to 3rd) compounds. For each loop J 378, the input data process 350 stores 380 the respective object similarity score from the input file for the Ith compound - Jth compound pair in an array entry, e.g., DATA[I][J].
For example, the object similarity score for the first Ith compound - first Jth compound pair is stored in DATA[0][0]. As another example, the object similarity score for the third Ith compound - second Jth compound pair is stored in DATA[2][1]. In a presently preferred embodiment, the array of object similarity scores, e.g., DATA array, is indexed from zero. In an alternative embodiment, DATA array is indexed from one.
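The nested I and J loops that fill the DATA array can be sketched in Python. The sketch assumes the scores arrive as a flat sequence listed row by row (one value for the first compound, two for the second, and so on); the actual layout of the input file is not specified here, and the function name `read_lower_triangle` is hypothetical.

```python
def read_lower_triangle(values, n):
    """Fill DATA[I][J] for J <= I from scores listed row by row.
    Entries above the diagonal are left unset (None)."""
    source = iter(values)
    data = [[None] * n for _ in range(n)]
    for i in range(n):              # loop I: each compound
        for j in range(i + 1):      # loop J: compounds up to and including I
            data[i][j] = next(source)
    return data


# Three compounds, zero-indexed as in the preferred embodiment.
DATA = read_lower_triangle([1.0, 0.5, 1.0, 0.2, 0.8, 1.0], 3)
```

With this data, DATA[2][1] holds the score for the third compound - second compound pair.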
When all compounds J have been looped 378 through for all compound I loops 376, the input data process 350 is ended 382.
As previously described, the input data process 350 checks 370 the user-indicated data format for the input file. If the data format is equal to one ("1") 384, meaning the input file comprises raw data for a set of compounds to be clustered and a set of mutant strains, the input data process 350 requests the user to input 386 to the computer, i.e., computer or other processing device or entity, the total number of mutant strains, e.g., STRN, used for generating the raw data. In a presently preferred embodiment, the total number of mutant strains should be in the range of one to fifty. The user is also requested to input 388 to the computer a high limit raw data value, e.g., HIGHROUND. In a presently preferred embodiment, the suggested value for HIGHROUND is one hundred. The user is also requested to input 390 to the computer a low limit raw data value, e.g., LOWROUND. In a presently preferred embodiment, the suggested value for LOWROUND is zero. Generally, the high limit raw data value and the low limit raw data value are used to bound the input raw data for a set of compounds to be clustered, and a set of mutant strain pairs.
The input data process 350 then loops I 392 times, once for each of the compounds in the set of compounds, e.g., N compounds, to be clustered. The input data process 350, for each compound I, i.e., Ith compound, loops J 394 times, once for each entry in the input file for compound I.
For each loop J 394, the input data process 350 checks 396 if the input entry from the input file is the first entry for compound I. In a presently preferred embodiment, if it is 426 the first entry for compound I, it is the compound code identification for compound I. The input data process 350 reads in 402 the compound code for compound I from the input file, and stores it in an appropriate entry in a table, e.g., CODE[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
If the current input entry for compound I is not 428 the first entry in the input file for compound I, the input data process 350 checks 398 if it is the second entry in the input file for compound I. In a presently preferred embodiment, if it is 430 the second entry for compound I, it is the compound name for compound I. The input data process 350 reads in 404 the compound name for compound I from the input file, and stores it in an appropriate entry in a table, e.g., NAME[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
If the current input entry for compound I is not 432 the second entry in the input file for compound I, the input data process 350 checks 400 if it is the third entry in the input file for compound I. In a presently preferred embodiment, if it is 434 the third entry for compound I, it is the compound description identification for compound I. The input data process 350 reads in 406 the compound description for compound I from the input file, and stores it in an appropriate entry in a table, e.g., DESC[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
If the current input entry for compound I is not 438 the third entry in the input file for compound I, then it is a raw data value, for example, but not limited to, an interaction value or an inhibitor value, for compound I and one of the set of mutant strains. The input data process 350 reads in 408 the raw data value for the compound I - mutant strain K pair and stores it in an appropriate entry in a temporary array, e.g., V[I][K]. In a presently preferred embodiment, while the mutant strain K value is related to the loop J 394 value, it is not the same, as loop J 394 loops through more than the respective raw data values for a compound I.
An exemplary array V 450, shown in Figure 11, depicts raw data, for example, but not limited to, interaction values or inhibitor values, for n compounds 452 and p mutant strains 454. Thus, in exemplary array V 450 there are n times p (n x p) raw data values 456; i.e., one raw data value for each compound n - mutant strain p pair.
In a presently preferred embodiment, the input data process 350 checks 410 whether the input raw data value for compound I - mutant strain K, e.g., V[I][K], is greater than the high limit raw data value, e.g., HIGHROUND. If it is 412, the input raw data value is set 414 to HIGHROUND, e.g., V[I][K] is set equal to HIGHROUND. The input data process 350 then loops J 394, to process the next input entry for compound I. In a presently preferred embodiment, if the input raw data value for the compound I - mutant strain K pair, e.g., V[I][K], is not 416 greater than HIGHROUND, the input data process 350 checks 418 to see if it is less than the low limit raw data value, e.g., LOWROUND. If it is 420, the input raw data value is set 422 to LOWROUND, e.g., V[I][K] is set equal to LOWROUND. The input data process 350 then loops J 394, to process the next input entry for compound I. If the input raw data value for the compound I - mutant strain K pair, e.g., V[I][K], is not 424 less than LOWROUND, the input data process 350 loops J 394, to process the next input entry value for compound I.
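The handling of one compound's input entries, including the bounding of raw data values by HIGHROUND and LOWROUND, can be sketched in Python. The sketch assumes the entries for a compound have already been split into a list of strings; the function name `parse_compound_record` and the sample record are hypothetical, while the default limits of zero and one hundred are the suggested values given above.

```python
def parse_compound_record(fields, lowround=0.0, highround=100.0):
    """Split one compound's entries into code, name, description and
    raw data values, bounding each value to [lowround, highround]."""
    code, name, desc = fields[0], fields[1], fields[2]
    values = []
    for entry in fields[3:]:        # remaining entries: one per mutant strain
        value = float(entry)
        if value > highround:
            value = highround       # set to the high limit raw data value
        elif value < lowround:
            value = lowround        # set to the low limit raw data value
        values.append(value)
    return code, name, desc, values


# Illustrative record only.
record = ["C001", "cmpdA", "test compound", "120", "-5", "42.5"]
code, name, desc, row = parse_compound_record(record)
```

The returned list of values corresponds to one row of the temporary array V, i.e., V[I][K] for each mutant strain K.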
Once all the input file entries J for compound I are read in from the respective input file and stored in the appropriate table or array, the input data process 350 loops J 440 times, once for all the compounds up to and including the Ith compound. For example, if compound I is the fifth compound to be processed by the input data process 350, then the input data process 350 loops J five times, through the first to fifth compounds. For each loop J 440, the input data process 350 generates an object similarity score for the compound I - compound J pair. In a presently preferred embodiment, the object similarity score generated for a compound I - compound J compound pair is a correlation value. Generally, the correlation value, i.e., the similarity value or measure, for a compound I - compound I pair is one, as compound I is being correlated with itself.
In a presently preferred embodiment, a correlation value for a compound I - compound J compound pair is generated from at least one raw data value for compound I and at least one raw data value for compound J. A presently preferred embodiment of an equation for generating a correlation value for a compound I - compound J pair is shown in Equation 1.
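$$\text{correlation value} = \frac{\sum_{k}(I_k - \bar{I})(J_k - \bar{J})}{\sqrt{\sum_{k}(I_k - \bar{I})^2 \, \sum_{k}(J_k - \bar{J})^2}}$$ Equation 1
In Equation 1, $I_k$ and $J_k$ are the raw data values for compounds I and J and the respective mutant strain $k$, and $\bar{I}$ and $\bar{J}$ are the respective mean values for the compounds I and J.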
After each correlation value is generated 442 for a respective compound I - compound J pair, the input data process 350 loops J 440, to generate the correlation value for the next compound I - compound J pair. When all compounds J have been looped 440 through, the input data process 350 loops 392 to process the next compound I raw data from the input file. When all compounds I have been looped 392 through, i.e., their respective data is input and processed, the input data process 350 is ended 449.
In a presently preferred embodiment, the correlation value for a compound pair is stored in an array, e.g., DATA[ ][ ]. Thus, the correlation value for a compound I - compound J pair is stored in DATA[I][J]. In a presently preferred embodiment, the array of correlation values, e.g., DATA, is indexed from zero. For example, DATA[0][0] represents the correlation value for the first compound-first compound pair. As another example, DATA[1][3] represents the correlation value for the second compound-fourth compound pair. In an alternative embodiment, the DATA array is indexed from one. An exemplary DATA array 460 of correlation values is shown in Figure 12, for n compounds. Thus, DATA array 460 has n compound rows 462 and n compound columns 464. As each correlation value 466 of DATA array 460 is a correlation value for a compound - compound pair, the DATA array 460 is indexed by compound for both its rows and its columns.
As can be determined from the previous description of the input data process 350, as well as can be seen in DATA array 460, a DATA array 460 need only have valid entries in its lower half 468. As the correlation value for any compound I - compound J pair is the same as the correlation value for the respective compound J - compound I pair, the top half 470 of a DATA array 460 is a mirror image of the correlation values in the lower half 468 of the DATA array 460. Also, as previously explained, the correlation values on the diagonal 472 of a DATA array 460 all have a value of one. This is because the correlation values on the diagonal 472 are the values for the identical compound pairs, i.e., compound I - compound I pairs. The correlation values, i.e., similarity values or measures or scores, are necessarily one for a compound I - compound I pair as compound I is being correlated with itself.
As previously explained with regards to Figures 10A-10B, an object similarity score is generated 442 for each compound pair in the set of compounds to be clustered if object similarity scores are not provided in the input file, or otherwise provided to the pattern recognition oriented cluster process 200. In a presently preferred embodiment, an object similarity score is a correlation value, and a correlation value for a compound pair is generated from at least one raw data, or compound, value for the first compound of the compound pair and at least one raw data, or compound, value for the second compound of the compound pair. In a more presently preferred embodiment, the correlation value for a compound pair is derived from all the raw data values for the respective compounds of the compound pair and the mutant strains from the set of mutant strains.
In a presently preferred embodiment of a calculate correlation value process 480, as depicted in Figure 13, summation variables, e.g., SUM_AA, SUM_AB and SUM_BB are initialized to a value of zero 482. The calculate correlation value process 480 generates a correlation value for a compound I - compound J pair.
The average raw data value, e.g., AVE_A, for compound I is generated 483. Generally, all compound I - mutant strain raw data values, for all mutant strains, are added together, and the sum is divided by the total number of mutant strains to generate the average raw data value, e.g., AVE_A, for compound I.
The average raw data value, e.g., AVE_B, for compound J is also generated 484. Generally, all compound J - mutant strain raw data values, for all mutant strains, are added together, and the sum is divided by the total number of mutant strains to generate the average raw data value, e.g., AVE_B, for compound J.
The calculate correlation value process 480 then loops K 485 times, once for each of the mutant strains in the set of mutant strains. For each K loop 485, the calculate correlation value process 480 sets 486 the entry in the temporary array of raw data for the compound I - mutant K pair, e.g., V[I][K], to its original raw data value minus the average raw data value for compound I, e.g., AVE_A, as shown in Equation 2.
V[I][K] = V[I][K] - AVE_A Equation 2
For each loop K 485, the calculate correlation value process 480 also sets 487 the entry in the temporary array of raw data for the compound J - mutant K pair, e.g., V[J][K], to its original raw data value minus the average raw data value for compound J, e.g., AVE_B, as shown in Equation 3.
V[J][K] = V[J][K] - AVE_B Equation 3
Thus, each raw data, or compound, value, for a respective compound, is decreased, or otherwise altered, by the average raw data value for the compound. For each K loop 485, the calculate correlation value process 480 also generates 488 a new value of SUM_AB, SUM_AA and SUM_BB. More specifically, for each loop K 485, the new value for the compound I - mutant K pair in the temporary array of raw data, e.g., V[I][K], and the new value for the compound J - mutant K pair in the temporary array of raw data, e.g., V[J][K], are multiplied together and added to a running sum, e.g., SUM_AB, as shown in Equation 4.
SUM_AB = SUM_AB + (V[I][K] x V[J][K]) Equation 4
Thus, the raw data value for each compound I - mutant K pair, altered by the average raw data value for compound I, is multiplied by the raw data value for the respective compound J - mutant K pair, altered by the average raw data value for compound J. A running summation, e.g., SUM_AB, of these multiplications is also generated. For each loop K 485, the new value for the compound I - mutant K pair in the temporary array of raw data, e.g., V[I][K], is squared, i.e., multiplied with itself, and added to a running sum, e.g., SUM_AA, as shown in Equation 5.
SUM_AA = SUM_AA + (V[I][K] x V[I][K]) Equation 5
Thus, the raw data value for each compound I - mutant K pair, altered by the average raw data value for compound I, is multiplied by itself. A running summation, e.g., SUM_AA, of these multiplications is also generated.
Also for each loop K 485, the new value for the compound J - mutant K pair in the temporary array of raw data, e.g., V[J][K], is squared, i.e., multiplied with itself, and added to a running sum, e.g., SUM_BB, as shown in Equation 6.
SUM_BB = SUM_BB + (V[J][K] x V[J][K]) Equation 6
Thus, the raw data value for each compound J - mutant K pair, altered by the average raw data value for compound J, is multiplied by itself. A running summation, e.g., SUM_BB, of these multiplications is also generated.
After looping K 485 times, through all compound I - mutant strains K pairs and through all compound J - mutant strains K pairs, the correlation value for the compound I - compound J pair is generated 490. The correlation value for a compound I - compound J pair is set to SUM_AB divided by the square root of the multiplication of SUM_AA and SUM_BB, as shown in Equation 7.
$$\text{correlation value} = \frac{\text{SUM\_AB}}{\sqrt{\text{SUM\_AA} \times \text{SUM\_BB}}}$$ Equation 7
Once the correlation value for a compound I - compound J pair is generated 490, the calculate correlation value process 480 is ended 491.
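The calculate correlation value process 480 can be sketched in Python as follows. The variable names mirror AVE_A, AVE_B, SUM_AB, SUM_AA and SUM_BB from the description; the function name `correlation_value` is hypothetical. Unlike the described process, this sketch does not overwrite the entries of the temporary array V, which is an implementation choice rather than part of the embodiment.

```python
import math


def correlation_value(a, b):
    """Correlation value for two compounds, given their raw data
    values over the same mutant strains (Equations 2 through 7)."""
    ave_a = sum(a) / len(a)                 # average raw data value, compound I
    ave_b = sum(b) / len(b)                 # average raw data value, compound J
    sum_ab = sum_aa = sum_bb = 0.0
    for value_a, value_b in zip(a, b):      # loop K: one pass per mutant strain
        dev_a = value_a - ave_a             # Equation 2
        dev_b = value_b - ave_b             # Equation 3
        sum_ab += dev_a * dev_b             # Equation 4
        sum_aa += dev_a * dev_a             # Equation 5
        sum_bb += dev_b * dev_b             # Equation 6
    return sum_ab / math.sqrt(sum_aa * sum_bb)   # Equation 7


same_trend = correlation_value([1, 2, 3, 4], [2, 4, 6, 8])
opposite_trend = correlation_value([1, 2, 3, 4], [4, 3, 2, 1])
```

If every raw data value for a compound is identical, the denominator is zero and this sketch raises a ZeroDivisionError; the description does not say how that case is treated.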
Referring again to Figure 8, in a presently preferred embodiment of a pattern recognition oriented cluster process 200, once the input data process 210 is executed, a determine correlation distribution process 215 is executed. Generally, the determine correlation distribution process 215 generates a distribution of the correlation values, i.e., similarity measures, for all compound pairs in the set of compounds to be clustered. As shown in Figure 14, the determine correlation distribution process 215 loops 502 through each unique correlation value in the array of correlation values, e.g., DATA array. As previously discussed with regard to Figure 12, in a presently preferred embodiment, the unique correlation values for a set of compounds to be clustered are stored in the lower half, or bottom, 468 of the DATA array, below, and not including, the diagonal 472. Thus, in a presently preferred embodiment, the determine correlation distribution process 215 loops 502 through the lower half 468 of the DATA array, i.e., through the unique correlation values.
In a presently preferred embodiment, the determine correlation distribution process 215 multiplies 504 each unique correlation value by ten ("10"). The determine correlation distribution process 215 then determines 506 the group, of a number of groups of correlation ranges, that the respective unique correlation value, multiplied by a factor of ten, is within. In a presently preferred embodiment, there are nineteen groups of correlation ranges that each of the unique correlation values in the DATA array may fall within.
Table 1 is a presently preferred embodiment of the groups of correlation ranges. Column A of Table 1 represents the presently preferred range of correlation values for each of the nineteen groups. Column B of Table 1 represents the presently preferred range of correlation values for each of the nineteen groups multiplied by a factor of ten.
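[Table 1, an image in the original publication: Column A lists the correlation value range of each of the nineteen groups; Column B lists the same range multiplied by a factor of ten.]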
Table 1
In a presently preferred embodiment, the determine correlation distribution process 215 keeps a running summation 508, e.g., COUNT[ ], of the number of unique correlation values from the DATA array that are within each of the nineteen groups of correlation ranges.
In a presently preferred embodiment, the determine correlation distribution process 215 also keeps a running summation 510, e.g., SUM, of all the unique correlation values in the DATA array.
Once all the unique correlation values in the DATA array are categorized into one of the groups of correlation ranges, the determine correlation distribution process 215 generates 512 the average correlation value from the running summation, e.g., SUM, for all the unique correlation values, i.e., for all unique compound I - compound J pairs, where I does not equal J. The determine correlation distribution process 215 also determines 514 the percentage of unique correlation values in each of the nineteen groups of correlation ranges, from the various stored running summations, e.g., COUNT[ ], for all the unique correlation values, i.e., for all unique compound I - compound J pairs, where I does not equal J. For example, the determine correlation distribution process 215 determines 514 the percentage of unique correlation values in the DATA array that are in the 1st group, e.g., COUNT[0], of correlation ranges, the percentage of unique correlation values in the DATA array that are in the 2nd group of correlation ranges, e.g., COUNT[1], and so on.
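The distribution steps described above (scale by ten, assign to one of nineteen groups, accumulate COUNT[ ] and SUM, then derive the average and the percentages) can be sketched as follows. The group boundaries are an assumption, because the bounds of Table 1 are not reproduced in this text: the sketch supposes groups of width 0.1 in descending order, with the nineteenth group collecting every value below -0.8. The function names are hypothetical.

```python
import math

NUM_GROUPS = 19

def group_index(corr):
    """Map a correlation value to one of the nineteen range groups."""
    scaled = corr * 10  # step 504: multiply by ten
    # ASSUMPTION: group 0 covers 0.9 and above, group 1 covers 0.8 to 0.9,
    # and so on; the last group absorbs everything below -0.8.
    idx = 9 - math.floor(scaled)
    return min(NUM_GROUPS - 1, max(0, idx))

def correlation_distribution(data):
    """Return (COUNT, average, percentages) over the lower half of DATA."""
    count = [0] * NUM_GROUPS
    total = 0.0  # running summation SUM
    n = 0
    for i in range(len(data)):
        for j in range(i):  # below, and not including, the diagonal
            count[group_index(data[i][j])] += 1
            total += data[i][j]
            n += 1
    average = total / n  # assumes at least two compounds
    percent = [100.0 * c / n for c in count]
    return count, average, percent
```

With these assumed boundaries, COUNT[0], COUNT[1] and COUNT[2] together cover all correlation values of roughly 0.7 and above.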
Once the percentage of unique correlation values in each of the correlation range groups, e.g., COUNT[ ], is generated 514, the determine correlation distribution process 215 is ended 516.
Referring again to Figure 8, in a presently preferred embodiment of a pattern recognition oriented cluster process 200, a set grouping parameters process 220 is executed. Generally, the set grouping parameters process 220 allows the user of the pattern recognition oriented cluster process 200 to choose whether to use default, pre-established, grouping and pattern matrix generation, or group selection, parameters, or input their own.
As shown in Figure 15, a presently preferred embodiment of the set grouping parameters process 220 requests 566 the user to input to the computer, i.e., computer or other processing device or entity, whether they wish to change the default criteria for establishing groups of compounds. The user is requested 566 to indicate whether or not they wish to choose their own group formation selection, or correlation selecting, or grouping, limits, or parameters. The set grouping parameters process 220 expects a YES or NO response from the user. If the user responds with a NO 567, indicating that they wish to use the default correlation selecting limits, or grouping parameters, the set grouping parameters process 220 checks 565 whether there are any correlation values in the DATA array that have values that are greater than or equal to the default first group selection parameter value. In a presently preferred embodiment, the set grouping parameters process 220 checks the number of correlation values in the first, second and third groups of correlation ranges, e.g., COUNT[0], COUNT[1] and COUNT[2], which were set in the determine correlation distribution process 215.
In a presently preferred embodiment, the first group selection parameter is an initial correlation group limit parameter. In a presently preferred embodiment, the default initial correlation group limit parameter, e.g., INICORR, has a value of 0.7. In a presently preferred embodiment, if there are no 573 unique correlation values in the DATA array equal to or greater than INICORR, e.g., COUNT[0] and COUNT[1] and COUNT[2] are all equal to zero, the set grouping parameters process 220 ends 570 the pattern recognition oriented cluster process 200. This is because there are no groups that can be formed with the default group selection parameters, or correlation selecting limits. If there are 574 unique correlation values in DATA array equal to or greater than the default INICORR, e.g., either COUNT[0] and/or COUNT[1] and/or COUNT[2] is greater than zero, the set grouping parameters process 220 sets 571 each of the group selection parameters to a default value. In a presently preferred embodiment, there are three group selection parameters. A first group selection parameter is used for selecting a first compound pair for a new group; it is a group core defining limit. In a presently preferred embodiment, the first pair of compounds of a new group of compounds must have a correlation value greater than or equal to the first group selection parameter. In a presently preferred embodiment, the first group selection parameter is an initial correlation group limit parameter, e.g., INICORR, whose default value is 0.7.
A second group selection parameter is used for selecting an add-in, or new, compound for a group. The second group selection parameter is an add-in score, or limit or value, that defines the nearest neighbor compounds of the core group of compounds. In a presently preferred embodiment, an add-in, or new, compound to a group of compounds must, with a compound member of the group, have a correlation value greater than or equal to the second group selection parameter. In a presently preferred embodiment, the second group selection parameter is an add-in correlation group limit parameter, e.g., MAXCORR, whose default value is 0.6.
A third group selection parameter is used for defining the furthest neighbor compounds in a core group of compounds. In a presently preferred embodiment, an add-in, or new, compound to a group of compounds must, with each compound member of the group, have a correlation value greater than or equal to the third group selection parameter. In a presently preferred embodiment, the third group selection parameter is a minimum correlation group limit parameter, e.g., MINCORR, whose default value is 0.4.
In a more presently preferred embodiment, a fourth group selection parameter is used to define the compactness of the established groups of compounds. The fourth group selection parameter is a mean distance score, or limit, of a potential add-in compound and all the compounds already in the group. In a presently preferred embodiment, the average correlation value for the pairs of compounds comprised of the potential add-in compound and each compound already in the group must be greater than or equal to the fourth group selection parameter. In a presently preferred embodiment, the fourth group selection parameter is a minimum average correlation group limit parameter, e.g., AVECORR, whose default value is 0.5.
In a presently preferred embodiment, if the user responds with a YES 568 to the request for whether the user wishes to change the default correlation selecting limits, the set grouping parameters process 220 requests the user to input 569 a value for each of four correlation group selection parameters. The set grouping parameters process 220 requests the user to input an initial correlation value for selecting a first compound pair for a new group, e.g., INICORR. The set grouping parameters process 220 also requests the user to input a correlation value for adding a new compound to an existing group, e.g., MAXCORR. The set grouping parameters process 220 also requests the user to input an average correlation value, e.g., AVECORR, for all compounds in a group, whenever a new compound is a possible addition to the group. The set grouping parameters process 220 also requests the user to input a minimum correlation value for any compound to be added to a group, e.g., MINCORR.
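The four group selection parameters and their stated defaults can be collected in a small structure, as in the sketch below. The class and function names are hypothetical. Note one substitution: the described process checks COUNT[0], COUNT[1] and COUNT[2], whereas this sketch scans the lower half of DATA directly, which should give the same answer for the default INICORR of 0.7 and also works for a user-supplied value.

```python
from dataclasses import dataclass

@dataclass
class GroupingParameters:
    inicorr: float = 0.7  # first pair of a new group must reach this value
    maxcorr: float = 0.6  # add-in compound must reach this with some member
    mincorr: float = 0.4  # add-in compound must reach this with every member
    avecorr: float = 0.5  # minimum average correlation within the group

def can_form_groups(data, params):
    """True if any unique pair (below the diagonal) reaches INICORR."""
    return any(
        data[i][j] >= params.inicorr
        for i in range(len(data))
        for j in range(i)
    )
```

For example, GroupingParameters() yields the defaults, while GroupingParameters(inicorr=0.8) models a user who answers YES and tightens only the group core limit.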
Whether the user has chosen to use default group selection parameter values or their own group selection parameter values, the set grouping parameters process 220 sets 572 the default number of output color groups, e.g., COLOR[ ], to six, for use by a generate pattern matrix process 255 of the pattern recognition oriented cluster process 200, for formatting and outputting a pattern matrix. The set grouping parameters process 220 then requests 575 the user to indicate whether they wish to use their own number of output color groups.
If the user indicates that NO 576, they do not wish to use their own number of output color groups, i.e., they will use the default number of output color groups, the set grouping parameters process 220 sets 580 each of the six default color groups to the lower correlation value of their correlation value range. Specifically, in a presently preferred embodiment, the set grouping parameters process 220 sets each of the default color groups, e.g., COLOR[0] to COLOR[5], to a default lower correlation value. A presently preferred embodiment of the default correlation value range and the default lower correlation value for each of the six default color groups is shown in Table 2.
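[Table 2, an image in the original publication: the default correlation value range and the default lower correlation value for each of the six default color groups.]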
TABLE 2
If, however, the user indicates that YES 577, they do wish to use their own output color groups, the user is requested 578 to input to the computer, i.e., computer or other processing device or entity, the number of color groups, e.g., M, that they want. In a presently preferred embodiment, the maximum number of color groups that a user may designate is ten.
For each color group, e.g., COLOR[ ], the set grouping parameters process 220 then requests 579 the user to input to the computer the respective lower correlation value. In a presently preferred embodiment, the set grouping parameters process 220 suggests that the user set the final color group lower correlation value to -1.0. In a presently preferred embodiment, COLOR[ ] is indexed from zero. The set grouping parameters process 220, for each color group, e.g., X from one to the total number of color groups, e.g., M, sets COLOR[X-1] to the user inputted lower correlation value.
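The user-input path for the color groups might look like the following sketch. Only the zero-based storage (COLOR[X-1] for X from one to M) and the maximum of ten groups come from the text. The lookup function color_group, the descending order of the lower values, and the sample values are assumptions made for illustration, since the defaults of Table 2 and the use of COLOR[ ] by the generate pattern matrix process 255 are not described here.

```python
MAX_COLOR_GROUPS = 10

def set_color_groups(lower_values):
    """Store user-supplied lower correlation values in a zero-indexed COLOR list."""
    m = len(lower_values)  # number of color groups M
    if not 1 <= m <= MAX_COLOR_GROUPS:
        raise ValueError("between one and ten color groups are allowed")
    color = [0.0] * m
    for x in range(1, m + 1):  # X runs from one to M
        color[x - 1] = lower_values[x - 1]
    return color

def color_group(corr, color):
    """ASSUMPTION: a value belongs to the first group whose lower value it reaches."""
    for x, lower in enumerate(color):
        if corr >= lower:
            return x
    return len(color) - 1

# Hypothetical user input; the final lower value follows the suggested -1.0.
COLOR = set_color_groups([0.8, 0.5, 0.0, -1.0])
```

Setting the final lower value to -1.0 guarantees, under this reading, that every correlation value falls into some color group.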
In an alternative embodiment, COLOR[ ] is indexed from one. In this alternative embodiment, the set grouping parameters process 220, for each color group X from one to the total number of color groups M sets COLOR[X] to the user inputted lower correlation value.
Once the default lower correlation values are set for each default color group, or, alternatively, the user inputs all the lower correlation values for each of the user-requested color groups, the set grouping parameters process 220 is ended 581.
Referring again to Figure 8, upon executing the set grouping parameters process 220 in the pattern recognition oriented cluster process 200, a group compounds process 225 is executed. Generally, the group compounds process 225 groups the compounds in the set of compounds to be clustered into one or more groups, based on default, or alternatively, user-specified, group selection parameters, or correlation selecting limits or cluster parameters.
Referring to Figure 16, in a presently preferred embodiment, the group compounds process 225 first writes 542 the mirror image of the bottom half of the correlation value array, e.g., DATA, to the top half of the array. For example, referring to Figure 17A, only the bottom half 651 of the DATA array 654, below the diagonal 652, has valid correlation values at this time. The group compounds process 225 writes 542 all the values in the bottom half 651 of the DATA array to the respective entries in the top half 653 of the array. Thus, the resultant DATA array, e.g. DATA array 655 of Figure 17B, has valid correlation values for all compound I - compound J pairs, where I does not equal J. In the examples of Figures 17A and 17B, the correlation value in DATA[2][1] is written to DATA[1][2]. Likewise, the correlation value in DATA[3][1] is written to DATA[1][3], and the correlation value in DATA[3][2] is written to DATA[2][3].
Referring again to Figure 16, after writing 542, or otherwise filling in, the DATA array, the group compounds process 225 loops 526 through all compound pairs in the DATA array. The group compounds process 225 attempts to locate 528 an initial compound pair for a new group, e.g., G; i.e., the group compounds process 225 attempts to locate 528 a compound pair whose correlation value is greater than or equal to the initial correlation group limit parameter INICORR.
As previously discussed, the presently preferred default value for INICORR is 0.7. Referring to exemplary DATA array 660 in Figure 18, the first compound pair with a correlation value greater than or equal to the default value of INICORR are compounds one and three; i.e. DATA[1][3] is equal to 0.7. DATA[1][1], while having a correlation value of 1.0, which is greater than INICORR equal to 0.7, is not the first compound pair as it is comprised of a single compound, rather than a compound pair. Thus, the first two compounds of the new group G are compound one and compound three.
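The mirroring step 542 and the search 528 for an initial pair can be sketched as below. The function names are hypothetical, and the sample array is not Figure 18: it is an invented six-compound array chosen only to agree with the values quoted in the text (0.7 for compounds one and three, 0.5 for one and six, 0.8 for three and six). The code uses zero-based indices, so compound one is index 0.

```python
def mirror_lower_to_upper(data):
    """Copy each value below the diagonal to its mirror entry above it."""
    for i in range(len(data)):
        for j in range(i):
            data[j][i] = data[i][j]

def find_initial_pair(data, inicorr=0.7):
    """Return the first (row, column) pair of distinct compounds reaching INICORR."""
    n = len(data)
    for i in range(n):
        for j in range(n):
            if i != j and data[i][j] >= inicorr:  # skip the diagonal
                return i, j
    return None

# Hypothetical DATA array with only the bottom half filled in.
DATA = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 1.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.2, 1.0, 0.0, 0.0],
    [0.3, 0.1, 0.4, 0.3, 1.0, 0.0],
    [0.5, 0.0, 0.8, 0.1, 0.2, 1.0],
]
mirror_lower_to_upper(DATA)
pair = find_initial_pair(DATA)  # (0, 2), i.e., compounds one and three
```

The diagonal entries equal 1.0 and would pass the INICORR test, which is why the scan explicitly skips entries where the row and column are the same compound.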
The group compounds process 225, upon locating 528 an initial pair of compounds for a new group G, loops 529 through all the compound pairs, now attempting to locate 530 a potential compound to add into group G. More specifically, in a presently preferred embodiment, the group compounds process 225 loops 529 through all the possible compound pairs looking for a potential add-in compound, i.e., a compound to be added to the group G, whose correlation value with a compound already in group G is greater than or equal to the add-in correlation group limit parameter MAXCORR.
As previously discussed, the presently preferred default value for MAXCORR is 0.6. In exemplary DATA array 660 of Figure 18, the first compound that, when paired with compound one or compound three of the present group G, has a correlation value greater than or equal to the default value MAXCORR is compound six; i.e., DATA[3][6] is equal to 0.8. Thus, compound six is a potential add-in compound to the present group G. Referring to all the entries in the row 661 corresponding to compound one, only DATA[1][1] and DATA[1][3] have a correlation value greater than or equal to MAXCORR; i.e., compound one - compound one has a correlation value of one, and the compound one - compound three pair has a correlation value of 0.7. However, compounds one and three, corresponding to the respective correlation values DATA[1][1] and DATA[1][3], are already in group G.
Referring to the entries in the row 662 corresponding to compound three, the first correlation value, e.g., DATA[3][1], has a correlation value greater than MAXCORR. However, as already noted, compounds three and one, corresponding to the respective correlation value DATA[3][1], are already in group G. The next entry in row 662 with a correlation value greater than MAXCORR is DATA[3][6], with a correlation value of 0.8. Compound six is not already in group G, so it is a potential add-in compound.
Upon locating 530 a potential add-in compound for group G, the group compounds process 225 checks 531 whether the correlation values of the potential add-in compound paired with each compound already in group G are all greater than or equal to the minimum correlation group limit parameter MINCORR.
As previously discussed, the presently preferred default value for MINCORR is 0.4. In our example, the compounds currently in group G are one and three. The potential add-in compound for group G is six. In DATA array 660 of Figure 18, the correlation value for the compound one - compound six pair, e.g., DATA[1][6], is 0.5, which is greater than the default value MINCORR. Also, the correlation value for the compound three - compound six pair, e.g., DATA[3][6] is 0.8, which is greater than the default value MINCORR.
If the correlation values of the potential add-in compound paired with each compound already in group G are not 532 all greater than or equal to MINCORR, then the group compounds process 225 loops 529 to the next compound pair in search of a potential add-in compound for group G. In a presently preferred embodiment, if all the correlation values of the potential add-in compound paired with each compound already in group G are 533 greater than or equal to MINCORR, the group compounds process 225 checks 534 whether, assuming the potential add-in compound were included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter AVECORR.
As previously discussed, the presently preferred default value for AVECORR is 0.5. In our example, the compounds currently in group G are one and three, and the potential add-in compound for group G is six. Thus, the group compounds process 225 checks the average correlation value of the compound pairs compound one - compound three, e.g., DATA[1][3], compound one - compound six, e.g., DATA[1][6] and compound three - compound six, e.g., DATA[3][6].
In DATA array 660 of Figure 18, the correlation value for the compound one - compound three pair, e.g., DATA[1][3], is 0.7, the correlation value for the compound one - compound six pair, e.g., DATA[1][6], is 0.5, and the correlation value for the compound three - compound six pair, e.g., DATA[3][6], is 0.8. The average correlation value of DATA[1][3], DATA[1][6] and DATA[3][6] is 0.67, which is greater than the default value AVECORR.
If the average correlation value of all the compound pairs of group G, assuming the potential add-in compound is a member of group G, is not 535 greater than or equal to AVECORR, then the group compounds process 225 loops 529 to the next compound pair, in search of a potential add-in compound for group G.
If, however, the average correlation value of all the compound pairs in group G, assuming the potential add-in compound is a member of group G, is 536 greater than or equal to AVECORR, then the group compounds process 225 checks 537 whether the potential add-in compound is already a member of group G.
If the potential add-in compound is 538 a member of group G, the group compounds process 225 loops 529 to the next compound pair, in search of a potential add-in compound for group G. If, however, the potential add-in compound is not 539 a member of group G, the group compounds process 225 adds 540 the potential add-in compound to group G. Thus, in the example of DATA array 660 of Figure 18, compound six is added to group G, which is already comprised of compounds one and three. The group compounds process 225 then loops 529 to the next compound pair, in search of another potential add-in compound for group G.
Once the group compounds process 225 loops 529 through all the possible compound pairs for group G, it then loops 526 to the next compound pair for creating a new group G of compounds. When the group compounds process 225 loops 526 through all compound pairs for initiating new groups of compounds, it is ended 527.
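The inner loop 529 of the group compounds process 225 can be sketched as a single pass that grows one group from its initial pair. This is an illustrative reading, not the patented code: the function name grow_group is hypothetical, the membership test is performed first rather than last (the outcome is the same), and the sample array is an invented one that merely agrees with the values quoted from Figure 18. Indices are zero-based, so compounds one, three and six are 0, 2 and 5.

```python
from itertools import combinations

def grow_group(data, seed, maxcorr=0.6, mincorr=0.4, avecorr=0.5):
    """Grow a group from an initial pair using the MAXCORR, MINCORR, AVECORR limits."""
    group = list(seed)
    n = len(data)
    for i in range(n):
        for j in range(n):
            # Candidate j must reach MAXCORR with a compound i already in the group.
            if i not in group or j in group or data[i][j] < maxcorr:
                continue
            # Candidate must reach MINCORR with every compound in the group.
            if any(data[m][j] < mincorr for m in group):
                continue
            # Average over all pairs of the group, candidate included.
            pairs = list(combinations(group + [j], 2))
            avg = sum(data[a][b] for a, b in pairs) / len(pairs)
            if avg >= avecorr:
                group.append(j)
    return group

# Hypothetical symmetric DATA array (not Figure 18 itself).
DATA = [
    [1.0, 0.2, 0.7, 0.1, 0.3, 0.5],
    [0.2, 1.0, 0.3, 0.2, 0.1, 0.0],
    [0.7, 0.3, 1.0, 0.2, 0.4, 0.8],
    [0.1, 0.2, 0.2, 1.0, 0.3, 0.1],
    [0.3, 0.1, 0.4, 0.3, 1.0, 0.2],
    [0.5, 0.0, 0.8, 0.1, 0.2, 1.0],
]
group = grow_group(DATA, (0, 2))  # [0, 2, 5]: compounds one, three and six
```

In this sample, compound six is admitted because 0.8 meets MAXCORR, both 0.5 and 0.8 meet MINCORR, and the average of 0.7, 0.5 and 0.8 is about 0.67, which meets AVECORR.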
A more detailed presently preferred embodiment of a group compounds process 650, as shown in Figures 19A-19C, first writes 648, or otherwise fills in, the mirror image of the bottom half of the correlation value array, e.g., DATA, to the top half of the array.
The group compounds process 650 then loops 604 through all the compound pairs, attempting to locate 600 a compound X - compound Y pair whose correlation value, e.g., DATA[X][Y], is greater than or equal to INICORR, i.e., the initial correlation group limit parameter and the minimum value for selecting a first compound pair for a group. Once a compound X - compound Y pair whose correlation value is greater than or equal to INICORR is located 600, a new group G of compounds is initiated.
The group compounds process 650 then loops 601 through each correlation value in the array of correlation values, e.g., DATA. If all correlation values have been looped 601 through for the current group G, the group compounds process 650 continues its loop 604 through all the compound pairs, attempting to locate 600 a new compound X - compound Y pair for initiating a new group G. If all compound pairs have been looped 604 through for generating new groups of compounds, the group compounds process 650 ends 602.
As previously described, the group compounds process 650 loops 601 through each correlation value in the DATA array. For each correlation value in DATA, the group compound process 650 checks 603 whether the row compound of the correlation value, e.g., DATA[row compound][column compound], is equal to compound X, which is the row compound of the initial compound pair of group G, and, thus, already a member of group G. If the row compound of the current correlation value being processed, i.e., the current correlation value, is equal to 619 compound X, the group compounds process 650 loops 601 to the next correlation value in the DATA array. For example, if the current correlation value row compound is 2 and column compound is 3, i.e., DATA[1][2], and compound 2 is the X compound in group G, the group compounds process 650 loops 601 to the next correlation value in DATA array, i.e., DATA[1][3].
Thus, in a presently preferred embodiment, the group compounds process 650 does not check the correlation values DATA[A][B] where A is equal to X, the row compound of the initial compound pair comprising group G. In the previous example, where the X compound of group G is compound 2, the group compounds process 650 does not check any correlation value in the compound 2 row of the DATA array; i.e., DATA[1][y].
For each correlation value in DATA, the group compound process 650 also checks 603 whether the column compound of the correlation value, e.g., DATA[row compound][column compound], is equal to compound Y, which is the column compound of the initial compound pair of group G, and, thus, already a member of group G. If the column compound of the current correlation value is equal to 619 compound Y, the group compounds process 650 loops 601 to the next correlation value in the DATA array. For example, if the current correlation value row compound is 2 and column compound is 3, i.e., DATA[1][2], and compound 3 is the Y compound in group G, the group compounds process 650 loops 601 to the next correlation value in DATA array, i.e., DATA[1][3].
Thus, in a presently preferred embodiment, the group compounds process 650 does not check the correlation values DATA[A][B] where B is equal to Y, the column compound of the initial compound pair comprising group G. In the previous example, where the Y compound of group G is compound 3, the group compounds process 650 does not check any correlation value in the compound 3 column of the DATA array; i.e., DATA[x][2].
If the row compound of the current correlation value, i.e., DATA[row compound][column compound], is not equal to 620 compound X of group G and the column compound of DATA[row compound][column compound] is not equal to 620 compound Y of group G, then the group compounds process 650 checks 605 whether the row compound of the current correlation value is already a member of group G. Thus, the group compounds process 650 checks 605 whether the row compound of the current correlation value is a member of group G, but not the row compound X comprising the initial pair of compounds of group G. If there is only the initial pair of compounds in group G at this time, i.e., compounds X and Y, then the group compounds process 650 checks 605 whether the row compound of the current correlation value is equal to Y.
For example, exemplary group G is comprised of initial compound pair compound 1 - compound 3, where compound 1 is the X compound and compound 3 is the Y compound, i.e., DATA[0][2] is the first correlation value found that is greater than or equal to INICORR for new group G. The group compounds process 650 checks 605 whether the current correlation value is in the row of compound 3, i.e., DATA[2][y]. As another example, exemplary group G is comprised of initial compound pair compound 2 - compound 4, where compound 2 is the X compound and compound 4 is the Y compound, i.e., DATA[1][3] is the first correlation value that is greater than or equal to INICORR for new group G. Further, exemplary group G is also comprised of add-in compound 5. The group compounds process 650 checks 605 whether the current correlation value is in the row of compound 4, i.e., DATA[3][y], or in the row of compound 5, i.e., DATA[4][y].
If the row compound for the current correlation value is equal to 622 a compound already in group G, but not the initial row compound X of group G, then the group compounds process 650 checks 607 whether the current correlation value is greater than or equal to the group selection parameter for adding a compound to a group, e.g., MAXCORR. In a presently preferred embodiment, the default value of MAXCORR is 0.6.
If the current correlation value is not 613 greater than or equal to MAXCORR, the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
If the current correlation value is 614 greater than or equal to MAXCORR, the group compounds process 650 sets 616 a flag, e.g., FLAG, to indicate that the column compound of the current correlation value may possibly be added to the group G, e.g., FLAG is set to TRUE. The group compounds process 650 then loops 608 through all the compounds already in group G. For each compound in group G, the group compounds process 650 checks 609 whether the correlation value of the group G compound - current correlation value column compound pair, e.g., DATA[compound in group G][current correlation value column compound], is less than the minimum correlation group limit parameter, e.g., MINCORR. In a presently preferred embodiment, the default value of MINCORR is 0.4.
For example, exemplary group G comprises compounds 2, 5, and 6. The current correlation value is the correlation value for the row compound 5 - column compound 8 compound pair. The group compounds process 650 checks 609 whether any of the correlation values DATA[1][7], for respective compound 2 of group G - compound 8 (column compound of current correlation value) pair, DATA[4][7], for respective compound 5 of group G - compound 8 (column compound of current correlation value) pair, or DATA[5][7], for respective compound 6 of group G - compound 8 (column compound of current correlation value) pair, is less than MINCORR.
If the correlation value of any group G compound - current correlation value column compound pair is less than 627 MINCORR, the group compounds process 650 sets 611 a flag, e.g., FLAG, to indicate that the column compound of the current correlation value is not to be added to group G, e.g., FLAG is set to FALSE. Using the previous example, if DATA[1][7] or DATA[4][7] or DATA[5][7] is less than 627 MINCORR, the group compounds process 650 sets 611 FLAG to FALSE.
If the correlation values of all group G compound - current correlation value column compound pairs are greater than or equal to 628 MINCORR, FLAG is not set FALSE.
Once all compounds already in group G are looped 608 through, the group compounds process 650 checks 635 whether FLAG is set to TRUE. If no 636, the current correlation value column compound is not to be added to the current group G, and the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
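The FLAG logic of steps 616, 608, 609 and 611 can be sketched as follows, using the zero-based indexing of this embodiment (compound 2 is index 1, compound 8 is index 7). The function name and the correlation values in the sample array are hypothetical.

```python
def passes_mincorr(data, group, candidate, mincorr=0.4):
    """Return FLAG: True unless some group member falls below MINCORR with the candidate."""
    flag = True  # step 616: candidate may possibly be added
    for member in group:  # loop 608 over compounds already in group G
        if data[member][candidate] < mincorr:  # check 609
            flag = False  # step 611: candidate is not to be added
    return flag

# Group G holds compounds 2, 5 and 6; the candidate is compound 8.
GROUP = [1, 4, 5]
CANDIDATE = 7
DATA = [[0.0] * 8 for _ in range(8)]
DATA[1][7], DATA[4][7], DATA[5][7] = 0.5, 0.7, 0.45  # hypothetical values
flag = passes_mincorr(DATA, GROUP, CANDIDATE)  # True: all three reach 0.4
```

The loop deliberately runs to completion instead of returning early, mirroring the description in which FLAG is inspected only after all compounds in group G have been looped through.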
If FLAG is 641 set TRUE, the group compounds process 650 checks whether, if the current correlation value column compound was included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR. A presently preferred embodiment default value for AVECORR is 0.5.
In a presently preferred embodiment, the group compounds process 650 assumes that the current correlation value column compound is a compound of group G and sums 643 the correlation values for all the compound pairs in group G. The group compounds process 650 then determines the average 638, e.g., AVG, correlation value for all the compound pairs in group G, by dividing the summation of the correlation values for all the compound pairs in group G by the number of compound pairs in group G.
Using the previous example, the group compounds process 650 sums 643 the correlation values DATA[1][4], for group G compound 2 - group G compound 5 compound pair, DATA[1][5], for group G compound 2 - group G compound 6 compound pair, DATA[4][5], for group G compound 5 - group G compound 6 compound pair, DATA[1][7], for group G compound 2 - current correlation value column compound 8 compound pair, DATA[4][7], for group G compound 5 - current correlation value column compound 8 compound pair, and DATA[5][7], for group G compound 6 - current correlation value column compound 8 compound pair. The group compounds process 650 then determines the group G compound pair correlation value average 638, e.g., AVG, by dividing the summation of the correlation values for all the group G compound pairs by the number of compound pairs in group G; in our example, the summation of the correlation values of all the compound pairs in group G is divided by six (6).
In a presently preferred embodiment, the group compounds process 650 then checks 639 whether AVG is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR. If AVG is not 640 greater than or equal to AVECORR, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, AVG is 625 greater than or equal to AVECORR, the group compounds process 650 checks 629 whether the current correlation value column compound is already a member of group G.
If the current correlation value column compound is 634 already a member of group G, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, the current correlation value column compound is not 633 already a member of group G, the group compounds process 650 adds 646 the column compound for the current correlation value to group G. The group compounds process 650 then loops 601 to the next correlation value in the array DATA, to look for another potential add-in compound for group G. Thus, using the previous example, as noted, group G currently comprises compounds 2, 5 and 6 and the current correlation value column compound is compound 8. If the AVG of the correlation values for DATA[1][4], DATA[1][5], DATA[1][7], DATA[4][5], DATA[4][7] and DATA[5][7] is 625 greater than or equal to AVECORR, and as the current correlation value column compound, compound 8, is not 633 already a member of group G, then compound 8 would be added 646 to group G.
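The averaging of steps 643 and 638 amounts to taking every unordered pair among the group members plus the candidate. A minimal sketch follows, with a hypothetical function name and hypothetical correlation values; indices are zero-based as in this embodiment.

```python
from itertools import combinations

def average_with_candidate(data, group, candidate):
    """Average correlation over all pairs of the group with the candidate included."""
    trial = list(group) + [candidate]
    pairs = list(combinations(trial, 2))  # every unordered compound pair
    total = sum(data[a][b] for a, b in pairs)  # step 643
    return total / len(pairs), pairs  # step 638: AVG

# Group G holds compounds 2, 5 and 6; the candidate is compound 8.
DATA = [[0.0] * 8 for _ in range(8)]
for (a, b), value in zip(
    [(1, 4), (1, 5), (1, 7), (4, 5), (4, 7), (5, 7)],
    [0.6, 0.5, 0.5, 0.7, 0.7, 0.6],  # hypothetical correlation values
):
    DATA[a][b] = value
avg, pairs = average_with_candidate(DATA, [1, 4, 5], 7)
```

For three members and one candidate this produces the six pairs named in the example, DATA[1][4], DATA[1][5], DATA[1][7], DATA[4][5], DATA[4][7] and DATA[5][7], so the sum is divided by six.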
As previously discussed, the group compounds process 650 checks 605 whether the row compound of the current correlation value, i.e., DATA[row compound][column compound], is already a member of group G. If it is not 621, the group compounds process 650 checks 606 whether the column compound of the current correlation value is already a member of group G. Thus, the group compounds process 650 checks 606 whether the column compound of the current correlation value is already a member of group G, but not the column compound Y comprising the initial pair of compounds of group G. If there is only the initial pair of compounds in group G at this time, i.e., compounds X and Y, then the group compounds process 650 checks 606 whether the column compound of the current correlation value is equal to X.
For example, exemplary group G is comprised of initial compound pair compound 1 - compound 3, where compound 1 is the X compound and compound 3 is the Y compound of group G. The group compounds process 650 checks 606 whether the current correlation value is in the column of compound 1, i.e., DATA[x][0].
As another example, exemplary group G is comprised of initial compound pair compound 2 - compound 4, where compound 2 is the X compound and compound 4 is the Y compound of group G. Further, exemplary group G is also comprised of add-in compound 5. The group compounds process 650 checks 606 whether the current correlation value is in the column of compound 2, i.e., DATA[x][1], or in the column of compound 5, i.e., DATA[x][4]. If the column compound for the current correlation value is not 623 equal to a compound already in group G, other than the Y compound of group G, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. Using the example above, where the current group G is comprised of X compound 2, Y compound 4 and add-in compound 5, if the current correlation value is for the row compound 2 - column compound 7 pair, then the group compounds process 650 loops 601 to the next correlation value in the array DATA because compound 7, the current correlation value column compound, is not compound 2 or compound 5, the compounds in group G other than the Y compound.
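The column check 606 reduces to a membership test that excludes the initial Y compound. The following Python sketch illustrates it under stated assumptions: the function name and the representation of group G as a list of zero-based DATA indices are hypothetical and are not taken from the figures.

```python
def column_is_eligible(column, group, y_compound):
    """True when the column compound is in group G but is not the initial Y compound."""
    return column in group and column != y_compound
```

For group G comprising X compound 2, Y compound 4 and add-in compound 5 (zero-based indices 1, 3 and 4), the columns that pass are indices 1 and 4; column index 6 (compound 7) does not pass, so the process loops to the next correlation value.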
If the column compound for the current correlation value is equal to 624 a compound already in group G, other than the initial column compound Y of group G, then the group compounds process 650 checks 607 whether the current correlation value is greater than or equal to the group selection parameter value for adding a new compound to a group, e.g., MAXCORR.
If the current correlation value is not 613 greater than or equal to MAXCORR, the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G. If the current correlation value is 615 greater than or equal to MAXCORR, the group compounds process 650 sets 617 a flag, e.g., FLAG, to indicate that the row compound of the current correlation value may possibly be added to the group G, e.g., FLAG is set to TRUE.
The group compounds process 650 then loops 608 through all the compounds already in group G. For each compound in group G, the group compounds process 650 checks 610 whether the correlation value of the current correlation value row compound - group G compound pair, e.g., DATA[current correlation value row compound][compound in group G], is less than the minimum correlation group limit parameter, e.g., MINCORR. For example, exemplary group G comprises compounds 2, 5, and 6. The current correlation value is the correlation value for the row compound 1 - column compound 6 compound pair. The group compounds process 650 checks 610 whether any of the correlation values DATA[0][1], for respective compound 1 (row compound of current correlation value) - compound 2 of group G pair, DATA[0][4], for respective compound 1 (row compound of current correlation value) - compound 5 of group G pair, or DATA[0][5], for respective compound 1 (row compound of current correlation value) - compound 6 of group G pair, is less than MINCORR. If the correlation value of any current correlation value row compound - group G compound pair is less than 631 MINCORR, the group compounds process 650 sets 612 a flag, e.g., FLAG, to indicate that the row compound of the current correlation value is not to be added to group G, e.g., FLAG is set to FALSE. Using the previous example, if DATA[0][1] or DATA[0][4] or DATA[0][5] is less than 631 MINCORR, the group compounds process 650 sets 612 FLAG to FALSE.
If the correlation values of all current correlation row compound - group G compound pairs are greater than or equal to 632 MINCORR, FLAG is not set FALSE. Once all compounds already in group G are looped 608 through, the group compounds process 650 checks 635 whether FLAG is set to TRUE. If no 636, the current correlation value row compound is not to be added to the current group G, and the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G. If FLAG is 642 set TRUE, the group compounds process 650 checks whether, if the current correlation value row compound was included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR.
In a presently preferred embodiment, the group compounds process 650 assumes that the current correlation value row compound is a compound of group G and sums 644 the correlation values for all the compound pairs in group G. The group compounds process 650 then determines the average 638, e.g., AVG, correlation value for all the compound pairs in group G, by dividing the summation of the correlation values for all the compound pairs in group G by the number of compound pairs in group G.
Using the previous example, group G is comprised of compounds 2, 5 and 6 and the current correlation value row compound is compound 1. The group compounds process 650 sums 644 the correlation values DATA[1][4], for group G compound 2 - group G compound 5 compound pair, DATA[1][5], for group G compound 2 - group G compound 6 compound pair, DATA[4][5], for group G compound 5 - group G compound 6 compound pair, DATA[0][1], for current correlation value row compound 1 - group G compound 2 compound pair, DATA[0][4], for current correlation value row compound 1 - group G compound 5 compound pair, and DATA[0][5], for current correlation value row compound 1 - group G compound 6 compound pair. The group compounds process 650 then determines the group G compound pair correlation value average 638, e.g., AVG, by dividing the summation of the correlation values for all the group G compound pairs by the number of compound pairs in group G; in our example, the summation of the correlation values of all the compound pairs in group G is divided by six (6).
In a presently preferred embodiment, the group compounds process 650 then checks 639 whether AVG is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR. If AVG is not 640 greater than or equal to AVECORR, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, AVG is 626 greater than or equal to AVECORR, the group compounds process 650 checks 630 whether the current correlation value row compound is already a member of group G. If the current correlation value row compound is 618 already a member of group G, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, the current correlation value row compound is not 637 already a member of group G, the group compounds process 650 adds 647 the row compound for the current correlation value to group G. The group compounds process 650 then loops 601 to the next correlation value in the array DATA, to look for another potential add-in compound for group G. Thus, using the previous example, as noted, group G currently comprises compounds 2, 5 and 6, and the current correlation value row compound is compound 1. If the AVG of the correlation values for DATA[1][4], DATA[1][5], DATA[4][5], DATA[0][1], DATA[0][4] and DATA[0][5] is 626 greater than or equal to AVECORR, and as the current correlation value row compound, compound 1, is not 637 already a member of group G, then compound 1 would be added 647 to group G.
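The sequence of tests applied to a row compound (MAXCORR, then MINCORR against every current member, then AVECORR, then membership) can be summarized in a Python sketch. This is an illustrative reading of the description, not the flowchart itself: the function names, the parameter order, and the choice to run the membership test first are assumptions made here. Reordering the tests does not change which compounds are accepted, because a compound is added only when all four conditions hold.

```python
from itertools import combinations


def pair_value(data, a, b):
    """Read a pair's correlation from the top half of DATA (row < column)."""
    i, j = (a, b) if a < b else (b, a)
    return data[i][j]


def may_add_row_compound(data, group, row, current_value,
                         maxcorr, mincorr, avecorr):
    """Return True when the row compound qualifies as an add-in for the group."""
    if row in group:                       # already a member: nothing to add
        return False
    if current_value < maxcorr:            # group selection parameter
        return False
    # Every row compound - group compound pair must reach MINCORR.
    if any(pair_value(data, row, member) < mincorr for member in group):
        return False
    # Assume the row compound is a member and average over all unique pairs.
    trial = sorted(set(group) | {row})
    pairs = list(combinations(trial, 2))
    avg = sum(data[i][j] for i, j in pairs) / len(pairs)
    return avg >= avecorr
```

For group G of compounds 2, 5 and 6 (indices 1, 4, 5) and row compound 1 (index 0), the average is taken over the six values DATA[1][4], DATA[1][5], DATA[4][5], DATA[0][1], DATA[0][4] and DATA[0][5].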
Referring again to Figure 8, after executing the group compounds process 225 of Figure 16 or 650 of Figures 19A, 19B and 19C, the pattern recognition oriented cluster process 200 executes a determine overlapping groups process 235. Generally, the determine overlapping groups process 235 determines if there are any established groups whose compounds are all contained in at least one other group. For each group X of compounds, the determine overlapping groups process 235 checks whether the compounds in group X are subsumed within any other group of compounds.
Referring to Figure 20, the determine overlapping groups process 235 loops X 750 times, once for each of the groups of compounds established in the group compounds process 225 of Figure 16, and/or the group compounds process 650 of Figures 19A-19C. When all the groups X of compounds are looped through, the determine overlapping groups process 235 ends 751.
For each loop X 750, the determine overlapping groups process 235 loops Y 752 times, once for each of the groups of compounds. When all the groups Y of compounds are looped 752 through, the determine overlapping groups process 235 loops 750 to the next group X of compounds.
For each group Y of compounds, the determine overlapping groups process 235 checks 753 whether group X is group Y. If so 754, the determine overlapping groups process 235 loops 752 to the next group Y. For each group Y, the determine overlapping groups process 235 also checks 753 whether group Y is already labeled an overlapping group, i.e., group Y is no longer to be processed. If group Y is 754 already labeled an overlapping group, the determine overlapping groups process 235 loops 752 to the next group Y. For each group Y, the determine overlapping groups process 235 also checks 753 whether the number of compounds in group Y is greater than the number of compounds in group X. If yes 754, the determine overlapping groups process 235 loops 752 to the next group Y.
If group Y is not 755 group X, group Y is not 755 already labeled an overlapping group, and the number of compounds in group Y is not 755 greater than the number of compounds in group X, then each of the compounds in group Y is checked, or compared, 760 to each of the compounds in group X. The determine overlapping groups process 235 next checks 756 whether all of the compounds in group Y are also in group X. If they are 758, then group Y is marked, or labeled, 759 as an overlapping group, i.e., it is no longer to be used. The determine overlapping groups process 235 then loops 752 to the next group Y. If, however, even one compound in group Y is not 757 in group X, group Y is not marked as an overlapping group at this time, and the determine overlapping groups process 235 loops 752 to the next group Y.
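The subsumption test of the determine overlapping groups process 235 can be sketched in Python as follows. The sketch follows the three skip conditions and the containment check literally; the function name, the list-of-lists representation of the groups, and the returned list of boolean marks are assumptions made for illustration.

```python
def mark_subsumed_groups(groups):
    """Return one boolean per group; True marks the group as overlapping."""
    overlapping = [False] * len(groups)
    for x, group_x in enumerate(groups):
        for y, group_y in enumerate(groups):
            # Skip: same group, already marked, or Y larger than X.
            if x == y or overlapping[y] or len(group_y) > len(group_x):
                continue
            # Mark Y when every compound of Y is also in X.
            if set(group_y) <= set(group_x):
                overlapping[y] = True
    return overlapping
```

Under this literal reading, two groups with identical membership would each be marked against the other, since the description states no check on whether group X is itself already marked. Whether the figures guard against that case is not stated in the text.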
Thus, for example, if a group X comprises compounds 1, 4, 7 and 8 and a group Y comprises compounds 1, 4 and 7, group Y is marked as an overlapping group, as its compounds are completely subsumed within group X. As another example, if a group X comprises compounds 1, 2 and 8, and a group Y comprises compounds 1, 2 and 9, group Y is not marked as an overlapping group, as it comprises a compound, compound 9, that is not also in group X.
As shown in Figure 8, upon executing the determine overlapping groups process 235, the pattern recognition oriented cluster process 200 executes a combine groups process 240. Generally, the combine groups process 240 combines one or more groups of compounds that have one or more compounds in common. For each group H of compounds, the combine groups process 240 checks whether one or more compounds in group H are also a member of another group K. If yes, the combine groups process 240 combines the group H and group K into one new group.
Referring to Figures 21A-21B, the combine groups process 240 loops H 850 times, once for each of the groups of compounds established in the group compounds process 225 of Figure 16, or the group compounds process 650 of Figures 19A-19C. When all of the groups H of compounds are looped 850 through, the combine groups process 240 ends 851.
The combine groups process 240 checks 852 for each group H whether the group is marked as an overlapping group. As previously discussed, in a presently preferred embodiment, groups are marked, or labeled, as overlapping in the determine overlapping groups process 235. If the current group H is 853 marked as an overlapping group, i.e., it is not to be processed anymore, then the combine groups process 240 loops 850 to the next group H.
If, however, the current group H is not 854 marked as an overlapping group, the combine groups process 240 loops K 855 times, once for each group that has not already been processed as an H group. For example, if there are ten groups of compounds and current group H is the second group, then K is eight, and the combine groups process 240 loops 855 eight times, through the third through tenth groups.
For each group K, the combine groups process 240 checks 856 whether the group is marked as an overlapping group. If it is 858, the combine groups process 240 loops 855 to the next group K.
If, however, the current group K is not 857 marked as an overlapping group, the combine groups process 240 loops J 859 times, once for each compound in group K. The combine groups process 240, for each compound J in group K, loops I 860 times, once for each compound in the initial group H. When the last I compound in group H is looped 860 through, the combine groups process 240 loops 859 to the next J compound in group K. Thus, the combine groups process 240 checks every compound in the H group against every compound in the K group.
The combine groups process 240 keeps a running summation, or total, 861 of correlation values for each group H, compound I - group K, compound J pair. For example if the current group H is comprised of compounds 2, 5 and 7 and the current group K is comprised of compounds 4 and 6, then the combine groups process sums the correlation values DATA[1][3], for the group H compound 2 - group K compound 4 compound pair, DATA[4][3], for the group H compound 5 - group K compound 4 compound pair, DATA[6][3], for the group H compound 7 - group K compound 4 compound pair, DATA[1][5], for the group H compound 2 - group K compound 6 compound pair, DATA[4][5], for the group H compound 5 - group K compound 6 compound pair, and DATA[6][5], for the group H compound 7 - group K compound 6 compound pair.
The combine groups process 240 also checks 862 if compound I of group H is the same as compound J of group K. If yes 863, the combine groups process 240 flags 865 compound I of group H as also a member of group K. In a presently preferred embodiment, the combine groups process 240 sets a flag 865, e.g. FLAG, to indicate that group H and group K overlap, e.g., FLAG is set equal to TRUE. The combine groups process 240 then loops 860 to the next compound I in group H. If compound I is not the same 864 as compound J of group K, the combine groups process 240 simply loops 860 to the next I compound in group H.
When all compounds I for group H have been looped 860 through for all 859 compounds J of group K, the combine groups process 240 generates a group similarity score for the group H - group K group pair. In a presently preferred embodiment, a group similarity score for a group H - group K group pair is generated from at least the correlation value for a group H compound - group K compound pair. Thus, in a presently preferred embodiment, the correlation value for a compound from group H and a compound from group K comprises the group similarity score for the group H - group K group pair.
In a more presently preferred embodiment, a group similarity score is a mean cumulative distance value, or average correlation value, for a pair of groups. In this more presently preferred embodiment, the combine groups process 240 generates 866 the mean cumulative distance value for all group H - group K compound pairs, and stores it in an array, e.g., AVG. In this more presently preferred embodiment, the AVG[H][K] value is generated from the running summation 861 of correlation values for each group H, compound I - group K, compound J pair; it is the average of all group H, compound I - group K, compound J correlation values. The mean cumulative distance value for a pair of groups serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value for a pair of groups, the more similar the objects in the two groups generally are.
For example, referring to Figure 22, group H is the second group and it is comprised of three compounds - compounds 2, 5 and 7. Group K is the fourth group and it is comprised of two compounds - compounds 4 and 6. Using zero-based indexing, AVG[1][3] is the average correlation value for all the compound pairs in [group H][group K]. Thus, the value of AVG[1][3] is shown in Equation 8.

AVG[1][3] = (DATA[1][3] + DATA[1][5] + DATA[4][3] + DATA[4][5] + DATA[6][3] + DATA[6][5]) / 6    (Equation 8)
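The average in Equation 8 can be computed with a small Python sketch. The function name is hypothetical, and the sketch assumes, as the indices in Equation 8 imply, that DATA can be read at entries such as DATA[4][3] where the row index exceeds the column index.

```python
def mean_cross_correlation(data, group_h, group_k):
    """Average of data[i][j] over every (compound I of H, compound J of K) pair."""
    total = 0.0
    for j in group_k:          # loop J: each compound in group K
        for i in group_h:      # loop I: each compound in group H
            total += data[i][j]
    return total / (len(group_h) * len(group_k))
```

For group H holding indices 1, 4 and 6 and group K holding indices 3 and 5, the six terms summed are exactly those of Equation 8, and the divisor is 3 x 2 = 6.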
The combine groups process 240 checks 867 whether FLAG is set TRUE, i.e., whether there is at least one compound in group H that is also in group K. If no 871 , the combine groups process 240 loops 855 to the next group K. If, however, FLAG is set TRUE 872, indicating there is at least one compound in group H that is also in group K, the combine groups process 240 writes, or otherwise stores or assigns, all of the compounds of group H that are not already a member of group K to group K. In a presently preferred embodiment, the combine groups process 240 loops I 868 times, once for each of the compounds in group H. For each compound I, the combine groups process 240 checks 869 whether it has been flagged as being in group K. If the Ith compound in group H is 873 flagged as being in group K, the combine groups process 240 loops 868 to the next compound I in group H. If compound I in group H is not 874 flagged as being in group K, compound I is written, or stored, 870 to group K. The combine groups process 240 then loops 868 to the next compound I.
When all compounds I in group H are looped 868 through, the combine groups process 240 marks, or labels, 876 group H as an overlapping group, i.e., it is no longer to be used. The combine groups process 240 then loops 850 to the next group H, and begins the process anew.
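The merge step of the combine groups process 240 can be sketched in Python. The sketch covers only the shared-compound test and the merge; it omits the running summation 861 and the AVG array. The function name, the list representation of the groups, and the assumption that the overlap flag is reset for each group H - group K pair are introduced here for illustration.

```python
def combine_groups(groups, overlapping):
    """Merge each unmarked group H into the first later group K it shares a compound with.

    Returns copies of the group lists and the overlap marks after merging.
    """
    groups = [list(g) for g in groups]
    overlapping = list(overlapping)
    for h in range(len(groups)):
        if overlapping[h]:
            continue
        for k in range(h + 1, len(groups)):
            if overlapping[k]:
                continue
            shared = set(groups[h]) & set(groups[k])
            if not shared:                 # FLAG stays FALSE: try the next K
                continue
            # Write the compounds of H that are not already in K to K.
            groups[k].extend(c for c in groups[h] if c not in shared)
            overlapping[h] = True          # H is no longer to be used
            break                          # move on to the next H
    return groups, overlapping
```

Because a merged group K is itself visited later as a group H, chains of groups that share compounds pairwise end up collected in the last group of the chain.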
As shown in Figure 8, after the combine groups process 240 in the pattern recognition oriented cluster process 200 is executed, an optimally link compounds in group process 245 is executed. Generally, the optimally link compounds in group process 245 optimally orders the compounds in each established group of compounds, based on the respective compound pairs' object similarity score. In a presently preferred embodiment, the optimally link compounds in group process 245 optimally orders the compounds in each group of compounds based on the respective compound pairs' correlation value.
The optimally link compounds in group process 245, referring to Figures 23A-23B, loops H 900 times, once for each of the groups of compounds established in the group compounds process 225 of Figure 16, or the group compounds process 650 of Figures 19A-19C. When all of the groups H of compounds are looped 900 through, the optimally link compounds in group process 245 ends 901.
The optimally link compounds in group process 245 checks 902 for each group H whether the group is marked as an overlapping group, i.e., the group is not to be used. In a presently preferred embodiment, groups are marked as overlapping groups in the determine overlapping groups process 235, as described with reference to Figure 20, and/or in the combine groups process 240, as described with reference to Figures 21A-21B.
If the current group H is 903 marked as an overlapping group, the optimally link compounds in group process 245 loops 900 to the next group H. If, however, the current group H is not 904 marked as an overlapping group, the optimally link compounds in group process 245 loops through all the compounds in group H and locates the two unique compounds with the largest correlation value.
For example, referring to Figure 24, assume exemplary group H comprises compounds C 926, G 927, J 928 and K 929. Referring to exemplary correlation value array DATA 925, the largest correlation value for a unique compound pair for the compounds of exemplary group H is the correlation value 930 for compounds G 927 and J 928, which is equal to 0.9; i.e., DATA[6][9] is equal to 0.9. The other correlation values for the other unique compound pairs in exemplary group H are less than 0.9; i.e., DATA[2][6], for the compound C 926 - compound G 927 compound pair, is 0.6; DATA[2][9], for the compound C 926 - compound J 928 compound pair, is equal to 0.5; DATA[2][10], for the compound C 926 - compound K 929 compound pair, is equal to 0.5; DATA[6][10], for the compound G 927 - compound K 929 compound pair, is equal to 0.6; and, DATA[9][10], for the compound J 928 - compound K 929 compound pair, is equal to 0.8. Once the largest correlation value for a unique compound pair in group H is located 905, the optimally link compounds in group process 245 sets 905 a first variable, e.g. MAX1 , to the compound row of the largest correlation value. For example, referring to Figure 24, MAX1 is set to six, which is the row compound index of the DATA array 925 corresponding to compound G 927. The optimally link compounds in group process 245 also sets 905 a second variable, e.g., MAX2, to the compound column of the largest correlation value. Referring to Figure 24, MAX2 is set to nine, which is the column compound index of the DATA array 925 corresponding to compound J 928.
The optimally link compounds in group process 245 also flags 906 the compound equal to MAX1 and compound equal to MAX2 as already linked for group H. Thus, in exemplary group H, compound G 927 and compound J 928 of group H are flagged as linked for group H.
The optimally link compounds in group process 245 also sets 906 the MAX1 compound as the current head of the link of compounds for group H, and the MAX2 compound as the current tail of the link of compounds for group H. In exemplary group H, compound G 927 is set as the current head of the link and compound J 928 is set as the current tail of the link. The optimally link compounds in group process 245 then checks 907 whether all the compounds in the current group H are flagged as linked for group H. If yes 908, the optimally link compounds in group process 245 loops 900 to the next group H.
If all the compounds in the current group H are not 909 flagged, the optimally link compounds in group process 245 loops I 910, once for each compound in group H that is not already linked, and locates the largest correlation value of the MAX1 - non-linked compounds in group H pairs. In a presently preferred embodiment, the largest correlation value for a MAX1 - non-linked compound in group H pair is set to a variable, e.g., MAXCORR.
As the optimally link compounds in group process 245 loops 910 through all compounds in group H that are not already linked, it also locates the largest correlation value of the MAX2 - non-linked compounds in group H pairs. In a presently preferred embodiment, the largest correlation value for a MAX2 - non-linked compound in group H pair is set to a variable, e.g., MINCORR.
For exemplary group H, referring again to Figure 24, MAX1 is set to the DATA array index for compound G 927 and MAX2 is set to the DATA array index for compound J 928. Compounds C 926 and K 929 of group H are not linked for the group yet. Thus, the optimally link compounds in group process 245 finds the largest correlation value between the MAX1 - compound C 926 and MAX1 - compound K 929 compound pairs, and stores it in MAXCORR. More specifically, for exemplary group H, the optimally link compounds in group process 245 checks the correlation value DATA[6][2], i.e., the correlation value for the MAX1 compound G 927 - compound C 926 pair, and the correlation value DATA[6][10], i.e., the correlation value for MAX1 compound G 927 - compound K 929 pair. In a presently preferred embodiment, if more than one MAX1 - non-linked compound in group H pair produces the same maximum correlation value, the optimally link compounds in group process 245 sets MAXCORR to the correlation value for the first MAX1 - non-linked compound pair. Thus, in exemplary DATA array 925, as DATA[6][2], i.e., 0.6, is equal to DATA[6][10], i.e., 0.6, MAXCORR is set to DATA[6][2] and the non-linked compound associated with MAXCORR is compound C 926. The correlation value DATA[6][6], i.e., the correlation value for the MAX1 compound G 927 - MAX1 compound G 927 pair, and the correlation value DATA[6][9], i.e., the correlation value for the MAX1 compound G 927 - MAX2 compound J 928, are not checked as both compounds G 927 and J 928 are already linked for group H.
As previously noted, for exemplary group H, MAX1 is set to the DATA array index for compound G 927 and MAX2 is set to the DATA array index for compound J 928. Compounds C 926 and K 929 of group H are not linked for the group yet. Thus, the optimally link compounds in group process 245 also finds the largest correlation value between the MAX2 - compound C 926 and MAX2 - compound K 929 compound pairs, and stores it in MINCORR. More specifically, for exemplary group H, the optimally link compounds in group process 245 checks the correlation value DATA[9][2], i.e., the correlation value for the MAX2 compound J 928 - compound C 926 pair, and the correlation value DATA[9][10], i.e., the correlation value for the MAX2 compound J 928 - compound K 929 pair. As DATA[9][10], i.e., 0.8, is greater than DATA[9][2], i.e., 0.5, MINCORR is set to DATA[9][10] and the non-linked compound associated with MINCORR is compound K 929. The correlation value DATA[9][6], i.e., the correlation value for the MAX2 compound J 928 - MAX1 compound G 927 pair, and the correlation value DATA[9][9], i.e., the correlation value for the MAX2 compound J 928 - MAX2 compound J 928 pair, are not checked as both compounds G 927 and J 928 are already linked for group H.
As previously discussed, the optimally link compounds in group process 245 locates 910 the maximum correlation value, e.g., MAXCORR, for the MAX1 compound - non-linked compound pairs. The optimally link compounds in group process 245 also locates 910 the maximum correlation value, e.g., MINCORR, for the MAX2 compound - non-linked compound pairs. Once both MAXCORR and MINCORR are located for the respective MAX1 and MAX2 compound rows of the correlation value array, e.g., DATA, the optimally link compounds in group process 245 checks 911 whether MAXCORR is greater than MINCORR.
If yes 914, the non-linked compound in group H associated with MAXCORR has a stronger correlation, or similarity, with the current head of the link of group H compounds than does the non-linked compound in group H associated with MINCORR have with the current tail of the link of group H compounds. Thus, the optimally link compounds in group H process 245 links 912 the non-linked compound associated with MAXCORR as the new head of the link of group H compounds. The optimally link compounds in group process 245 also flags 916 the new link head compound as linked for group H. The variable MAX1 is also set 917 to the DATA array index for the new link head compound. The optimally link compounds in group process 245 then checks 907 whether all the compounds in group H are now flagged as linked for group H.
As previously described, in exemplary group H the non-linked compound corresponding to the correlation value MAXCORR is compound C 926. If MAXCORR had been greater than MINCORR, compound C 926 would be linked to the preceding link head for group H, compound G 927. Compound C 926 would then be the new link head for group H and MAX1 would be set to the DATA array index for compound C 926. Further, compound C 926 would be flagged as linked for group H. However, in exemplary DATA array 925 of Figure 24, MAXCORR, i.e., 0.6, is not greater than MINCORR, i.e., 0.8.
If MAXCORR is not 915 greater than MINCORR, the non-linked compound in group H associated with MINCORR has a stronger correlation, or similarity, with the current tail of the link of group H compounds than does the non-linked compound in group H associated with MAXCORR have with the current head of the link of group H compounds. Thus, the optimally link compounds in group H process 245 links 913 the non-linked compound associated with MINCORR as the new tail of the link of group H compounds.
The optimally link compounds in group process 245 also flags 918 the new link tail compound as linked for group H. The variable MAX2 is also set 919 to the DATA array index for the new link tail compound. The optimally link compounds in group process 245 then checks 907 whether all the compounds in group H are now flagged as linked for group H.
As previously described, in exemplary group H the non-linked compound corresponding to the correlation value MINCORR is compound K 929. As MAXCORR is not greater than MINCORR, compound K 929 is linked to the current link tail for group H, compound J 928. Compound K 929 is the new link tail for group H and MAX2 is set to the DATA array index for compound K 929. Further, compound K 929 is flagged as linked for group H. Thus, at this time, the link for exemplary group H is shown in Equation 9.

compound G - compound J - compound K    (Equation 9)

In Equation 9, compound G 927 is the head of the link and compound K 929 is the tail of the link for group H.
In exemplary group H, there is only one remaining compound, compound C 926, to link into group H. As previously discussed, the optimally link compounds in group process 245 locates 910 the maximum correlation value, e.g., MAXCORR, for the MAX1 compound - non-linked compounds in group H pairs. The optimally link compounds in group process 245 also locates 910 the maximum correlation value, e.g., MINCORR, for the MAX2 compound - non-linked compounds in group H pairs. For exemplary group H, the current MAX1 compound is compound G 927 and the current MAX2 compound is compound K 929. The only non-linked compound remaining in group H is compound C 926. Referring to exemplary DATA array 925 of Figure 24, the optimally link compounds in group process 245 checks 910 the correlation value DATA[6][2], i.e., the correlation value for the MAX1 compound G 927 - compound C 926 pair. As the correlation value DATA[6][2] is 0.6, and there are no other MAX1 compound - non-linked compound correlation values to check, as there are no other compounds in group H to link, MAXCORR is set to 0.6.
The optimally link compounds in group process 245 also checks 910 the correlation value DATA[10][2], i.e., the correlation value for the MAX2 compound K 929 - compound C 926 pair. As the correlation value DATA[10][2] is 0.5, and there are no other MAX2 compound - non-linked compound correlation values to check, as there are no other compounds in group H to link, MINCORR is set to 0.5.
On locating 910 a new MAXCORR and a new MINCORR value, the optimally link compounds in group H process 245 checks 911 whether MAXCORR is greater than MINCORR. In exemplary group H, at this time, MAXCORR, i.e., 0.6, is greater than MINCORR, i.e., 0.5.
As MAXCORR is greater than MINCORR, the optimally link compounds in group process 245 links 912 the non-linked compound corresponding to MAXCORR, i.e., compound C 926, to the preceding link head for group H, compound G 927. Compound C 926 is the new link head for group H and compound C 926 is flagged 916 as linked for group H. Further, MAX1 is set 917 to the DATA array index for compound C 926. Thus, at this time, the link for exemplary group H is shown in Equation 10.

compound C - compound G - compound J - compound K    (Equation 10)
In Equation 10, compound C 926 is the head of the link and compound K 929 is the tail of the link for group H.
In exemplary group H, there are now no remaining compounds to link. Thus, the optimally link compounds in group process 245 loops 900 to the next group H. As previously described, when all groups H have been looped 900 through, the optimally link compounds in group process 245 is ended 901.
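The head-and-tail linking procedure walked through above can be sketched in Python. This is a minimal illustration under stated assumptions, not the flowchart of Figures 23A-23B: the function name and the list-based chain are hypothetical, and DATA is assumed to be readable at both DATA[a][b] and DATA[b][a], as the worked example reads entries such as DATA[6][2] and DATA[10][2].

```python
def link_compounds(data, group):
    """Order the compounds of one group into a chain by growing its head or tail."""
    members = list(group)
    if len(members) < 2:
        return members
    # Locate the unique pair with the largest correlation value (MAX1, MAX2).
    best = None
    for a in range(len(members)):
        for b in range(a + 1, len(members)):
            value = data[members[a]][members[b]]
            if best is None or value > best[0]:
                best = (value, members[a], members[b])
    _, head, tail = best
    chain = [head, tail]
    unlinked = [m for m in members if m not in (head, tail)]
    while unlinked:
        # max() returns the first maximal item, matching the first-pair tie rule.
        head_best = max(unlinked, key=lambda m: data[head][m])   # MAXCORR compound
        tail_best = max(unlinked, key=lambda m: data[tail][m])   # MINCORR compound
        if data[head][head_best] > data[tail][tail_best]:
            chain.insert(0, head_best)     # new head of the link; MAX1 updated
            head = head_best
            unlinked.remove(head_best)
        else:
            chain.append(tail_best)        # new tail of the link; MAX2 updated
            tail = tail_best
            unlinked.remove(tail_best)
    return chain
```

With the exemplary values of Figure 24 (indices 2, 6, 9 and 10 for compounds C, G, J and K), the sketch starts from the G - J pair, appends K at the tail, then places C at the head, giving the order of Equation 10.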
Referring to Figure 8, after the optimally link compounds in group process 245 is executed for the pattern recognition oriented cluster process 200, an optimally link groups process 250 is executed. Generally, the optimally link groups process 250 optimally orders the groups of compounds, based on the group similarity scores. In a presently preferred embodiment, the optimally link groups process 250 optimally orders the groups of compounds based on the mean cumulative distance values of the respective pairs of groups. As previously described, with reference to Figure 21A, an average correlation value, or mean cumulative distance value, e.g., AVG[group1][group2], is generated for each group pair, i.e., each two groups of the groups formed of the objects to be clustered, in the combine groups process 240. The mean cumulative distance value for a pair of groups serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value for a pair of groups, the more similar the objects in the two groups generally are.
Referring to Figure 26, in a presently preferred embodiment, the combine groups process 240 only generates mean cumulative distance values for the top half 1052 of the array of average correlation values for all group pairs. In a presently preferred embodiment, as shown in Figure 25, the optimally link groups process 250 first writes, or otherwise copies or stores, 1041 the mirror image of the top half of the average correlation value array, e.g., AVG, to the bottom half of the array.
For example, referring to Figure 26, the optimally link groups process 250 writes 1041 all the values in the top half 1052 of the AVG array 1050 to the respective entries in the bottom half 1051 of the AVG array 1050. In the example of Figure 26, the average correlation value in AVG[0][1] is written to AVG[1][0]. Likewise, the average correlation value in AVG[0][2] is written to AVG[2][0]; the average correlation value in AVG[0][3] is written to AVG[3][0]; and so on. The diagonal 1053 of the AVE array 1050 is generally not relevant, as it represents the average correlation value for one group, and not for a pair of groups.
In an alternative embodiment, the combine groups process 240, after generating the top half of the average correlation value array, e.g., AVG, writes, or otherwise copies, the mirror image of the top half of the AVG array, to the bottom half of the AVG array. In another alternative embodiment, the combine groups process 240, after generating a mean cumulative distance value for a pair of groups, writes the value to both the top half and the bottom half of the AVG array.
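The mirroring step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `mirror_upper_to_lower`, the list-of-lists layout and the sample values are hypothetical and are not taken from Figure 26.

```python
def mirror_upper_to_lower(avg):
    """Copy every entry above the diagonal to its mirror position below it.

    The array is modified in place; diagonal entries are left untouched,
    since they describe a single group rather than a pair of groups.
    """
    n = len(avg)
    for i in range(n):
        for j in range(i + 1, n):
            avg[j][i] = avg[i][j]
    return avg


# Hypothetical 3 x 3 array with only the top half filled in.
avg = [
    [0.0, 0.7, 0.4],
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0],
]
mirror_upper_to_lower(avg)
```

After the call, a lookup such as avg[2][0] returns the same value as avg[0][2], so later steps can index a group pair in either order.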
Referring again to Figure 25, once the optimally link groups process 250 copies, or otherwise writes or stores, 1041 the top half of the array of average correlation values to the bottom half, for all groups that have not been previously labeled as overlapping, the optimally link groups process 250 locates 1025 the two groups, e.g., group1 and group2, with the largest mean cumulative distance value. As previously described, in a presently preferred embodiment overlapping groups are identified and labeled in the determine overlapping groups process 235, as described with reference to Figure 20, and/or in the combine groups process 240, as described with reference to Figures 21A-21B. Groups marked as overlapping are not to be further processed.
In a presently preferred embodiment, once the two groups, e.g., group1 and group2, with the largest mean cumulative distance value are located 1025, e.g., AVE[group1][group2], the optimally link groups process 250 sets 1026 the head of the link of groups to group1 and the tail of the link of groups to group2. The optimally link groups process 250 also flags 1026 group1 and group2 as linked groups.
The optimally link groups process 250 then loops 1027 through all groups that are not already linked and are not flagged as overlapping. For each loop 1027, the optimally link groups process 250 locates 1028 the non-linked group I that, with the current head of the link group, or simply head group, has the largest mean cumulative distance value. In a presently preferred embodiment, the optimally link groups process 250 sets a variable, e.g., MAXCORR, to this largest mean cumulative distance value, e.g., AVE[head][I].
For each loop 1027, the optimally link groups process 250 also locates 1029 the non-linked group J, that with the current tail of the link group, or simply tail group, has the largest mean cumulative distance value. In a presently preferred embodiment, the optimally link groups process 250 sets a variable, e.g., MINCORR, to this largest mean cumulative distance value, e.g., AVE[tail][J].
The optimally link groups process 250, after setting MAXCORR and MINCORR, checks 1030 whether MAXCORR is greater than MINCORR. If yes 1033, the group I, that with the head group has the largest average correlation value, is more similar to the head group than the group J, that with the tail group has the largest average correlation value, is similar to the tail group. The optimally link groups process 250 sets 1031 the head of the link of groups to group I, i.e., the current head group is set equal to group I. Group I is the new current head group, and it is linked to the previous head group, group1. The optimally link groups process 250 also flags 1031 group I as a linked group.
If, however, the optimally link groups process 250 checks 1030 whether MAXCORR is greater than MINCORR and it is not 1034, the group J, that with the tail group has the largest average correlation value, is more similar to the tail group than the group I, that with the head group has the largest average correlation value, is similar to the head group. The optimally link groups process 250 sets 1032 the tail of the link of groups to group J, i.e., the current tail group is set equal to group J. Group J is the new current tail group, and the previous tail group, group2, is linked to group J. The optimally link groups process 250 also flags 1032 group J as a linked group.
Whether group I is linked 1031 as the new head group or group J is linked 1032 as the new tail group, the optimally link groups process 250 then checks 1035 whether all non-overlapping groups are linked. If yes 1036, the optimally link groups process 250 optimally links 1040 the compounds in all the linked groups, as further described below. The optimally link groups process 250 is then ended 1038. If, however, there are more groups to link 1037, the optimally link groups process 250 loops 1027 again through all the non-overlapping groups that are not already linked.
Exemplary AVE array 1050 of average correlation values, or mean cumulative distance values, for groups, as shown in Figure 26, stores the mean cumulative distance values for five groups of compounds. The bottom half 1051 of the AVE array 1050 is the mirror image of the top half 1052 of the array 1050. The diagonal 1053 of the AVE array 1050 is generally not relevant, as it represents the average correlation value for one group, and not a group pair. In the present example, none of the five groups represented in the AVE array 1050 are flagged as overlapping, i.e., non-usable, groups at this time.
Referring to Figures 25 and 26, the optimally link groups process 250 locates 1025 the two groups with the largest average correlation, or mean cumulative distance, value. Group A - group D have the largest mean cumulative distance value in AVE array 1050, i.e., AVE[0][3] is 0.9. The optimally link groups process 250 sets 1026 the head of the link of groups to the first group, group A, and the tail of the link of the groups to the second group, group D. At this time, the link of groups is as shown in Equation 11.
A - D Equation 11
In Equation 11, group A is the head of the link and group D is the tail of the link. Group A and group D are also flagged 1026 as linked.
The optimally link groups process 250 then loops 1027 through all non-overlapping groups that are not already linked, i.e., groups B, C and E. The optimally link groups process 250 locates 1028 the non-linked group that with the current head group has the largest mean cumulative distance value. In exemplary AVE array 1050, the non-linked group B with the head group A has the largest mean cumulative distance value, i.e., AVE[0][1] equals 0.7. The optimally link groups process 250 sets 1028 the variable MAXCORR equal to AVE[0][1], i.e., 0.7.
The optimally link groups process 250 also locates 1029 the group that with the current tail group has the largest mean cumulative distance value. In exemplary AVE array 1050, the non-linked group C with the tail group D has the largest mean cumulative distance value, i.e., AVE[3][2] equals 0.6. The optimally link groups process 250 sets 1029 the variable MINCORR equal to AVE[3][2], i.e., 0.6.
The optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is 1033, as MAXCORR is now equal to 0.7 and MINCORR is equal to 0.6, so the optimally link groups process 250 links 1031 the non-linked group B associated with MAXCORR as the new head group. The original head group, i.e., group A, is linked to the new head group B. At this time, the link of groups is as shown in Equation 12.
B - A - D Equation 12
In Equation 12, group B is the head of the link and group D remains the tail of the link.
Group B, the new head group, is also flagged 1031 as linked. The optimally link groups process 250 then checks 1035 whether all groups have been linked. They have not 1037; groups C and E remain to be linked.
Thus, the optimally link groups process 250 once again loops 1027 through all non-overlapping, non-linked groups, i.e., groups C and E.
The optimally link groups process 250 locates 1028 the group that with the current head group, i.e., group B, has the largest mean cumulative distance value. In exemplary AVE array 1050, the non-linked group C with the head group B has the largest mean cumulative distance value, i.e., AVE[1][2] equals 0.8. The optimally link groups process 250 sets 1028 the variable MAXCORR to 0.8.
The optimally link groups process 250 also locates 1029 the group that with the current tail group, i.e., group D, has the largest mean cumulative distance value. In exemplary AVE array 1050, the non-linked group C with the tail group D has the largest mean cumulative distance value, i.e., AVE[3][2] equals 0.6. The optimally link groups process 250 sets 1029 the variable MINCORR to 0.6.
The optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is 1033, as MAXCORR is now equal to 0.8 and MINCORR is equal to 0.6, so the optimally link groups process 250 links 1031 the non-linked group C associated with MAXCORR as the new head group. The original head group, i.e., group B, is linked to the new head group C. At this time, the link of groups is as shown in Equation 13.
C - B - A - D Equation 13
In Equation 13, group C is the head of the link and group D remains the tail of the link.
Group C is also flagged 1031 as linked. The optimally link groups process 250 then checks 1035 whether all groups have been linked. They have not 1037; group E still remains to be linked. Thus, the optimally link groups process 250 once again loops 1027 through all non-linked, non-overlapping groups, i.e., group E.
As group E is the only remaining non-linked, non-overlapping group, the variable MAXCORR is set to the average correlation value for the head group C - group E pair, i.e., AVE[2][4] equal to 0.5. Also, as group E is the only remaining non-linked, non-overlapping group, the variable MINCORR is set to the average correlation value for the tail group D - group E pair, i.e., AVE[3][4] equal to 0.5. The optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is not 1034, as MAXCORR is equal to MINCORR at this time, so the non-linked group associated with MINCORR, i.e., group E, is linked 1032 as the new tail group. The final link of groups is as shown in Equation 14.
C - B - A - D - E Equation 14
Group E is also flagged 1032 as linked. The optimally link groups process 250 then checks 1035 whether all groups have been linked. They have 1036, so the optimally link groups process 250 optimally links 1040 the compounds in all the linked groups, as further described below. The optimally link groups process 250 then ends 1038.
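The head/tail linking walked through above can be sketched as a short routine. This is a minimal sketch, not the patented implementation: the function name `link_groups` and the `skip` parameter are hypothetical, and the array below contains only six values quoted in the text (AVE[0][3], AVE[0][1], AVE[3][2], AVE[1][2], AVE[2][4] and AVE[3][4]); the remaining four off-diagonal values are assumed fillers chosen to be consistent with the narrative, since Figure 26 is not reproduced here.

```python
def link_groups(ave, skip=frozenset()):
    """Greedy head/tail ordering of group indices.

    `ave` is a symmetric array of group-pair scores; `skip` holds the
    indices of groups flagged as overlapping, which are left out.
    """
    usable = [g for g in range(len(ave)) if g not in skip]
    if len(usable) < 2:
        return usable
    # Seed the link with the pair having the largest score.
    best = None
    for a in usable:
        for b in usable:
            if a < b and (best is None or ave[a][b] > ave[best[0]][best[1]]):
                best = (a, b)
    chain = [best[0], best[1]]
    remaining = [g for g in usable if g not in chain]
    while remaining:
        # max() keeps the first candidate on ties.
        i = max(remaining, key=lambda g: ave[chain[0]][g])   # best partner for head
        j = max(remaining, key=lambda g: ave[chain[-1]][g])  # best partner for tail
        maxcorr, mincorr = ave[chain[0]][i], ave[chain[-1]][j]
        if maxcorr > mincorr:
            chain.insert(0, i)   # group I becomes the new head
            remaining.remove(i)
        else:
            chain.append(j)      # group J becomes the new tail
            remaining.remove(j)
    return chain


NAMES = ["A", "B", "C", "D", "E"]
# Values 0.4 (A-C), 0.3 (A-E), 0.4 (B-D) and 0.3 (B-E) are assumed fillers.
AVE = [
    [0.0, 0.7, 0.4, 0.9, 0.3],
    [0.7, 0.0, 0.8, 0.4, 0.3],
    [0.4, 0.8, 0.0, 0.6, 0.5],
    [0.9, 0.4, 0.6, 0.0, 0.5],
    [0.3, 0.3, 0.5, 0.5, 0.0],
]
order = [NAMES[g] for g in link_groups(AVE)]
print(" - ".join(order))
```

Under these assumed values the routine passes through the intermediate links of Equations 11 to 13 and ends with the order of Equation 14. Because the comparison is a strict "greater than", the tie between MAXCORR and MINCORR in the last step sends group E to the tail.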
Referring to Figures 27A-27E, a more detailed presently preferred embodiment of an optimally link groups process 2020 first writes 1079, or otherwise copies, the mirror image of the top half of the average correlation, or mean cumulative distance, value array, e.g., AVG, to the bottom half of the array.
The optimally link groups process 2020 then loops 1080 through all non-overlapping compound groups, i.e., all groups that are not marked as overlapping, and locates the two groups with the largest average correlation value, or mean cumulative distance value.
Upon locating 1080 the group I - J pair with the largest mean cumulative distance value, e.g., AVE[I][J], the optimally link groups process 2020 sets 1080 a variable, e.g., TOK1, to group I and sets 1080 a second variable, e.g., TOK2, to group J. The TOK1 group is also set 1080 as the original head of the link of groups and the TOK2 group is set 1080 as the original tail of the link of groups. The optimally link groups process 2020 also flags 1081 the TOK1 group and the TOK2 group as linked.
The optimally link groups process 2020 then loops 1082 through all non-overlapping, non-linked groups I. The optimally link groups process 2020 checks 1083 whether all groups have been linked. If no 1086, the optimally link groups process 2020 locates 1085 the TOK1 head group - non-linked group I pair with the largest mean cumulative distance value, e.g., the largest value of AVE[TOK1][I]. The optimally link groups process 2020 sets 1085 a variable, e.g., MAXCORR, to the largest value of AVE[TOK1][I].
In a presently preferred embodiment, if more than one TOK1 head group - non-linked group I pair has the same largest mean cumulative distance value, the optimally link groups process 2020 uses the first I group as the group corresponding to MAXCORR. For example, if the mean cumulative distance value for the TOK1 - group four pair is 0.9 and the mean cumulative distance value for the TOK1 - group seven pair is also 0.9, and 0.9 is the largest mean cumulative distance value for all TOK1 group - non-linked group I pairs, then the optimally link groups process 2020 sets MAXCORR equal to 0.9 and associates the non-linked fourth group with MAXCORR.
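The tie-breaking rule described above, in which the first group found with the largest value is kept, can be sketched as follows. The function name `first_best` and the list-of-pairs input are hypothetical, and the 0.4 value for group two is an assumed filler; the two 0.9 values for groups four and seven come from the example in the text.

```python
def first_best(scores):
    """Return (group, score) for the largest score in `scores`.

    `scores` is an iterable of (group, score) pairs in group order.
    The strict '>' test means a later group with an equal score never
    replaces the earlier one.
    """
    best_group, best_score = None, float("-inf")
    for group, score in scores:
        if score > best_score:
            best_group, best_score = group, score
    return best_group, best_score


maxcorr_group, maxcorr = first_best([(2, 0.4), (4, 0.9), (7, 0.9)])
print(maxcorr_group, maxcorr)
```

Here group four is associated with MAXCORR even though group seven has the same value, because group four is encountered first.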
The optimally link groups process 2020 also locates 1087 the TOK2 tail group - non-linked group I pair with the largest mean cumulative distance value, e.g., the largest value of AVE[TOK2][I]. The optimally link groups process 2020 sets 1087 a variable, e.g., MINCORR, to the largest value of AVE[TOK2][I].
In a presently preferred embodiment, if more than one TOK2 tail group - non-linked group I pair has the same largest mean cumulative distance value, the optimally link groups process 2020 uses, or otherwise associates, the first I group as the group corresponding to MINCORR.
Once the optimally link groups process 2020 locates a current MAXCORR and a current MINCORR, it checks 1090 whether MAXCORR is greater than MINCORR. If yes 1088, the optimally link groups process 2020 sets 1091 the non-linked group I associated with MAXCORR as the new head group. The previous head group, TOK1, is linked to the new head group I. The new head group I is also flagged 1093 as linked. The variable TOK1 is also set 1095 to the new head group I. The optimally link groups process 2020 then loops 1082 once again through all the non-overlapping, non-linked groups.
If upon checking 1090 whether MAXCORR is greater than MINCORR, it is not 1089, the optimally link groups process 2020 sets 1092 the non-linked group I associated with MINCORR as the new tail group. The previous tail group, TOK2, is linked to the new tail group I. The new tail group I is also flagged 1094 as linked. The variable TOK2 is also set 1096 to the new tail group I. The optimally link groups process 2020 then loops 1082 once again through all the non-overlapping, non-linked groups.
Once all non-overlapping groups are linked 1084, the optimally link groups process 2020 loops H 1097 times, once for each group in the link of groups. In a presently preferred embodiment, the optimally link groups process 2020 loops 1097 from the head group to the tail group of the link of groups. Once all the groups H have been looped 1097 through, the optimally link groups process 2020 is ended 2021.
The optimally link groups process 2020 first checks 1098 whether the current group H is the head group of the link of groups. If it is 2000, the optimally link groups process 2020 sets 2001 a first variable, e.g., A, to the correlation value of the first compound in the head group and the first compound in the second group in the link of groups. With DATA as the array of correlation values for all compound pairs to be clustered, A is set to DATA[1st compound in Head group][1st compound in 2nd group]. The optimally link groups process 2020 also sets 2001 a second variable, e.g., B, to the correlation value of the first compound in the head group and the last compound in the second group in the link of groups. Thus, B is set to DATA[1st compound in Head group][Last compound in 2nd group].
The optimally link groups process 2020 also sets 2001 a third variable, e.g., C, to the correlation value of the last compound in the head group and the first compound in the second group in the link of groups. Thus, C is set to DATA[Last compound in Head group][1st compound in 2nd group].
The optimally link groups process 2020 also sets 2001 a fourth variable, e.g., D, to the correlation value of the last compound in the head group and the last compound in the second group in the link of groups. Thus, D is set to DATA[Last compound in Head group][Last compound in 2nd group].
The optimally link groups process 2020 then checks 2002 whether C or D is greater than or equal to A and B. Thus, the optimally link groups process 2020 checks 2002 whether either of the correlation values of the head group tail compound - second linked group head and tail compound pairs are greater than or equal to both the correlation values of the head group head compound -second group head and tail compound pairs. If the head group tail compound generates the same or larger correlation value 2003, the compounds in the head group are stored in a table, or list, e.g., LIST, of compounds, in head to tail order. This is because the head group tail compound is more similar to, and, thus, should be linked closest to, the compounds in the second group.
If, however, the head group head compound generates the larger correlation value 2004, the compounds in the head group are stored in LIST in tail to head order. This is because the head group head compound is more similar to, and, thus, should be linked closest to, the compounds in the second group.
Thus, if C or D is greater than or equal to A and B 2003, the optimally link groups process 2020 loops 2005 through all the compounds in the head group, storing 2007 the compounds in LIST, in head to tail order. Once all the compounds in the head group are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
If neither C nor D is greater than or equal to both A and B 2004, the optimally link groups process 2020 loops 2006 through all the compounds in the head group, storing 2008 the compounds in LIST, in tail to head order. Once all the compounds in the head group are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
For every other group H but the head group in the link of groups 1099, the optimally link groups process 2020 checks 2009 whether the correlation value of the last compound in LIST, i.e., the last compound in the previous group H stored in LIST, and the first compound, i.e., the head compound, in current group H is greater than or equal to the correlation value of the last compound stored in LIST and the last compound, i.e., tail compound, in current group H. With DATA as the array of correlation values for all compound pairs to be clustered, the optimally link groups process 2020 checks whether DATA[Last compound in LIST][Head compound in current group H] is greater than or equal to DATA[Last compound in LIST][Tail compound in current group H].
If the head compound in current group H with the last compound stored in LIST has the same or larger correlation value 2011, the compounds in group H are stored in LIST in head to tail order. This is because the head compound in group H is more similar to, and, thus, should be linked closest to, the last compound stored in LIST for the previous group H. Thus, the optimally link groups process 2020 loops 2013 through all the compounds in current group H, storing 2015 the compounds in LIST in head to tail order. Once all the compounds in the current group H are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
If, however, the tail compound in current group H with the last compound stored in LIST has the larger correlation value 2010, the compounds in group H are stored in LIST in tail to head order. This is because the tail compound in group H is more similar to, and, thus, should be linked closest to, the last compound stored in LIST for the previous group H. Thus, the optimally link groups process 2020 loops 2012 through all the compounds in current group H, storing 2014 the compounds in LIST in tail to head order. Once all the compounds in the current group H are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
Once all the compounds in all the groups are stored in an optimal order in LIST, the optimally link groups process 2020 ends 2021.
Referring to Figure 8, in a presently preferred embodiment, once the groups of compounds to be clustered are optimally linked by the optimally link groups process 250 of Figure 25, or optimally link groups process 2020 of Figures 27A-27E, the pattern recognition oriented cluster process 200 executes a generate pattern matrix process 255. Generally, the generate pattern matrix process 255 generates and outputs, e.g., to an output file or a computer terminal screen, a pattern matrix of the grouped compounds of the set of compounds to be clustered.
In a presently preferred embodiment, the generate pattern matrix process 255 creates a two-dimensional color-shaded cluster graph representative of the correlation values of the clustered, i.e., grouped, compounds. A presently preferred embodiment of an exemplary cluster graph 2050, as shown in Figure 28, shows thirty-seven compounds grouped into six groups.
In a presently preferred embodiment of a generate pattern matrix process 255, as shown in Figure 29, the titles "class", "compound" and "Code" are printed 2025 in a first output file text row, or line. The generate pattern matrix process 255 then loops X 2026 times, once for each of the set of compounds to be clustered. In a presently preferred embodiment, the compounds X are looped through from the first compound stored in the table LIST to the last compound stored in the table LIST. As previously described, in a presently preferred embodiment, compounds are stored in the table LIST in optimal order in the optimally link groups process 2020. For each compound X, the generate pattern matrix process 255 prints 2027 the respective compound code value in a unique column in the first output file text row, after the "Code" title. For each compound X, the generate pattern matrix process 255 also prints 2028 the respective compound class description, compound name and compound code value for compound X, under the respective titles. One compound class description, name and code value is printed per output file text row, or line, beginning with a second output file text line. In a presently preferred embodiment, the respective compound class description for compound X was previously stored in table DESC[X] when the input file for the set of compounds to be clustered was processed by the input data process 350. In a presently preferred embodiment, the respective compound name for compound X was previously stored in table NAME[X] when the input file for the set of compounds to be clustered was processed by the input data process 350. In a presently preferred embodiment, the respective compound code for compound X was previously stored in table CODE[X] when the input file for the set of compounds to be clustered was processed by the input data process 350.
As previously discussed with reference to the input data process 350 shown in Figures 10A and 10B, in a presently preferred embodiment, if the user executing the pattern recognition oriented cluster process 200 uses an input data file with previously generated correlation values, only compound name values for the table NAME[X] are input. In this case, the generate pattern matrix process 255 only outputs the title "NAME" to an output file, in a first text line. Correspondingly, the generate pattern matrix process 255, for each compound in the set of compounds to be clustered, prints only the respective compound name to the output file, one name per output file text line, or row, beginning with a second output file text line. Each compound name is correspondingly printed in a respective column in a first output text line, or row, of the output file.
In yet an alternative embodiment, if a compound mode and compound structure for each compound in the set of compounds to be clustered is input from the input file, along with the respective compound name and code, the generate pattern matrix process 255 prints the titles "mode", "structure", "compound" and "code" in a first text line, or row, in an output file. In this alternative embodiment, the generate pattern matrix process 255 prints the respective compound code values for each compound in the set of compounds to be clustered, one per column, in the first text line in the output file. The generate pattern matrix process 255 also prints the respective compound mode, if there is one, compound structure, if there is one, and compound name for each compound in the set of compounds to be clustered, under the appropriate titles. The mode, structure and name information for one compound is printed per text line, or row, of the output file, beginning with a second text line. Exemplary cluster graph 2055, shown in Figure 30, is an example of a cluster graph generated in this alternative embodiment.
Once the compound information is printed for each respective compound in the set of compounds to be clustered, the generate pattern matrix process 255 loops I times 2029, once for each compound in the set of compounds to be clustered. The generate pattern matrix process 255 then loops J times 2030, once for each compound in the set of compounds to be clustered. For each loop I 2029, the generate pattern matrix process 255 loops J 2030 times equal to the number of compounds in the set of compounds to be clustered.
The generate pattern matrix process 255 determines 2031 which color group, e.g., COLOR[X], the correlation value for the compound I - compound J pair, e.g., DATA[I][J], is within. In a presently preferred embodiment, the color groups, i.e., COLOR[X], and their respective correlation value ranges were previously established during the set grouping parameters process 220. A presently preferred embodiment default number of color groups is six and a presently preferred embodiment default correlation value range for each color group is shown in Table 2.
Once the generate pattern matrix process 255 determines 2031 which COLOR[X] group the DATA[I][J] correlation value is within, it prints 2032 to an output file a block of color corresponding to the respective COLOR[X] group. In a presently preferred embodiment, the generate pattern matrix process 255 also prints 2032 to an output file, in the respective block of color, the value of COLOR[X], i.e., the lower value of the range of correlation values for respective COLOR[X]. The block of color corresponding to COLOR[X] and the respective value of COLOR[X] are printed in the row compound I - column compound J of the output file. Once the generate pattern matrix process 255 loops 2030 through all compounds J, it loops 2029 to the next compound I. In a presently preferred embodiment, once the generate pattern matrix process 255 loops through all compounds I, it prints 2033 to the output file "COLOR_CODE" and "CORRELATION" titles.
The generate pattern matrix process 255 then loops X 2034 times, once for each color group COLOR. For each COLOR[X] group, the generate pattern matrix process 255 prints 2036 to the output file a block of color corresponding to the respective COLOR[X] group. In a presently preferred embodiment, the generate pattern matrix process 255 also prints 2036 to the output file, in the respective block of color, the value of COLOR[X], i.e., the lower value of the range of correlation values for respective COLOR[X]. The block of color and respective value of COLOR[X] are printed under the title "COLOR_CODE".
The generate pattern matrix process 255 also prints 2037 to the output file the correlation range corresponding to the respective COLOR[X] group. The correlation range for the COLOR[X] group is printed under the title "CORRELATION".
In a presently preferred embodiment, only one block of color and corresponding value of COLOR[X], as well as the respective correlation range, are printed per output file text line, or row. Thus, the generate pattern matrix process 255 generates 2038 a carriage return before looping 2034 to the next COLOR[X] group. When all the COLOR[X] groups are looped 2034 through, the generate pattern matrix process 255 is ended 2035.
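The mapping from correlation values to color groups can be sketched as follows. This is a minimal sketch only: the six lower bounds in `LOWER_BOUNDS` are hypothetical placeholders and are not the default ranges of Table 2, the function names are assumptions, and the sketch returns the lower-bound value for each cell rather than printing blocks of color.

```python
from bisect import bisect_right

# Hypothetical lower bounds for six color groups, in ascending order.
# These are placeholders, not the default ranges of Table 2.
LOWER_BOUNDS = [-1.0, 0.5, 0.6, 0.7, 0.8, 0.9]


def color_group(value):
    """Index X of the color group whose range contains `value`."""
    return max(0, bisect_right(LOWER_BOUNDS, value) - 1)


def pattern_matrix(order, data):
    """Lower-bound value for every row compound I / column compound J cell.

    `order` is the optimally ordered list of compound indices (LIST) and
    `data` is the symmetric array of correlation values.
    """
    return [[LOWER_BOUNDS[color_group(data[i][j])] for j in order] for i in order]


# Hypothetical correlation values for three compounds.
DATA = [
    [1.0, 0.85, 0.2],
    [0.85, 1.0, 0.65],
    [0.2, 0.65, 1.0],
]
matrix = pattern_matrix([1, 0, 2], DATA)
for row in matrix:
    print("\t".join(f"{v:.1f}" for v in row))
```

Because rows and columns follow the same optimal order, highly correlated compounds end up adjacent, and the cells with the highest lower bounds cluster along the diagonal of the printed matrix.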
Referring to Figure 8, in a presently preferred embodiment, when the generate pattern matrix process 255 is ended 2035, the pattern recognition oriented cluster process 200 is also ended 260.
Conclusion
Thus, it will be appreciated that the methods and apparatus of the present invention provide versatile tools for evaluating sets of random objects and clustering them into groups based on predefined characteristic(s) of the objects. In particular, the random objects are chemical compounds and the predefined characteristic is similarity of biological activity. While certain embodiments and examples have been used to describe the present invention, many variations are possible and are within the spirit and scope of the invention. Such variations will be apparent to those skilled in the art upon inspection of the specification, drawings and claims herein. Other embodiments are within the following claims.

Claims

What is claimed is:
1. A method of pattern recognition oriented clustering for grouping two or more objects of a set of objects, comprising: generating an object similarity score for each pair of objects in said set of objects to be grouped, wherein each said pair of objects comprises two objects from said set of objects; assigning two or more of said objects in said set of objects into two or more groups of objects, wherein the criteria for assigning an object to a group comprises said object similarity scores; ordering said objects of a group of objects, wherein the criteria for ordering said objects comprises said object similarity scores; generating a group similarity score for two or more pairs of groups of objects, wherein each said pair of groups of objects comprises two groups; and, ordering two or more groups of objects, wherein the criteria for ordering said groups comprises said group similarity scores.
2. The method of pattern recognition oriented clustering of claim 1, wherein said set of objects comprises a set of compounds and each said compound of said set of compounds comprises one or more compound values, wherein each said compound value comprises an interaction of said compound in an environment.
3. The method of pattern recognition oriented clustering of claim 2, wherein said environment is selected from the group consisting of a mutant strain from a set of one or more mutant strains and a gene expression from a set of one or more gene expressions.
4. The method of pattern recognition oriented clustering of claim 2, wherein said object similarity score for a pair of compounds in said set of compounds to be grouped comprises a correlation value, said correlation value comprising said compound values of the first compound of said pair of compounds and further comprising said compound values of the second compound of said pair of compounds.
5. The method of pattern recognition oriented clustering of claim 4, wherein a pair of groups of compounds comprises a first group and a second group, and wherein said group similarity score for a pair of groups of compounds comprises an object similarity score for a pair of compounds, wherein said pair of compounds comprises a compound of the first group of said pair of groups of compounds and a compound of the second group of said pair of groups of compounds.
6. The method of pattern recognition oriented clustering of claim 1, further comprising: determining a first group of objects comprises all of the objects comprising a second group of objects; and, not ordering said second group of objects.
7. The method of pattern recognition oriented clustering of claim 1, further comprising: determining a first group of objects comprises a first object; determining a second group of objects comprises said first object; grouping all objects comprising said first group of objects, except said first object, into said second group of objects; and, not ordering said first group of objects.
8. The method of pattern recognition oriented clustering of claim 1, further comprising generating a pattern matrix for the ordered groups of objects.
9. A method of pattern recognition oriented clustering for grouping two or more objects of a set of objects comprising: generating an object similarity score for each pair of objects in said set of objects to be grouped, wherein each said pair of objects comprises two objects from said set of objects; selecting a first object and a second object from said set of objects, said first and said second object comprising a first pair of objects, wherein said first pair of objects comprises an object similarity score that is greater than a first group selection parameter value; initiating a group of objects comprising said first object and said second object; selecting a third object from said set of objects, wherein said third object and said first object comprises a second pair of objects and said third object and said second object comprises a third pair of objects, and further wherein said second pair of objects or said third pair of objects comprises an object similarity score that is greater than a second group selection parameter value; and, including said third object in said group of objects if said second pair of objects comprises an object similarity score that is greater than a third group selection parameter value and if said third pair of objects comprises an object similarity score that is greater than said third group selection parameter value.
10. The method of pattern recognition oriented clustering of claim 9, wherein said first group selection parameter comprises an initial correlation group limit parameter, said second group selection parameter comprises an add-in correlation group limit parameter and said third group selection parameter comprises a minimum correlation group limit parameter.
11. The method of pattern recognition oriented clustering of claim 9, further comprising: generating an average object similarity score for said first pair of objects, said second pair of objects and said third pair of objects; and, not including said third object in said group of objects if said average object similarity score is less than a minimum average correlation group limit parameter.
12. The method of pattern recognition oriented clustering of claim 9, wherein said set of objects comprises a set of compounds and each said compound of said set of compounds comprises one or more compound values, wherein each said compound value comprises an interaction of said compound in an environment.
13. The method of pattern recognition oriented clustering of claim 12, wherein said object similarity score for a pair of compounds in said set of compounds to be grouped comprises a correlation value, and said correlation value comprises said compound values of the first compound of said pair of compounds and said compound values of the second compound of said pair of compounds.
14. A method of grouping a set of objects, wherein each object of said set of objects comprises one or more raw data values, said method comprising: generating an object similarity score for a pair of objects in said set of objects, wherein said pair of objects comprises two objects from said set of objects and wherein said object similarity score comprises a first raw data value for the first object of said pair of objects and a first raw data value for the second object of said pair of objects; and, assigning said pair of objects to a first group, wherein the criteria for assignment comprises said object similarity score.
15. The method of grouping a set of objects of claim 14, wherein said pair of objects assigned to said first group comprises a first object and a second object, said method further comprising: generating an object similarity score for each pair of objects in said set of objects; selecting a third object from said set of objects, wherein said third object and said first object comprises a second pair of objects and said third object and said second object comprises a third pair of objects, and further wherein said second pair of objects or said third pair of objects comprises an object similarity score that is greater than an add-in correlation group limit parameter; and, including said third object in said first group of objects if said second pair of objects comprises an object similarity score that is greater than a minimum correlation group limit parameter and if said third pair of objects comprises an object similarity score that is greater than said minimum correlation group limit parameter.
16. The method of grouping a set of objects of claim 15, further comprising ordering said first object and said second object and said third object, wherein the criteria for ordering said first and second and third objects comprises said object similarity score for said first pair of objects, said object similarity score for said second pair of objects and said object similarity score for said third pair of objects.
17. The method of grouping a set of objects of claim 14, further comprising: generating an object similarity score for each pair of objects in said set of objects; and, assigning two or more objects in said set of objects into one or more groups, wherein the criteria for assigning an object to a group comprises said object similarity scores.
18. The method of grouping a set of objects of claim 17, wherein said pair of objects assigned to said first group comprises a first object and a second object, and wherein two or more objects have been assigned to two or more groups, said method further comprising: determining a second group of objects comprises said first object; and, assigning all objects comprising said first group that are not assigned to said second group to said second group.
19. The method of grouping a set of objects of claim 14, wherein said set of objects comprises a set of compounds and said raw data values of a compound each comprise an interaction of said compound in an environment.
20. The method of grouping a set of objects of claim 19, wherein said environment is selected from the group consisting of a mutant strain from a set of one or more mutant strains and a gene expression from a set of one or more gene expressions.
21. A method for generating a pattern matrix for a set of groups of objects, wherein each said group of objects of said set of groups of objects comprises one or more pairs of objects, and wherein each pair of objects of a group of objects comprises an object similarity score and each pair of objects of a group of objects comprises two objects, said method for generating a pattern matrix comprising: dividing a first output line into two or more output field columns; outputting an identification of a first object of a pair of objects of said set of groups of objects in a first output field column of said first output line; outputting said identification of said first object in a second output line; outputting an identification of a second object of said pair of objects in a second output field column of said first output line; outputting said identification of said second object in a third output line; and, outputting a block of a color shade in said second output field column of said second output line, wherein the criteria for said color shade comprises said object similarity score for said pair of objects comprising said first object and said second object.
22. The method for generating a pattern matrix for a set of groups of objects of claim 21, wherein said set of objects comprises a set of compounds and each said compound of said set of compounds comprises one or more compound values, wherein each said compound value comprises an interaction of said compound in an environment.
23. The method of generating a pattern matrix for a set of groups of objects of claim 22, wherein said environment is selected from the group consisting of a mutant strain from a set of one or more mutant strains and a gene expression from a set of one or more gene expressions.
24. The method of generating a pattern matrix for a set of groups of objects of claim 21, wherein said block of color shade comprises a block of a shade of gray.
25. The method of generating a pattern matrix for a set of groups of objects of claim 21, further comprising outputting a second block of said color shade in said first output field column of said third output line.
26. A machine readable medium having stored thereon a program for causing a computer to: generate an object similarity score for a pair of objects in a set of objects, wherein each object of said set of objects comprises one or more raw data values, said pair of objects comprises two objects from said set of objects and said object similarity score comprises a first raw data value for the first object of said pair of objects and a first raw data value for the second object of said pair of objects; and, assign said pair of objects to a first group, wherein the criteria for assignment comprises said object similarity score.
27. The machine readable medium of claim 26, wherein said pair of objects assigned to said first group comprises a first object and a second object, said program further causing a computer to: generate an object similarity score for each pair of objects in said set of objects; select a third object from said set of objects, wherein said third object and said first object comprises a second pair of objects and said third object and said second object comprises a third pair of objects, and further wherein said second pair of objects or said third pair of objects comprises an object similarity score that is greater than an add-in correlation group limit parameter; and, include said third object in said first group of objects if said second pair of objects comprises an object similarity score that is greater than a minimum correlation group limit parameter and if said third pair of objects comprises an object similarity score that is greater than said minimum correlation group limit parameter.
28. The machine readable medium of claim 27, wherein said program further causes a computer to order said first object and said second object and said third object, wherein the criteria for ordering said first and second and third objects comprises said object similarity score for said first pair of objects, said object similarity score for said second pair of objects and said object similarity score for said third pair of objects.
29. The machine readable medium of claim 26, wherein said program further causes a computer to: generate an object similarity score for each pair of objects in said set of objects; and, assign two or more objects in said set of objects into one or more groups, wherein the criteria for assigning an object to a group comprises said object similarity scores.
30. The machine readable medium of claim 29, wherein said pair of objects assigned to said first group comprises a first object and a second object, and wherein two or more objects have been assigned to two or more groups, said program further causing a computer to: determine a second group of objects comprises said first object; and, assign all objects comprising said first group that are not assigned to said second group to said second group.
31. A computer system comprising: a computer and, a data storage device, said data storage device comprising a computer program residing thereon for causing the computer to group two or more objects of a set of objects generating an object similarity score for each pair of objects in said set of objects, wherein each object of said set of objects comprises one or more raw data values, a pair of objects comprises two objects from said set of objects and said object similarity score comprises a first raw data value for the first object of a pair of objects and a first raw data value for the second object of said pair of objects; and, assigning a pair of objects to a first group, wherein the criteria for assignment comprises said object similarity score.
32. The computer system of claim 31, wherein said pair of objects assigned to said first group comprises a first object and a second object, and wherein said computer program further causes said computer to: select a third object from said set of objects, wherein said third object and said first object comprises a second pair of objects and said third object and said second object comprises a third pair of objects, and further wherein said second pair of objects or said third pair of objects comprises an object similarity score that is greater than an add-in correlation group limit parameter; and, include said third object in said first group of objects if said second pair of objects comprises an object similarity score that is greater than a minimum correlation group limit parameter and if said third pair of objects comprises an object similarity score that is greater than said minimum correlation group limit parameter.
33. The computer system of claim 32, wherein said computer program further causes said computer to order said first object and said second object and said third object, wherein the criteria for ordering said first and second and third objects comprises said object similarity score for said first pair of objects, said object similarity score for said second pair of objects and said object similarity score for said third pair of objects.
34. The computer system of claim 31, wherein said computer program further causes said computer to assign two or more objects in said set of objects into two or more groups, wherein the criteria for assigning an object to a group comprises said object similarity scores.
35. The computer system of claim 34, wherein said pair of objects assigned to said first group comprises a first object and a second object, and wherein said computer program further causes said computer to: determine a second group of objects comprises said first object; and, assign all objects comprising said first group that are not assigned to said second group to said second group.
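For orientation only, the group-building steps recited in claims 9 to 11 can be sketched as follows. This is an illustrative reading, not the claimed implementation. It assumes that the object similarity score is a Pearson correlation of compound values (claims 4 and 13 say only "a correlation value"), that the seed pair is the highest-scoring pair (the claims require only a pair above the first parameter), and that the three-object tests extend to every current group member. All function names, parameter names, and data values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def similarity_scores(profiles):
    """Object similarity score for each pair of objects, keyed by the unordered pair."""
    return {frozenset(pair): pearson(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(profiles, 2)}


def grow_group(profiles, initial, add_in, minimum, min_avg):
    """Seed a group with one pair, then admit further objects that pass the limits."""
    scores = similarity_scores(profiles)

    def score(a, b):
        return scores[frozenset((a, b))]

    # Seed pair: must exceed the initial correlation group limit.
    seed = max(scores, key=scores.get)
    if scores[seed] <= initial:
        return []
    group = sorted(seed)

    for obj in profiles:
        if obj in group:
            continue
        to_members = [score(obj, member) for member in group]
        # Candidate only if at least one pairing exceeds the add-in limit.
        if max(to_members) <= add_in:
            continue
        # Included only if every pairing exceeds the minimum limit.
        if min(to_members) <= minimum:
            continue
        # Rejected if the average pair score of the enlarged group is too low.
        enlarged = group + [obj]
        pair_scores = [score(a, b) for a, b in combinations(enlarged, 2)]
        if sum(pair_scores) / len(pair_scores) < min_avg:
            continue
        group.append(obj)
    return group


# Hypothetical compound values, one value per environment (e.g. per mutant strain).
PROFILES = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [1, 2, 3, 5],
    "D": [4, 3, 2, 1],
}
group = grow_group(PROFILES, initial=0.9, add_in=0.8, minimum=0.7, min_avg=0.75)
print(group)
```

In this toy data set, compound D is anti-correlated with the others and fails the add-in test, while C correlates strongly with both seed members and is admitted.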
PCT/US1999/030175 1998-12-18 1999-12-17 Pattern recognition oriented cluster analysis WO2000036489A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU25902/00A AU2590200A (en) 1998-12-18 1999-12-17 Pattern recognition oriented cluster analysis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11278698P 1998-12-18 1998-12-18
US60/112,786 1998-12-18

Publications (3)

Publication Number Publication Date
WO2000036489A2 (en) 2000-06-22
WO2000036489A3 (en) 2000-11-30
WO2000036489A9 (en) 2001-05-10

Family

ID=22345844

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/030175 WO2000036489A2 (en) 1998-12-18 1999-12-17 Pattern recognition oriented cluster analysis

Country Status (2)

Country Link
AU (1) AU2590200A (en)
WO (1) WO2000036489A2 (en)

Also Published As

Publication number Publication date
WO2000036489A2 (en) 2000-06-22
AU2590200A (en) 2000-07-03
WO2000036489A3 (en) 2000-11-30

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C2

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/38-38/38, DRAWINGS, REPLACED BY NEW PAGES 1/34-34/34; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase