WO2000036489A9 - Pattern recognition oriented cluster analysis - Google Patents
Pattern recognition oriented cluster analysis
- Publication number
- WO2000036489A9 (PCT/US1999/030175)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- objects
- group
- compound
- pair
- compounds
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Definitions
- the field of this invention relates to chemistry, biochemistry, statistical analysis and computer science.
- it relates to methods and apparatus for discerning characteristic similarities among groups of objects; more particularly, similarities among chemical compounds based on biological activity.
- Chemical space is a term often used to describe the universe of all possible chemical compounds.
- the size of theoretical "chemical space" is enormous.
- the number of potential compounds comprising chemical space is far larger than the number of compounds that can realistically be hoped to be synthesized.
- Compounding the situation is the fact that small changes in molecules can often have profound effects on biological activity. Thus, finding that a particular molecule located at a particular point in chemical space is not biologically active does not necessarily mean that the entire region of chemical space around that molecule is likewise inactive.
- Within the universe of chemical space lies the sub-universe of "protein space." Again, in theory, protein space is also very large. However, practically speaking, the size of protein space that is used by living organisms is relatively modest. Genomic studies suggest that, across the five kingdoms of living organisms (animals, plants, fungi, bacteria and algae), there are only between twenty and fifty protein "super families," i.e., families of proteins containing amino acid sequences with significant similarities to sequences in other proteins. The implication of this is that the regions of sequence similarity should have similar folding patterns and, therefore, are likely to be evolutionarily related.
- COMPARE is a computer program developed for the National Cancer Institute. COMPARE correlates similarities among growth inhibition patterns of chemical compounds. Compounds are first tested against sixty human cancer cell lines in disease-specific panels. The growth inhibition patterns of the tested compounds are then subjected to COMPARE analysis, which generates predictions about similarities, usually with regard to mechanism of action or molecular target. See, e.g., K. D. Paull, et al., Journal of the National Cancer
- In COMPARE, a "seed compound," i.e., a compound of known activity, is tested against the disease-specific panels, and a "fingerprint" of the compound's activity, known as a "mean graph," is generated. The compounds to be compared are then also tested against the disease-specific panels and their respective mean graphs generated. COMPARE then ranks the compounds according to their mechanism of action or molecular target by calculation of Pearson correlation coefficients. COMPARE treats only one compound/target protein pair at a time and only compares raw data, i.e., a value corresponding to the activity of one compound against a panel of biological targets, with similar raw data of another compound.
- Cluster analysis involves procedures for taking a group of objects and "clustering" them in subgroups which differ from other subgroups in meaningful ways.
- For chemical compounds, a "meaningful way" that subgroups can be formed from a large group of chemical compounds is by their performance properties such as, without limitation, mode of action against biological targets.
- The table 50 in Figure 1 is comprised of three objects, A 10, B 20 and C 30. Each object has interacted with each of four operators, M1 5, M2 15, M3 25 and M4 35. In table 50, each object-operator pair has an associated raw data value 40, representing the interaction between each object and each operator.
- By simply mapping each of the object-operator pair raw data values, as shown in Figure 2, the graph 60 for object A 10 appears similar to the graph 70 for object B 20, and dissimilar to the graph 80 for object C 30.
- object A 10 and B 20 are similar, and object C 30 is dissimilar.
- object A 10 and object C 30 appear nearly identical when their object-operator pairs are graphed using similarity scaled axes.
- Graph A 90 and graph C 94 have a virtually identical shape in Figure 3, while graph B 92 is clearly dissimilar to both graphs A 90 and C 94.
- a false similarity is determined between objects A 10 and B 20, while a true similarity between object A 10 and object C 30 is left undiscovered.
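The contrast between raw-value comparison and shape comparison can be sketched in a few lines of Python. This is only an illustration: the values for objects A, B and C are hypothetical (the table of Figure 1 is not reproduced here), and the choice of Euclidean distance and Pearson correlation as the two measures is an assumption made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a flat profile as uncorrelated
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two raw-value profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical raw values against operators M1..M4 (not taken from Figure 1).
A = [10.0, 12.0, 9.0, 11.0]
B = [11.0, 9.0, 12.0, 10.0]       # close to A in magnitude, opposite in shape
C = [100.0, 120.0, 90.0, 110.0]   # far from A in magnitude, same shape

print(euclidean(A, B) < euclidean(A, C))  # raw values suggest A resembles B
print(pearson(A, C) > pearson(A, B))      # profile shape suggests A resembles C
```

With these made-up numbers, the raw distance pairs A with B while the correlation pairs A with C, which is the kind of false similarity and missed true similarity described above.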
- the results are generally displayed as dendrograms, or tree graphs.
- dendrograms are prepared by using auxiliary methods such as Ward's method and a Euclidean distance metric. See, for example, Bates, J. Cancer Res. Clin. Oncol., 1995, 121:495, 497, Figure 2.
- Figure 4 is an example of a simple dendrogram 100.
- the resultant analysis is limited by the fact that the information contained within the dendrogram 100 is not presented in a manner optimally conducive to direct, i.e., visual, human interpretation of the interrelationships among the objects.
- One problem with this hierarchical method is when to stop, i.e., how to determine when clusters revealing optimal interrelationships between compounds have been generated.
- with the agglomerative approach, eventually all data points are joined into one large cluster; with the divisive approach, all data points eventually end up as individual point clusters.
- Another problem specific to the hierarchical method is that once an object is allocated to a particular cluster, that allocation, whether or not it is an optimal allocation, is irrevocable; that is, once an object joins a cluster it is never removed from that cluster and fused, or otherwise joined, with objects belonging to some other cluster.
- the partitioning method of cluster analysis does not require that allocation of an object to a cluster be irrevocable. That is, objects may be reallocated if their initial assignments are found to be inaccurate, or otherwise unsatisfactory.
- the partitioning method suffers from the general requirement that the number of final clusters be known and specified in advance; this limitation is a serious shortcoming if the goal is in fact to determine how many clusters exist in a particular group.
- what is needed is a method that can quickly, accurately and dynamically extract from raw data the maximum amount of information available regarding the interrelationships among all objects in a particular group, including classification of the objects into correct subgroups, while at the same time minimizing the misclassification of objects into wrong subgroups. Such a method should permit reallocation of objects from cluster to cluster, as the partitioning technique does and the hierarchical technique does not, without requiring foreknowledge of the final number of clusters, which the partitioning technique requires and the hierarchical technique does not. It would be further advantageous if the method is fully user interactive, scalable, flexible and robust. The present invention provides such a method.
- the present invention relates to a method for evaluating sets of random objects and clustering, or otherwise grouping, them into groups based on predefined characteristic(s) of the objects.
- the invention relates to a method and apparatus for evaluating a group of chemical compounds and clustering them into sub-groups based on similarity of biological activity.
- bacterial mutant strains are employed to evaluate a group of chemical compounds. The raw data obtained is assessed by the pattern recognition cluster analysis methods and apparatus of this invention to reveal compounds having similar biological activity.
- a method for evaluating the similarity of biological activity in a group of chemical compounds using a panel of organisms having disparate gene expressions is yet another presently preferred embodiment of this invention.
- the ability of the chemical compounds to affect the expression of the different products of the various genes is detected and provides a measure of the similarity of biological activity of the compounds within the group being tested.
- by "group of compounds" is meant as few as two compounds to as many as 10^9 compounds.
- by "similar biological activity" is meant compounds having a similar pattern of activity against a panel of biological targets, such as, for example and without limitation, a panel of proteins.
- a similar pattern of activity may be manifested simply as a plus or minus; i.e., either a chemical has an effect against a protein in the panel or it does not.
- two compounds display a "similar biological activity” if they are plus against the same proteins in a panel of proteins and minus against the same proteins in the panel.
- a similar pattern of activity may also be a more complex similarity relating to such things as, without limitation, the biochemical target of the effect, the manifestation of the effect or the amount of the effect.
- an effect may be manifested by, again without limitation, a change in cell phenotype or a change in the ability of a protein to perform its biological function.
- "Bacterial mutant strain" or "mutant strain" refers to a strain of bacteria in which biochemical activity has been modified such that the bacteria exhibit a diminished or an enhanced level of activity with regard to a selected parameter when compared to normal bacteria of the same species. Examples of such parameters are, without limitation, the ability of bacteria to grow at different temperatures, at different pHs, in the presence of different nutrients, etc.
- a change in the level of activity in the presence of a chemical being tested indicates that the chemical is affecting the biomolecule either directly or indirectly; i.e., by interacting with the biomolecule itself or by interacting with another molecule on which the biomolecule relies to perform its function.
- by "gene expression" is meant a process by which a living organism manufactures chemical products under the direction of a gene.
- Gene expression can either be a wild type, i.e., the manufacture of a chemical produced by the organism in its natural state, or it may be engineered.
- by "engineered" is meant that the genome of the organism is altered such that a gene which expresses a non-natural (for that species) chemical is incorporated into the genome. Examples, without limitation, of genes which may be engineered into an organism's genome and which express chemicals that are readily detectable are the lux gene, which expresses the enzyme luciferase, and the cat gene, which expresses the enzyme chloramphenicol acetyl transferase.
- gene expression also refers to an environment containing organisms which harbor the genes and which express selected chemicals. While growth inhibition of mutant strains and gene expression are presently preferred embodiments of this invention, it is understood that numerous other indications of similar biological activity including, but not limited to, other biochemical assays, other whole cell assays and the like are within the spirit and scope of this invention.
- two or more objects of a set of objects are grouped into one or more groups, in a presently preferred embodiment, an object similarity score is generated for each pair of objects in the set of objects to be grouped, or clustered. Two or more objects are then assigned, or grouped, into one or more groups of objects.
- the criteria for assignment of an object to a particular group is the object similarity scores generated for the pairs of objects to be clustered.
- the objects of an established group are ordered.
- the criteria for ordering the objects of a group are the object similarity scores for the pairs of objects of the group.
- a group similarity score is generated for each pair of groups of objects.
- the groups of objects are then ordered, based on the group similarity scores.
- a pattern matrix is generated for a set of groups of objects.
- the generated pattern matrix provides a visual representation of the grouping and similarity, or relative similarity, of the grouped objects represented in the respective matrix.
- a general object of the invention is to provide a method and apparatus for optimally clustering, or grouping, a group of objects.
- the grouping is based on the objects' similarity scores with each other.
- the invention provides a method and apparatus for optimally clustering a group of compounds, based on the similarity of the compounds' interactions with various bacterial mutant strains or gene expressions.
- a further general object of the invention is to provide a method and apparatus for displaying the results of a clustering of groups of objects.
- a pattern matrix is generated which provides a visual representation of both the grouping and the similarity, or relative similarity, of the grouped objects represented in the respective matrix.
- the methods and apparatus disclosed are in fact multi-dimensional. That is, clustering of objects can be performed with relation to more than one environment. This can be accomplished using a three-dimensional analysis, in which the pattern matrix generated will likewise be three-dimensional.
- the X-axis of a plot, or pattern matrix could be chemical compounds
- the Y-axis could be chemical compounds
- the Z-axis could be respective correlation values.
- the clusters of objects could be visualized as peaks wherein the strength of correlation would be indicated by the height of the peak.
- Table 1 depicts a presently preferred embodiment of groups of correlation ranges.
- Table 2 depicts a presently preferred embodiment of a default correlation value range and a default lower correlation value for each of a default number of color groups.
- Figure 1 depicts an exemplary object-operator table.
- Figure 2 is a representative graph of the object-operator interrelationship from the table of Figure 1, without any correlation or similarity measurement scaling.
- Figure 3 depicts representative graphs of the object-operator interrelationships from the table of Figure 1 with similarity measurement scaling.
- Figure 4 depicts an exemplary dendrogram of a resultant cluster analysis.
- Figure 5 depicts a presently preferred embodiment of a pattern recognition oriented cluster method.
- Figure 6 depicts a presently preferred embodiment of a cluster process.
- Figure 7 depicts a presently preferred embodiment exemplary pattern matrix output.
- Figure 8 depicts a presently preferred embodiment pattern recognition oriented cluster method processing flow.
- Figure 9 depicts a presently preferred embodiment input data processing flow.
- Figures 10A and 10B depict a presently preferred embodiment input data processing flow.
- Figure 11 depicts an exemplary temporary array of raw data for compound - mutant strain pairs.
- Figure 12 depicts an exemplary DATA array of correlation values for n compounds.
- Figure 13 depicts a presently preferred embodiment calculate correlation value processing flow.
- Figure 14 depicts a presently preferred embodiment determine correlation distribution processing flow.
- Figure 15 depicts a presently preferred embodiment set grouping parameters processing flow.
- Figure 16 depicts a presently preferred embodiment group compounds processing flow.
- Figures 17A and 17B depict exemplary correlation value tables.
- Figure 18 depicts an exemplary correlation value table.
- Figures 19A, 19B and 19C depict a presently preferred embodiment group compounds processing flow.
- Figure 20 depicts a presently preferred embodiment determine overlapping groups processing flow.
- Figures 21A and 21B depict a presently preferred embodiment combine groups processing flow.
- Figure 22 depicts an exemplary array of correlation values for seven compounds.
- Figures 23A and 23B depict a presently preferred embodiment optimally link compounds in group processing flow.
- Figure 24 depicts an exemplary correlation value array for a group of twelve compounds.
- Figure 25 depicts a presently preferred embodiment optimally link groups processing flow.
- Figure 26 depicts an exemplary array of mean correlation distance values for five groups of compounds.
- Figures 27A, 27B, 27C, 27D and 27E depict a presently preferred embodiment optimally link groups processing flow.
- Figure 28 depicts a presently preferred embodiment exemplary pattern matrix output.
- Figure 29 depicts a presently preferred embodiment generate pattern matrix processing flow.
- Figure 30 depicts an alternative embodiment exemplary pattern matrix output.
- variable names are provided for various entities used in the described processes. The variable names are thereafter used to describe the respective entity. It will be apparent, however, to one skilled in the art, that the invention may be practiced with other variable names and/or other complementary entities, without departing from the spirit and scope of the invention.
- in a pattern recognition oriented cluster method 110, as shown in Figure 5, raw data is gathered 112 on a set of objects to be clustered.
- the objects to be clustered are chemical compounds, and each raw data value, or compound value, is an interaction value of the respective compound in an environment.
- an environment is a mutant strain and each raw data value of a compound is an interaction value between the compound and a mutant strain of a set of mutant strains.
- a further presently preferred embodiment of this invention is that the environment is a gene expression.
- a similarity measure, or object similarity score, 113 is then generated for each of the object pairs that can be constructed from the set of objects to be clustered.
- a correlation value, or coefficient, 114 for a respective object pair is generated from the raw data points, or compound values, for each respective object of the object pair; i.e., from the objects' interaction values with a set of mutant strains.
- the correlation value 114 of an object pair represents the similarity of the interactions of the objects of the pair with a set of mutant strains or gene expressions.
- the correlation value 114 of one object pair is then compared to the correlation values 114 of every other object pair as a similarity value, or coefficient, 116 of the objects comprising the object pairs.
- the higher the correlation value 114, which represents a similarity value 116, of an object pair the more similar the objects of the object pair are deemed to be.
- the objects, e.g., compounds, of the set of objects are clustered, or grouped, 118, using a combination of clustering processes, or techniques, 117.
- a method of dynamic linkage is used to create groups of similar objects.
- the groups are then analyzed, to discard those that overlap with other, larger groups, comprising the same objects.
- the groups are then merged, in order that one particular object is a member of only one group.
- Each group is then optimized, so that the objects in the group are ordered according to their relative similarity to each other.
- all of the groups are optimized, so that each group is ordered, in relation to every other group, according to the relative similarity of the objects comprising the respective groups.
- a process for cluster display 119 is then performed.
- a pattern matrix is generated and output 120, for displaying the resultant groupings of the set of objects to be clustered.
- the pattern matrix provides a visual representation of the grouping, similarity, and relative similarity of the objects in a set of objects to be clustered.
- the resultant pattern matrix comprises all the objects in all the optimized groups arranged along both an X-axis and a Y-axis. The X-axis and Y-axis are themselves comprised of the objects of the set of objects to be clustered.
- different colors, or shades of color, corresponding to different respective correlation ranges are established in a group of color codes, to enhance the appearance of the resultant output matrix.
- the corresponding correlation value, i.e., similarity measure, of each object pair, for the set of objects to be clustered, is mapped into a respective color code group, and a block of the respective color is output to the pattern matrix.
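The mapping from correlation values to color code groups can be sketched as a lookup against the lower bound of each range. The bounds and color names here are hypothetical placeholders (the default ranges of Tables 1 and 2 are not reproduced in this text), and `color_for` and `color_matrix` are illustrative names rather than names used in the described process.

```python
from bisect import bisect_right

# Hypothetical lower correlation bound for each color group, in ascending order.
LOWER_BOUNDS = [-1.0, 0.0, 0.3, 0.6, 0.8]
COLORS = ["white", "light gray", "yellow", "orange", "red"]

def color_for(correlation):
    """Return the color of the group whose correlation range contains the value."""
    index = bisect_right(LOWER_BOUNDS, correlation) - 1
    return COLORS[max(index, 0)]  # values below the first bound use the first color

def color_matrix(order, data):
    """Build rows of color names, with objects on both axes in display order."""
    return [[color_for(data[i][j]) for j in order] for i in order]

# Two objects with correlation 0.7, displayed in the order (1, 0).
data = [[1.0, 0.7],
        [0.7, 1.0]]
print(color_matrix([1, 0], data))
# -> [['red', 'orange'], ['orange', 'red']]
```

Because the same object order is used on both axes, blocks of the strongest color gather along the diagonal when similar objects are adjacent in the ordering.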
- a method for the dynamic linkage of compounds into groups 155 comprises producing all the initial groups of compounds that meet either default or user-defined criteria. Groups with shared compounds are then merged 160, in order that each object is in only one group; i.e., the objects in each of the established groups of objects are mutually exclusive.
- a method of intra-group optimization 165 then optimizes the order of the objects within each group.
- the ordering is accomplished using a nearest neighbor mapping process 170.
- the two objects in a group, i.e., the first object and the second object, that are most similar in the group, i.e., whose correlation value, or similarity measure, is the highest, are located and linked together.
- One of the two linked objects, e.g., the first object, is designated the head of the link of objects.
- the other of the two linked objects, e.g., the second object, is designated the tail of the link of objects.
- the object, i.e., a third object, of the group, if there is one, that is most similar to either one of the first two objects, i.e., a nearest neighbor object, is located. If the third object is more similar to the first object, it is linked next to the first object; i.e., it becomes the new head of the link of objects for the group. In other words, if the correlation value for the first and third objects of the group is larger, or greater, than the correlation value for the second and third objects of the group, the third object is linked next to the first object. Otherwise, if the third object is more similar to the second object, it is linked next to the second object; i.e., it becomes the new tail of the link of objects for the group. In other words, if the correlation value for the second and third objects of the group is larger than the correlation value for the first and third objects of the group, the third object is linked next to the second object.
- a fourth object of the group, if there is one, is located that is most similar to either of the ends, i.e., the head or the tail object, of the existing link of objects of the group.
- the fourth object is then linked to the end object in the group link that it is more similar to.
- the intra-group optimization process 165 proceeds through all the objects of a respective group, until all the objects are linked for the group.
- the resultant link of objects of a group consists of an ordering of the objects that reflects their general relative similarity to one another.
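The head-and-tail linking just described can be sketched as follows. This is a minimal illustration under stated assumptions: objects are integer indices into a symmetric correlation array `data`, ties are broken by whichever candidate is found first, and the name `link_objects` is hypothetical.

```python
from collections import deque

def link_objects(members, data):
    """Order the members of one group by nearest neighbor linking at the head or tail."""
    if len(members) < 2:
        return list(members)
    # Seed the link with the most similar pair in the group.
    pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    first, second = max(pairs, key=lambda p: data[p[0]][p[1]])
    chain = deque([first, second])          # chain[0] is the head, chain[-1] the tail
    remaining = [m for m in members if m not in (first, second)]
    while remaining:
        head, tail = chain[0], chain[-1]
        # Nearest neighbor: the unlinked object most similar to either end.
        best = max(remaining, key=lambda m: max(data[m][head], data[m][tail]))
        if data[best][head] > data[best][tail]:
            chain.appendleft(best)          # becomes the new head
        else:
            chain.append(best)              # becomes the new tail
        remaining.remove(best)
    return list(chain)

# Hypothetical correlation values for four objects.
data = [
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.6, 0.1],
    [0.2, 0.6, 1.0, 0.3],
    [0.5, 0.1, 0.3, 1.0],
]
print(link_objects([0, 1, 2, 3], data))
# -> [3, 0, 1, 2]
```

In the example, objects 0 and 1 seed the link, object 2 joins at the tail next to object 1, and object 3 joins at the head next to object 0.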
- a method of inter-group optimization 175 optimizes the ordering of the groups of objects.
- a group similarity score is generated for each pair of groups.
- the group similarity score is a mean cumulative distance value, or average correlation value, for a pair of groups. The mean cumulative distance value for a group pair serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value, the more similar the objects in the two groups generally are.
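One way to read the group similarity score is as the average of the correlation values over every cross-group object pair. The sketch below follows that reading, which is an assumption, since the exact computation is defined by the figures rather than this text; `group_score` and `group_score_matrix` are hypothetical names.

```python
def group_score(group_a, group_b, data):
    """Average correlation over all pairs with one object from each group."""
    total = sum(data[a][b] for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))

def group_score_matrix(groups, data):
    """Symmetric table of group similarity scores; the diagonal is left at 0.0 and unused."""
    n = len(groups)
    scores = [[0.0] * n for _ in range(n)]
    for g in range(n):
        for h in range(g + 1, n):
            scores[g][h] = scores[h][g] = group_score(groups[g], groups[h], data)
    return scores

# Hypothetical correlation values for four objects, split into three groups.
data = [
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.6, 0.1],
    [0.2, 0.6, 1.0, 0.3],
    [0.5, 0.1, 0.3, 1.0],
]
scores = group_score_matrix([[0, 1], [2], [3]], data)
print(round(scores[0][1], 3))  # mean of 0.2 and 0.6
```

A table of this form can then be used for groups in the same way that the object correlation array is used for the objects within a group.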
- the ordering of the groups is then accomplished using a nearest neighbor mapping process 180 based on the respective mean cumulative distance values.
- the two groups, i.e., a first group and a second group, that are most similar, i.e., that have the largest mean cumulative distance value, are located and linked together.
- One of the two linked groups, e.g., the first group, is designated the head of the link of groups.
- the other of the two linked groups, e.g., the second group, is designated the tail of the link of groups.
- the group, i.e., a third group, if there is one, that is most similar to either the first or second group, i.e., a nearest neighbor group, is then located. If the third group and the first group are more similar, i.e., have the higher mean cumulative distance value, the third group is linked next to the first group; i.e., the third group becomes the new head of the link of groups. Otherwise, if the third group and the second group are more similar, i.e., have the higher mean cumulative distance value, the third group is linked next to the second group; i.e., the third group becomes the new tail of the link of groups.
- a fourth group, if there is one, that is most similar to one of the end groups, i.e., the head or the tail group, of the existing link of groups is located.
- the fourth group is then linked to the end group in the link of groups that it is most similar to, i.e., the end group with which it has the higher mean correlation distance value.
- the inter-group optimization process 175 proceeds through all the groups, until all the groups are linked.
- the resultant link of groups consists of an ordering of the groups that generally reflects their relative similarity to one another.
- a pattern recognition oriented cluster process 200 is executed by a computer, i.e., a computer or other processing device or entity.
- the computer program comprising the pattern recognition oriented cluster process 200 is stored, or otherwise resides, on a data storage device, for example, but not limited to, Read Only Memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), floppy disk, flexible disk, magnetic tape, CD-ROMs, punch cards or paper tape.
- an input data process 210 is executed upon starting 205 the pattern recognition oriented cluster process 200.
- the input data process 210 inputs the data relating to the set of objects to be clustered from an input file.
- If the input data relating to the set of objects to be clustered is raw data, i.e., a correlation value, or similarity measure, has not been generated for the object pairs in the set of objects to be clustered, then the input data process 210 also generates the correlation value for each object pair. Upon inputting all relevant data, the pattern recognition oriented cluster process 200 then executes a determine correlation distribution process 215. Generally, the determine correlation distribution process 215 generates a distribution of the correlation values, i.e., similarity measures, for all object pairs in the set of objects to be clustered. The pattern recognition oriented cluster process 200 thereafter executes a set grouping parameters process 220.
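A distribution of correlation values over all object pairs can be sketched as a count per fixed-width bin. The bin width of 0.1 and the name `correlation_distribution` are assumptions made for illustration; the processing flow of Figure 14 may tabulate the values differently.

```python
from collections import Counter
from math import floor

def correlation_distribution(data, bin_width=0.1):
    """Count object pairs per correlation bin, keyed by each bin's lower bound."""
    counts = Counter()
    n = len(data)
    for i in range(n):
        for j in range(i + 1, n):       # each unordered pair is counted once
            # Round before flooring so that 0.3 / 0.1 lands in bin 3 rather than bin 2.
            counts[floor(round(data[i][j] / bin_width, 9))] += 1
    return {round(k * bin_width, 9): counts[k] for k in sorted(counts)}

# Hypothetical correlation values for three objects.
data = [
    [1.0, 0.85, 0.82],
    [0.85, 1.0, 0.15],
    [0.82, 0.15, 1.0],
]
print(correlation_distribution(data))
# -> {0.1: 1, 0.8: 2}
```

A tabulation like this gives a user some basis for choosing grouping limits before the grouping parameters are set.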
- the set grouping parameters process 220 allows the user of the pattern recognition oriented cluster process 200 to choose whether to use default, pre-established, grouping and pattern matrix generation parameters, or input their own.
- the parameters that a user may specify for clustering i.e., cluster parameters, include, but are not limited to, an initial correlation group limit parameter, an add-in correlation group limit parameter, a minimum correlation group limit parameter and a minimum average correlation group limit parameter.
- the parameters that a user may specify for pattern matrix generation include, but are not limited to, the number of color groups to be used in the output pattern matrix and the correlation ranges for each of the respective color groups.
- the pattern recognition oriented cluster process 200 also executes a group compounds process 225.
- the group compounds process 225 groups the objects in the set of objects to be clustered into one or more groups, based on default, or, alternatively, user-specified, cluster parameters. Once all the groups that may be established for the set of objects to be clustered are so established by the group compounds process 225, the pattern recognition oriented cluster process 200 executes a determine overlapping groups process 235. Generally, the determine overlapping groups process 235 determines if there are any established groups whose objects are all contained in at least one other group. If so, the determine overlapping groups process 235 marks the group whose objects are all contained in at least one other group as an overlapping group. In a presently preferred embodiment, the pattern recognition oriented cluster process 200 does no further processing of any and all the groups marked as overlapping.
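The overlap check can be sketched as a subset test between every pair of groups. The handling of two identical groups (the later one is marked and the earlier one kept) is an assumption added so that duplicates do not eliminate each other, and `mark_overlapping` is a hypothetical name.

```python
def mark_overlapping(groups):
    """Return the indices of groups whose objects are all contained in another group."""
    sets = [set(g) for g in groups]
    overlapping = []
    for i, gi in enumerate(sets):
        for j, gj in enumerate(sets):
            if i == j:
                continue
            # Proper subset of another group, or a later duplicate of an identical group.
            if gi < gj or (gi == gj and j < i):
                overlapping.append(i)
                break
    return overlapping

groups = [[1, 2, 3], [2, 3], [4, 5], [1, 2, 3]]
print(mark_overlapping(groups))
# -> [1, 3]
```

Groups whose indices are returned would simply be skipped by the later steps, matching the statement that no further processing is done on groups marked as overlapping.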
- the pattern recognition oriented cluster process 200 executes a combine groups process 240.
- the combine groups process 240 combines one or more groups that have one or more objects in common.
- the objects in each of the remaining groups are mutually exclusive, i.e., no one object is in more than one group.
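Combining groups until no object is shared can be sketched as repeated set union. This shows only the end condition stated above (mutually exclusive groups); the combine groups process of Figures 21A and 21B may apply additional criteria, such as the correlation limits, before merging. The name `combine_groups` is hypothetical.

```python
def combine_groups(groups):
    """Merge groups that share any object until all groups are mutually exclusive."""
    merged = []                      # invariant: the sets in `merged` are disjoint
    for group in groups:
        current = set(group)
        kept = []
        for existing in merged:
            if existing & current:   # shares at least one object: absorb it
                current |= existing
            else:
                kept.append(existing)
        kept.append(current)
        merged = kept
    return [sorted(s) for s in merged]

print(combine_groups([[1, 2], [3, 4], [2, 3], [5]]))
# -> [[1, 2, 3, 4], [5]]
```

In the example, the third group shares an object with each of the first two, so all three collapse into one group while the last group is left alone.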
- the pattern recognition oriented cluster process 200 also executes an optimally link compounds in group process 245. Generally, the optimally link compounds in group process 245 optimally orders the objects in each group, based on the respective object pairs' correlation values, or similarity measures. The pattern recognition oriented cluster process 200 thereafter executes an optimally link groups process 250. Generally, the optimally link groups process 250 optimally orders the groups, based on group similarity scores generated for each pair of groups. In a presently preferred embodiment, a group similarity score is a mean cumulative distance value for a pair of groups.
- the pattern recognition oriented cluster process 200 also executes a generate pattern matrix process 255.
- the generate pattern matrix process 255 generates and outputs to, for example, but not limited to, an output file or a computer terminal screen, a pattern matrix of the grouped objects of the set of objects to be clustered.
- the resultant pattern matrix provides a visual representation of the grouping, similarity, and relative similarity of the objects in the set of objects to be clustered.
- the objects in the set of objects to be clustered are chemical compounds. Each compound interacts with, or is otherwise placed in, one or more environments.
- an environment is a mutant strain or a gene expression.
- Each raw data value, or compound value, in a respective input file is an interaction value between the respective compound and a mutant strain or gene expression.
- the object similarity score for each object, e.g., compound, pair is a value indicative of the similarity of the objects in the pair when interacting with the various mutant strains in the set of mutant strains or the various gene expressions in the set of gene expressions.
- an input data process 210 is executed. Generally, the input data process 210 inputs the data relating to the set of compounds to be clustered from an input file.
- the input data process 210 allows 302 the input file to contain either raw data or object similarity score data for a set of compounds to be clustered. If the input data comprises object similarity score values 304, the data is read in 308 from the input file. Each object similarity score value read in from the respective input file is then stored 310 in an array of object similarity score values, e.g., DATA array. More specifically, each object similarity score for a compound I - compound J pair is stored in DATA[I][J]. When all the object similarity scores are read in from the input file and stored in the respective array, the input data process 210 is ended 312.
- the raw data for each compound-mutant strain pair is read into an array.
- any raw data value for a compound I - mutant strain J pair that is outside an upper or lower threshold value is set 316 to the respective threshold value.
- the raw data value for any compound I - mutant strain J pair is set 316 to the high threshold value.
- the raw data value for the compound I - mutant strain J pair is set 316 to the low threshold value.
- an object similarity score for compound I is generated 318.
- Each generated object similarity score for a compound I - compound J pair is stored 320 in an array of object similarity scores, e.g., DATA array. More specifically, each generated object similarity score for a compound I - compound J pair is stored in a respective DATA[I][J] entry.
- an input data process 350 requests the user to input one of two data formats to a computer, i.e., a computer or other processing device or entity, executing the pattern recognition oriented cluster process 200.
- the user is requested to input 355 a value of one ("1") if the input file to be used for the pattern recognition oriented cluster process 200 comprises raw data for a set of compounds to be clustered, and a set of mutant strains.
- the user is requested to input 355 a value of two ("2") if the input file comprises object similarity scores for compound pairs in the respective set of compounds to be clustered.
- the user is then requested to input 360 to the computer the input file name for either the raw data, or the object similarity score data.
- the user is also requested to input 365 to the computer the total number of compounds, e.g., N, in the set of compounds to be clustered.
- the total number of compounds in a set of compounds to be clustered should be in the range of two to five hundred.
- the input data process 350 then checks 370 the user-indicated data format for the input file. If the data format is equal to two ("2") 372, meaning the input file contains object similarity score data, the input data process 350 reads in 374 from the input file the names of each of the compounds in the set of compounds to be clustered. The input data process 350 stores 374 each compound name in an appropriate entry in a table, e.g., NAME[ ].
- the object similarity score for the first I th compound - first J th compound pair is stored in DATA[0][0].
- the object similarity score for the third I th compound - second J th compound pair is stored in DATA[2][1].
- the array of object similarity scores e.g., DATA array
- DATA array is indexed from zero.
- DATA array is indexed from one.
- the input data process 350 checks 370 the user-indicated data format for the input file. If the data format is equal to one ("1") 384, meaning the input file comprises raw data for a set of compounds to be clustered and a set of mutant strains, the input data process 350 requests the user to input 386 to the computer, i.e., computer or other processing device or entity, the total number of mutant strains, e.g., STRN, used for generating the raw data. In a presently preferred embodiment, the total number of mutant strains should be in the range of one to fifty.
- the user is also requested to input 388 to the computer a high limit raw data value, e.g., HIGHROUND.
- the suggested value for HIGHROUND is one hundred.
- the user is also requested to input 390 to the computer a low limit raw data value, e.g., LOWROUND.
- the suggested value for LOWROUND is zero.
- the high limit raw data value and the low limit raw data value are used to bound the input raw data for a set of compounds to be clustered, and a set of mutant strain pairs.
- the input data process 350 then loops I 392 times, once for each of the compounds in the set of compounds, e.g., N compounds, to be clustered.
- the input data process 350 for each compound I, i.e., I th compound, loops J 394 times, once for each entry in the input file for compound I.
- the input data process 350 checks 396 if the input entry from the input file is the first entry for compound I. In a presently preferred embodiment, if it is 426 the first entry for compound I, it is the compound code identification for compound I.
- the input data process 350 reads in 402 the compound code for compound I from the input file, and stores it in an appropriate entry in a table, e.g., CODE[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
- the input data process 350 checks 398 if it is the second entry in the input file for compound I. In a presently preferred embodiment, if it is 430 the second entry for compound I, it is the compound name for compound I.
- the input data process 350 reads in 404 the compound name for compound I from the input file, and stores it in an appropriate entry in a table, e.g., NAME[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
- the input data process 350 checks 400 if it is the third entry in the input file for compound I. In a presently preferred embodiment, if it is 434 the third entry for compound I, it is the compound description identification for compound I.
- the input data process 350 reads in 406 the compound description for compound I from the input file, and stores it in an appropriate entry in a table, e.g., DESC[ ]. The input data process 350 then loops J 394, to process the next input entry for compound I.
- if the current input entry for compound I is not 438 the third entry in the input file for compound I, then it is a raw data value, for example, but not limited to, an interaction value or an inhibitor value, for compound I and one of the set of mutant strains.
- the input data process 350 reads in 408 the raw data value for the compound I - mutant strain K pair and stores it in an appropriate entry in a temporary array, e.g., V[I][K].
- although the mutant strain K value is related to the loop J 394 value, it is not the same, as loop J 394 loops through more than the respective raw data values for a compound I.
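The entry-by-entry reading described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `parse_compound_record`, the list-of-strings input, the zero-based loop counter, and the relation K = J - 3 are all assumptions made for the example (the text only states that K is related to, but not the same as, J).

```python
def parse_compound_record(entries, strn):
    """Split one compound's input entries into code, name, description and raw values.

    `entries` is a hypothetical list of strings for compound I; `strn` is the
    total number of mutant strains (STRN in the text).
    """
    code = name = desc = None
    values = [0.0] * strn                # one row of the temporary array V
    for j, entry in enumerate(entries):  # loop J over every entry for compound I
        if j == 0:
            code = entry                 # first entry: compound code, kept in CODE[ ]
        elif j == 1:
            name = entry                 # second entry: compound name, kept in NAME[ ]
        elif j == 2:
            desc = entry                 # third entry: description, kept in DESC[ ]
        else:
            k = j - 3                    # assumed mapping from loop J to mutant strain K
            values[k] = float(entry)     # raw data value for the compound I - strain K pair
    return code, name, desc, values
```

Because the first three entries are identifiers rather than measurements, the strain index necessarily trails the loop counter, which is the distinction the text draws between K and J.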
- An exemplary array V 450 depicts raw data, for example, but not limited to, interaction values or inhibitor values, for n compounds 452 and p mutant strains 454.
- raw data e.g., for example, but not limited to, interaction values or inhibitor values
- exemplary array V 450 there are n times p (n x p) raw data values 456; i.e., one raw data value for each compound n - mutant strain p pair.
- the input data process 350 checks 410 whether the input raw data value for compound I - mutant strain K, e.g., V[I][K], is greater than the high limit raw data value, e.g., HIGHROUND. If it is 412, the input raw data value is set 414 to HIGHROUND, e.g., V[I][K] is set equal to HIGHROUND. The input data process 350 then loops J 394, to process the next input entry for compound I.
- the high limit raw data value e.g., HIGHROUND.
- the input data process 350 checks 418 whether the input raw data value, e.g., V[I][K], is less than the low limit raw data value, e.g., LOWROUND. If it is 420, the input raw data value is set 422 to LOWROUND, e.g., V[I][K] is set equal to LOWROUND. The input data process 350 then loops J 394, to process the next input entry for compound I.
- the low limit raw data value e.g., LOWROUND.
- the input data process 350 loops J 394, to process the next input entry value for compound I.
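The bounding of raw data values by HIGHROUND and LOWROUND amounts to a simple clamp. A minimal sketch, assuming the suggested limits of one hundred and zero as defaults; the function name `bound_raw_value` is illustrative only:

```python
def bound_raw_value(value, highround=100.0, lowround=0.0):
    """Bound a raw data value V[I][K] to the range [lowround, highround]."""
    if value > highround:   # check 410: above the high limit raw data value
        return highround
    if value < lowround:    # check 418: below the low limit raw data value
        return lowround
    return value            # already within bounds, left unchanged
```

For example, a raw value of 120 would be stored as 100, a value of -5 as 0, and a value of 42.5 unchanged.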
- the input data process 350 loops J 440 times, once for all the compounds up to and including the I th compound. For example, if compound I is the fifth compound to be processed by the input data process 350, then the input data process 350 loops J five times, through the first to fifth compounds. For each loop J 440, the input data process 350 generates an object similarity score for the compound I - compound J pair.
- the object similarity score generated for a compound I - compound J compound pair is a correlation value.
- the correlation value i.e., the similarity value or measure, for a compound I - compound I pair is one, as compound I is being correlated with itself.
- a correlation value for a compound I - compound J compound pair is generated from at least one raw data value for compound I and at least one raw data value for compound J.
- A presently preferred embodiment of an equation for generating a correlation value for a compound I - compound J pair is shown in Equation 1.
- In Equation 1, I and J are the raw data values for compounds I and J and the respective mutant strains, and Ī and J̄ are the respective mean values for the compounds I and J.
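The equation itself is not reproduced in this text. Based on the symbol definitions given for Equation 1 and on the step-by-step computation described for Equations 2 through 7 (mean-centered values, running sums of products and squares, and a quotient over a square root), it appears to be the standard Pearson correlation coefficient. A reconstruction under that assumption, with k indexing the p mutant strains:

```latex
r_{IJ} \;=\;
\frac{\sum_{k=1}^{p} \left(I_k - \bar{I}\right)\left(J_k - \bar{J}\right)}
     {\sqrt{\sum_{k=1}^{p} \left(I_k - \bar{I}\right)^{2} \;
            \sum_{k=1}^{p} \left(J_k - \bar{J}\right)^{2}}}
```

Under this form, a compound correlated with itself gives a numerator equal to the radicand's square root, so the value is one, which agrees with the statement that compound I - compound I pairs have a correlation value of one.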
- After each correlation value is generated 442 for a respective compound I - compound J pair, the input data process 350 loops J 440, to generate the correlation value for the next compound I - compound J pair. When all compounds J have been looped 440 through, the input data process 350 loops 392 to process the next compound I raw data from the input file. When all compounds I have been looped 392 through, i.e., their respective data is input and processed, the input data process 350 is ended 449.
- the correlation value for a compound pair is stored in an array, e.g., DATA[ ][ ].
- the correlation value for a compound I - compound J pair is stored in DATA[I][J].
- the array of correlation values, e.g., DATA is indexed from zero.
- DATA[0][0] represents the correlation value for the first compound-first compound pair.
- DATA[1][3] represents the correlation value for the second compound-fourth compound pair.
- the DATA array is indexed from one.
- An exemplary DATA array 460 of correlation values is shown in Figure 12, for n compounds.
- DATA array 460 has n compound rows 462 and n compound columns 464.
- as each correlation value 466 of DATA array 460 is a correlation value for a compound - compound pair, the DATA array 460 is indexed by compound for both its rows and its columns.
- a DATA array 460 need only have valid entries in its lower half 468.
- the top half 470 of a DATA array 460 is a mirror image of the correlation values in the lower half 468 of the DATA array 460.
- the correlation values on the diagonal 472 of a DATA array 460 all have a value of one. This is because the correlation values on the diagonal 472 are the values for the identical compound pairs, i.e., compound I - compound I pairs.
- the correlation values i.e., similarity values or measures or scores, are necessarily one for a compound I - compound I pair as compound I is being correlated with itself.
- an object similarity score is generated 442 for each compound pair in the set of compounds to be clustered if object similarity scores are not provided in the input file, or otherwise provided to the pattern recognition oriented cluster process 200.
- an object similarity score is a correlation value
- a correlation value for a compound pair is generated from at least one raw data, or compound, value for the first compound of the compound pair and at least one raw data, or compound, value for the second compound of the compound pair.
- the correlation value for a compound pair is derived from all the raw data values for the respective compounds of the compound pair and the mutant strains from the set of mutant strains.
- summation variables e.g., SUM_AA, SUM_AB and SUM_BB are initialized to a value of zero 482.
- the calculate correlation value process 480 generates a correlation value for a compound I - compound J pair.
- the average raw data value, e.g., AVE_A, for compound I is generated 483.
- all compound I - mutant strain raw data values, for all mutant strains are added together, and the sum is divided by the total number of mutant strains to generate the average raw data value, e.g., AVE_A, for compound I.
- the average raw data value, e.g., AVE_B, for compound J is also generated 484.
- all compound J - mutant strain raw data values, for all mutant strains are added together, and the sum is divided by the total number of mutant strains to generate the average raw data value, e.g., AVE_B, for compound J.
- the calculate correlation value process 480 then loops K 485 times, once for each of the mutant strains in the set of mutant strains. For each K loop 485, the calculate correlation value process 480 sets 486 the entry in the temporary array of raw data for the compound I - mutant K pair, e.g., V[I][K], to its original raw data value minus the average raw data value for compound I, e.g., AVE_A, as shown in Equation 2.
- V[I][K] = V[I][K] - AVE_A (Equation 2)
- the calculate correlation value process 480 also sets 487 the entry in the temporary array of raw data for the compound J - mutant K pair, e.g., V[J][K], to its original raw data value minus the average raw data value for compound J, e.g., AVE_B, as shown in Equation 3.
- V[J][K] = V[J][K] - AVE_B (Equation 3)
- each raw data, or compound, value, for a respective compound is decreased, or otherwise altered, by the average raw data value for the compound.
- the calculate correlation value process 480 also generates 488 a new value of SUM_AB, SUM_AA and SUM_BB. More specifically, for each loop K 485, the new value for the compound I - mutant K pair in the temporary array of raw data, e.g., V[I][K], and the new value for the compound J - mutant K pair in the temporary array of raw data, e.g., V[J][K], are multiplied together and added to a running sum, e.g., SUM_AB, as shown in Equation 4
- the raw data value for each compound I - mutant K pair, altered by the average raw data value for compound I is multiplied by the raw data value for the respective compound J - mutant K pair, altered by the average raw data value for compound J.
- a running summation, e.g., SUM_AB, of these multiplications is also generated.
- the new value for the compound I - mutant K pair in the temporary array of raw data, e.g., V[I][K] is squared, i.e., multiplied with itself, and added to a running sum, e.g., SUM_AA, as shown in Equation 5.
- the new value for the compound J - mutant K pair in the temporary array of raw data, e.g., V[J][K], is squared, i.e., multiplied with itself, and added to a running sum, e.g., SUM_BB, as shown in Equation 6.
- the raw data value for each compound J - mutant K pair, altered by the average raw data value for compound J, is multiplied by itself.
- a running summation, e.g., SUM_BB, of these multiplications is also generated.
- the correlation value for the compound I - compound J pair is generated 490.
- the correlation value for a compound I - compound J pair is set to SUM_AB divided by the square root of the multiplication of SUM_AA and SUM_BB, as shown in Equation 7.
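The calculate correlation value process 480 can be sketched in a few lines. This is an illustrative sketch rather than the patented code: the function name `correlation_value` and the use of two plain lists in place of rows of the array V are assumptions, and the centered values are held in local variables instead of being written back into V. The running sums correspond to SUM_AB, SUM_AA and SUM_BB in the text.

```python
import math


def correlation_value(row_a, row_b):
    """Correlation value for a compound pair from their raw data rows.

    row_a and row_b hold the raw data values of compounds I and J for the
    same mutant strains, in the same order.
    """
    strn = len(row_a)
    ave_a = sum(row_a) / strn          # AVE_A: average raw data value for compound I
    ave_b = sum(row_b) / strn          # AVE_B: average raw data value for compound J
    sum_ab = sum_aa = sum_bb = 0.0     # summation variables initialized to zero
    for k in range(strn):              # loop K over the mutant strains
        a = row_a[k] - ave_a           # Equation 2
        b = row_b[k] - ave_b           # Equation 3
        sum_ab += a * b                # running sum of products (SUM_AB)
        sum_aa += a * a                # running sum of squares for compound I (SUM_AA)
        sum_bb += b * b                # running sum of squares for compound J (SUM_BB)
    # Final step: SUM_AB divided by the square root of SUM_AA times SUM_BB
    return sum_ab / math.sqrt(sum_aa * sum_bb)
```

Note that a compound whose raw data values are identical across all strains yields a zero denominator; the text does not say how that case is handled, so the sketch leaves it unguarded.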
- a determine correlation distribution process 215 is executed.
- the determine correlation distribution process 215 generates a distribution of the correlation values, i.e., similarity measures, for all compound pairs in the set of compounds to be clustered.
- the determine correlation distribution process 215 loops 502 through each unique correlation value in the array of correlation values, e.g., DATA array.
- the unique correlation values for a set of compounds to be clustered are stored in the lower half, or bottom, 468 of the DATA array, below, and not including, the diagonal 472.
- the determine correlation distribution process 215 loops 502 through the lower half 468 of the DATA array, i.e., through the unique correlation values.
- the determine correlation distribution process 215 multiplies 504 each unique correlation value by ten ("10").
- the determine correlation distribution process 215 determines 506 the group, of a number of groups of correlation ranges, that the respective unique correlation value, multiplied by a factor of ten, is within.
- Table 1 is a presently preferred embodiment of the groups of correlation ranges.
- Column A of Table 1 represents the presently preferred range of correlation values for each of the nineteen groups.
- Column B of Table 1 represents the presently preferred range of correlation values for each of the nineteen groups multiplied by a factor of ten.
- the determine correlation distribution process 215 keeps a running summation 508, e.g., COUNT[ ], of the number of unique correlation values from the DATA array that are within each of the nineteen groups of correlation ranges.
- the determine correlation distribution process 215 also keeps a running summation 510, e.g., SUM, of all the unique correlation values in the DATA array.
- a running summation 510 e.g., SUM
- the determine correlation distribution process 215 generates 512 the average correlation value from the running summation, e.g., SUM, for all the unique correlation values, i.e., for all unique compound I - compound J pairs, where I does not equal J.
- the determine correlation distribution process 215 also determines 514 the percentage of unique correlation values in each of the nineteen groups of correlation ranges, from the various stored running summations, e.g., COUNT[ ], for all the unique correlation values, i.e., for all unique compound I - compound J pairs, where I does not equal J.
- the determine correlation distribution process 215 determines 514 the percentage of unique correlation values in the DATA array that are in the 1st group, e.g., COUNT[0], of correlation ranges, the percentage of unique correlation values in the DATA array that are in the 2nd group of correlation ranges, e.g., COUNT[1], and so on.
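The distribution step can be sketched as below. Table 1 is not reproduced in this text, so the bin layout is an assumption: the sketch takes the nineteen groups to run from the highest range (group index 0) down to the lowest (index 18), obtained by truncating ten times the correlation value toward zero. The function name `correlation_distribution` is likewise illustrative.

```python
def correlation_distribution(data, n):
    """Average and per-group percentages of the unique correlation values.

    Only the lower half of DATA (below the diagonal) is read.
    """
    count = [0] * 19                    # COUNT[ ]: one running count per group
    total = 0.0                         # SUM: running sum of unique correlation values
    pairs = 0
    for i in range(1, n):
        for j in range(i):              # lower half, diagonal excluded
            r = data[i][j]
            scaled = int(r * 10)        # multiply by ten; int() truncates toward zero
            group = min(max(9 - scaled, 0), 18)   # assumed layout, clamped to 0..18
            count[group] += 1
            total += r
            pairs += 1
    average = total / pairs
    percent = [100.0 * c / pairs for c in count]
    return average, percent
```

With this assumed layout, COUNT[0] through COUNT[2] cover correlation values of roughly 0.7 and above, which is consistent with the later check of the first three groups against the default initial correlation group limit of 0.7.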
- a set grouping parameters process 220 is executed.
- the set grouping parameters process 220 allows the user of the pattern recognition oriented cluster process 200 to choose whether to use default, pre-established grouping and pattern matrix generation (or group selection) parameters, or to input their own.
- a presently preferred embodiment of the set grouping parameters process 220 requests 566 the user to input to the computer, i.e., computer or other processing device or entity, whether they wish to change the default criteria for establishing groups of compounds.
- the user is requested 566 to indicate whether or not they wish to choose their own group formation selection, or correlation selecting, or grouping, limits, or parameters.
- the set grouping parameters process 220 expects a YES or NO response from the user. If the user responds with a NO 567, indicating that they wish to use the default correlation selecting limits, or grouping parameters, the set grouping parameters process 220 checks 565 whether there are any correlation values in the DATA array that have values that are greater than or equal to the default first group selection parameter value.
- the set grouping parameters process 220 checks the number of correlation values in the first, second and third groups of correlation ranges, e.g., COUNT[0], COUNT[1] and COUNT[2], which were set in the determine correlation distribution process 215.
- the first group selection parameter is an initial correlation group limit parameter.
- the default initial correlation group limit parameter e.g., INICORR
- the default initial correlation group limit parameter has a value of 0.7.
- the set grouping parameters process 220 ends 570 the pattern recognition oriented cluster process 200. This is because there are no groups that can be formed with the default group selection parameters, or correlation selecting limits.
- the set grouping parameters process 220 sets 571 each of the group selection parameters to a default value.
- a first group selection parameter is used for selecting a first compound pair for a new group; it is a group core defining limit.
- the first pair of compounds of a new group of compounds must have a correlation value greater than or equal to the first group selection parameter.
- the first group selection parameter is an initial correlation group limit parameter, e.g., INICORR, whose default value is 0.7.
- a second group selection parameter is used for selecting an add-in, or new, compound for a group.
- the second group selection parameter is an add-in score, or limit or value, that defines the nearest neighbor compounds of the core group of compounds.
- an add-in, or new, compound to a group of compounds must, with a compound member of the group, have a correlation value greater than or equal to the second group selection parameter.
- the second group selection parameter is an add-in correlation group limit parameter, e.g., MAXCORR, whose default value is 0.6.
- a third group selection parameter is used for defining the furthest neighbor compounds in a core group of compounds.
- an add-in, or new, compound to a group of compounds must, with each compound member of the group, have a correlation value greater than or equal to the third group selection parameter.
- the third group selection parameter is a minimum correlation group limit parameter, e.g., MINCORR, whose default value is 0.4.
- a fourth group selection parameter is used to define the compactness of the established groups of compounds.
- the fourth group selection parameter is a mean distance score, or limit, of a potential add-in compound and all the compounds already in the group.
- the average correlation value for the pairs of compounds comprised of the potential add-in compound and each compound already in the group must be greater than or equal to the fourth group selection parameter.
- the fourth group selection parameter is a minimum average correlation group limit parameter, e.g., AVECORR, whose default value is 0.5.
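The four group selection parameters and their default values can be collected in one small structure. A minimal sketch: the class name `GroupingParameters` and the helper `any_initial_pair` are hypothetical; only the parameter names and default values come from the text.

```python
from dataclasses import dataclass


@dataclass
class GroupingParameters:
    inicorr: float = 0.7   # INICORR: limit for the first compound pair of a new group
    maxcorr: float = 0.6   # MAXCORR: add-in limit against at least one group member
    mincorr: float = 0.4   # MINCORR: minimum limit against every group member
    avecorr: float = 0.5   # AVECORR: minimum average correlation for the group


def any_initial_pair(data, n, params):
    """True if some unique compound pair reaches the initial correlation limit.

    If this is False, no group can be formed with the given parameters.
    """
    return any(
        data[i][j] >= params.inicorr
        for i in range(1, n)
        for j in range(i)      # lower half of DATA, diagonal excluded
    )
```

A user who chooses their own limits would simply construct the structure with other values, e.g., `GroupingParameters(inicorr=0.8)`.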
- the set grouping parameters process 220 requests the user to input 569 a value for each of four correlation group selection parameters.
- the set grouping parameters process 220 requests the user to input an initial correlation value for selecting a first compound pair for a new group, e.g. INICORR.
- the set grouping parameters process 220 also requests the user to input a correlation value for adding a new compound to an existing group, e.g., MAXCORR.
- the set grouping parameters process 220 also requests the user to input an average correlation value, e.g., AVECORR, for all compounds in a group, whenever a new compound is a possible addition to the group.
- the set grouping parameters process 220 also requests the user to input a minimum correlation value for any compound to be added to a group, e.g., MINCORR.
- the set grouping parameters process 220 sets 572 the default number of output color groups, e.g., COLOR[ ], to six, for use by a generate pattern matrix process 255 of the pattern recognition oriented cluster process 200, for formatting and outputting a pattern matrix.
- the set grouping parameters process 220 then requests 575 the user to indicate whether they wish to use their own number of output color groups.
- the set grouping parameters process 220 sets 580 each of the six default color groups to the lower correlation value of their correlation value range. Specifically, in a presently preferred embodiment, the set grouping parameters process 220 sets each of the default color groups, e.g., COLOR[0] to COLOR[5], to a default lower correlation value.
- the default correlation value range and the default lower correlation value for each of the six default color groups are shown in Table 2.
- the user is requested 578 to input to the computer, i.e., computer or other processing device or entity, the number of color groups, e.g., M, that they want.
- the maximum number of color groups that a user may designate is ten.
- For each color group, e.g., COLOR[ ], the set grouping parameters process 220 then requests 579 the user to input to the computer the respective lower correlation value. In a presently preferred embodiment, the set grouping parameters process 220 suggests that the user set the final color group lower correlation value to -1.0. In a presently preferred embodiment, COLOR[ ] is indexed from zero. The set grouping parameters process 220, for each color group, e.g., X from one to the total number of color groups, e.g., M, sets COLOR[X-1] to the user inputted lower correlation value.
- COLOR[ ] is indexed from one.
- the set grouping parameters process 220 for each color group X from one to the total number of color groups M sets COLOR[X] to the user inputted lower correlation value.
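One way the COLOR[ ] lower correlation values might be used when formatting the pattern matrix is sketched below. This lookup is an assumption, since the text here only describes how COLOR[ ] is filled, and Table 2 is not reproduced; the function name `color_group` and any threshold values passed to it are hypothetical.

```python
def color_group(r, color):
    """Index of the first color group whose lower correlation value r reaches.

    `color` is a zero-indexed list of lower correlation values, assumed to be
    in descending order with a final entry of -1.0.
    """
    for x, lower in enumerate(color):
        if r >= lower:
            return x
    return len(color) - 1   # fallback if the final lower value is above r
```

Setting the final lower correlation value to -1.0, as the process suggests, guarantees that every correlation value falls into some color group.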
- the set grouping parameters process 220 is ended 581. Referring again to Figure 8, upon executing the set grouping parameters process 220 in the pattern recognition oriented cluster process 200, a group compounds process 225 is executed.
- the group compounds process 225 groups the compounds in the set of compounds to be clustered into one or more groups, based on default, or alternatively, user-specified, group selection parameters, or correlation selecting limits or cluster parameters.
- the group compounds process 225 first writes 542 the mirror image of the bottom half of the correlation value array, e.g., DATA, to the top half of the array.
- the group compounds process 225 writes 542 all the values in the bottom half 651 of the DATA array to the respective entries in the top half 653 of the array.
- the resultant DATA array e.g. DATA array 655 of Figure 17B
- the correlation value in DATA[2][1] is written to DATA[1][2].
- the correlation value in DATA[3][1] is written to DATA[1][3]
- the correlation value in DATA[3][2] is written to DATA[2][3].
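Writing the mirror image of the bottom half of the DATA array to the top half can be sketched as follows; the function name `mirror_lower_half` is illustrative, and the array is taken to be a zero-indexed list of lists.

```python
def mirror_lower_half(data, n):
    """Copy each entry below the diagonal of DATA to its mirrored position above it."""
    for i in range(n):
        for j in range(i):            # entries below the diagonal only
            data[j][i] = data[i][j]   # e.g., DATA[2][1] is written to DATA[1][2]
    return data
```

After this step the array is symmetric, so a correlation value can be looked up with the compounds in either order.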
- the group compounds process 225 loops 526 through all compound pairs in the DATA array.
- the group compounds process 225 attempts to locate 528 an initial compound pair for a new group, e.g., G; i.e., the group compounds process 225 attempts to locate 528 a compound pair whose correlation value is greater than or equal to the initial correlation group limit parameter INICORR.
- the presently preferred default value for INICORR is 0.7.
- the first compound pair with a correlation value greater than or equal to the default value of INICORR consists of compounds one and three; i.e., DATA[1][3] is equal to 0.7.
- DATA[1][1], while having a correlation value of 1.0, which is greater than INICORR equal to 0.7, is not the first compound pair, as it is comprised of a single compound rather than a compound pair.
- the first two compounds of the new group G are compound one and compound three.
- the group compounds process 225 upon locating 528 an initial pair of compounds for a new group G, loops 529 through all the compound pairs, now attempting to locate 530 a potential compound to add into group G. More specifically, in a presently preferred embodiment, the group compounds process 225 loops 529 through all the possible compound pairs looking for a potential add-in compound, i.e., a compound to be added to the group G, whose correlation value with a compound already in group G is greater than or equal to the add-in correlation group limit parameter MAXCORR.
- a potential add-in compound i.e., a compound to be added to the group G, whose correlation value with a compound already in group G is greater than or equal to the add-in correlation group limit parameter MAXCORR.
- the presently preferred default value for MAXCORR is 0.6.
- the first compound that, when paired with compound one or compound three of the present group G, has a correlation value greater than or equal to the default value of MAXCORR is compound six; i.e., DATA[3][6] is equal to 0.8.
- compound six is a potential add-in compound to the present group G.
- DATA[1][1] and DATA[1][3] have a correlation value greater than or equal to MAXCORR; i.e., compound one - compound one has a correlation value of one, and the compound one - compound three pair has a correlation value of 0.7.
- compounds one and three, corresponding to the respective correlation values DATA[1][1] and DATA[1][3] are already in group G.
- the first correlation value e.g., DATA[3][1]
- the first correlation value has a correlation value greater than MAXCORR.
- compounds three and one, corresponding to the respective correlation value DATA[3][1] are already in group G.
- the next entry in row 662 with a correlation value greater than MAXCORR is
- Upon locating 530 a potential add-in compound for group G, the group compounds process 225 checks 531 whether the correlation values of the potential add-in compound paired with each compound already in group G are all greater than or equal to the minimum correlation group limit parameter MINCORR.
- the presently preferred default value for MINCORR is 0.4.
- the compounds currently in group G are one and three.
- the potential add-in compound for group G is six.
- the correlation value for the compound one - compound six pair, e.g., DATA[1][6], is 0.5, which is greater than the default value of MINCORR.
- the correlation value for the compound three - compound six pair, e.g., DATA[3][6], is 0.8, which is greater than the default value of MINCORR.
- the group compounds process 225 loops 529 to the next compound pair in search of a potential add-in compound for group G. In a presently preferred embodiment, if all the correlation values of the potential add-in compound paired with each compound already in group G are 533 greater than or equal to MINCORR, the group compounds process 225 checks 534 whether, assuming the potential add-in compound were included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter AVECORR.
- the presently preferred default value for AVECORR is 0.5.
- the compounds currently in group G are one and three, and the potential add-in compound for group G is six.
- the group compounds process 225 checks the average correlation value of the compound pairs compound one - compound three, e.g., DATA[1][3], compound one - compound six, e.g., DATA[1][6] and compound three - compound six, e.g., DATA[3][6].
- the correlation value for the compound one - compound three pair, e.g., DATA[1][3], is 0.7.
- the correlation value for the compound one - compound six pair, e.g., DATA[1][6], is 0.5.
- the correlation value for the compound three - compound six pair, e.g., DATA[3][6], is 0.8.
- the average correlation value of DATA[1][3], DATA[1][6] and DATA[3][6] is 0.67, which is greater than the default value of AVECORR.
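The arithmetic behind the average correlation check in this example, using the three pair values stated above and the default AVECORR of 0.5, works out as:

```latex
\frac{\mathrm{DATA}[1][3] + \mathrm{DATA}[1][6] + \mathrm{DATA}[3][6]}{3}
  \;=\; \frac{0.7 + 0.5 + 0.8}{3}
  \;=\; \frac{2.0}{3}
  \;\approx\; 0.67 \;\geq\; 0.5
```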
- the group compounds process 225 loops 529 to the next compound pair, in search of a potential add-in compound for group G.
- the group compounds process 225 checks 537 whether the potential add-in compound is already a member of group G.
- the group compounds process 225 loops 529 to the next compound pair, in search of a potential add-in compound for group G. If, however, the potential add-in compound is not 539 a member of group G, the group compounds process 225 adds 540 the potential add-in compound to group G. Thus, in the example of DATA array 660 of Figure 18, compound six is added to group G, which is already comprised of compounds one and three. The group compounds process 225 then loops 529 to the next compound pair, in search of another potential add-in compound for group G.
- once the group compounds process 225 has looped 529 through all the possible compound pairs for group G, it then loops 526 to the next compound pair for creating a new group G of compounds.
- once the group compounds process 225 has looped 526 through all compound pairs for initiating new groups of compounds, it is ended 527.
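The selection of an initial pair and the growth of a single group G can be sketched as below. This is a simplified illustration under stated assumptions, not the patented process: the function names `find_initial_pair` and `grow_group` are hypothetical, the DATA array is taken to be zero-indexed and already mirrored, the candidate search is a single pass over all pairs, and the average correlation check follows the worked example (all pairs in the group once the candidate is included). How multiple groups are started and reconciled is left out.

```python
def find_initial_pair(data, n, inicorr=0.7):
    """First compound pair whose correlation value is >= INICORR, or None."""
    for x in range(n):
        for y in range(x + 1, n):   # skip the diagonal: one compound is not a pair
            if data[x][y] >= inicorr:
                return x, y
    return None


def grow_group(data, n, x, y, maxcorr=0.6, mincorr=0.4, avecorr=0.5):
    """Grow group G from the initial pair (x, y) by adding qualifying compounds."""
    group = [x, y]
    for a in range(n):
        for b in range(n):
            if a not in group or b in group:
                continue            # need a current member a and a non-member b
            if data[a][b] < maxcorr:
                continue            # add-in limit with some member not reached
            if any(data[m][b] < mincorr for m in group):
                continue            # below MINCORR with at least one member
            trial = group + [b]
            pairs = [data[p][q] for i, p in enumerate(trial) for q in trial[i + 1:]]
            if sum(pairs) / len(pairs) < avecorr:
                continue            # group would fall below AVECORR
            group.append(b)         # compound b joins group G
    return group
```

Applied to the example values above (0.7 for compounds one and three, 0.5 for one and six, 0.8 for three and six, and low values elsewhere), the sketch selects compounds one and three as the initial pair and then adds compound six.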
- a more detailed presently preferred embodiment of a group compounds process 650 first writes 648, or otherwise fills in, the mirror image of the bottom half of the correlation value array, e.g., DATA, to the top half of the array.
- the correlation value array e.g., DATA
- the group compounds process 650 then loops 604 through all the compound pairs, attempting to locate 600 a compound X - compound Y pair whose correlation value, e.g., DATA[X][Y], is greater than or equal to INICORR, i.e., the initial correlation group limit parameter and the minimum value for selecting a first compound pair for a group.
- when a compound X - compound Y pair whose correlation value is greater than or equal to INICORR is located 600, a new group G of compounds is initiated.
- the group compounds process 650 then loops 601 through each correlation value in the array of correlation values, e.g., DATA. If all correlation values have been looped 601 through for the current group G, the group compounds process 650 continues its loop 604 through all the compound pairs, attempting to locate 600 a new compound X - compound Y pair for initiating a new group G. If all compound pairs have been looped 604 through for generating new groups of compounds, the group compounds process 650 ends 602. As previously described, the group compounds process 650 loops 601 through each correlation value in the DATA array.
- for each correlation value in DATA, the group compounds process 650 checks 603 whether the row compound of the correlation value, e.g., DATA[row compound][column compound], is equal to compound X, which is the row compound of the initial compound pair of group G, and, thus, already a member of group G. If the row compound of the current correlation value being processed, i.e., the current correlation value, is equal to 619 compound X, the group compounds process 650 loops 601 to the next correlation value in the DATA array.
- the group compounds process 650 loops 601 to the next correlation value in DATA array, i.e., DATA[1][3].
- the group compounds process 650 does not check the correlation values DATA[A][B] where A is equal to X, the row compound of the initial compound pair comprising group G.
- the group compounds process 650 does not check any correlation value in the compound 2 row of the DATA array; i.e., DATA[1][y].
- the group compounds process 650 also checks 603 whether the column compound of the correlation value, e.g., DATA[row compound][column compound], is equal to compound Y, which is the column compound of the initial compound pair of group G, and, thus, already a member of group G.
- the group compounds process 650 loops 601 to the next correlation value in the DATA array. For example, if the current correlation value row compound is 2 and column compound is 3, i.e., DATA[1][2], and compound 3 is the Y compound in group G, the group compounds process 650 loops 601 to the next correlation value in the DATA array, i.e., DATA[1][3].
- the group compounds process 650 does not check any correlation value in the compound 3 column of the DATA array; i.e., DATA[x][2].
- the group compounds process 650 checks 605 whether the row compound of the current correlation value is already a member of group G. Thus, the group compounds process 650 checks 605 whether the row compound of the current correlation value is a member of group G, but not the row compound X comprising the initial pair of compounds of group G. If there is only the initial pair of compounds in group G at this time, i.e., compounds X and Y, then the group compounds process 650 checks 605 whether the row compound of the current correlation value is equal to Y.
- exemplary group G is comprised of initial compound pair compound 1 - compound 3, where compound 1 is the X compound and compound 3 is the Y compound, i.e., DATA[0][2] is the first correlation value found that is greater than or equal to INICORR for new group G.
- the group compounds process 650 checks 605 whether the current correlation value is in the row of compound 3, i.e., DATA[2][y].
- exemplary group G is comprised of initial compound pair compound 2 - compound 4, where compound 2 is the X compound and compound 4 is the Y compound, i.e., DATA[1][3] is the first correlation value that is greater than or equal to INICORR for new group G.
- exemplary group G is also comprised of add-in compound 5.
- the group compounds process 650 checks 605 whether the current correlation value is in the row of compound 4, i.e., DATA[3][y], or in the row of compound 5, i.e., DATA[4][y].
- the group compounds process 650 checks 607 whether the current correlation value is greater than or equal to the group selection parameter for adding a compound to a group, e.g., MAXCORR.
- the default value of MAXCORR is 0.6.
- the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
- the group compounds process 650 sets 616 a flag, e.g., FLAG, to indicate that the column compound of the current correlation value may possibly be added to the group G, e.g., FLAG is set to TRUE.
- the group compounds process 650 then loops 608 through all the compounds already in group G.
- the group compounds process 650 checks 609 whether the correlation value of the group G compound - current correlation value column compound pair, e.g., DATA[compound in group G][current correlation value column compound], is less than the minimum correlation group limit parameter, e.g., MINCORR.
- the default value of MINCORR is 0.4.
- exemplary group G comprises compounds 2, 5, and 6.
- the current correlation value is the correlation value for the row compound 5 - column compound 8 compound pair.
- the group compounds process 650 checks 609 whether any of the correlation values DATA[1][7], for respective compound 2 of group G - compound 8 (column compound of current correlation value) pair, DATA[4][7], for respective compound 5 of group G - compound 8 (column compound of current correlation value) pair, or DATA[5][7], for respective compound 6 of group G - compound 8 (column compound of current correlation value) pair, is less than MINCORR.
- the group compounds process 650 sets 611 a flag, e.g., FLAG, to indicate that the column compound of the current correlation value is not to be added to group G, e.g., FLAG is set to FALSE.
- the group compounds process 650 sets 611 FLAG to FALSE.
- the group compounds process 650 checks 635 whether FLAG is set to TRUE. If no 636, the current correlation value column compound is not to be added to the current group G, and the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
- the group compounds process 650 checks whether, if the current correlation value column compound was included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR.
- a presently preferred embodiment default value for AVECORR is 0.5.
- the group compounds process 650 assumes that the current correlation value column compound is a compound of group G and sums 643 the correlation values for all the compound pairs in group G. The group compounds process 650 then determines the average 638, e.g., AVG, correlation value for all the compound pairs in group G, by dividing the summation of the correlation values for all the compound pairs in group G by the number of compound pairs in group G.
- the group compounds process 650 sums 643 the correlation values DATA[1][4], for group G compound 2 - group G compound 5 compound pair, DATA[1][5], for group G compound 2 - group G compound 6 compound pair, DATA[4][5], for group G compound 5 - group G compound 6 compound pair, DATA[1][7], for group G compound 2 - current correlation value column compound 8 compound pair, DATA[4][7], for group G compound 5 - current correlation value column compound 8 compound pair, and DATA[5][7], for group G compound 6 - current correlation value column compound 8 compound pair.
- the group compounds process 650 determines the group G compound pair correlation value average 638, e.g., AVG, by dividing the summation of the correlation values for all the group G compound pairs by the number of compound pairs in group G; in our example, the summation of the correlation values of all the compound pairs in group G is divided by six (6).
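- The averaging step can be illustrated with a short sketch. It assumes zero-based indices, so that group G compounds 2, 5 and 6 plus candidate compound 8 become indices 1, 4, 5 and 7; the function name and all six correlation values are hypothetical.

```python
from itertools import combinations


def group_pair_average(data, members):
    """Average correlation over every unordered pair of compounds in members."""
    pairs = list(combinations(members, 2))
    return sum(data[a][b] for a, b in pairs) / len(pairs)


# Hypothetical correlation values, keyed by zero-based (row, column) index.
values = {(1, 4): 0.7, (1, 5): 0.6, (4, 5): 0.8,
          (1, 7): 0.5, (4, 7): 0.6, (5, 7): 0.4}
data = [[0.0] * 8 for _ in range(8)]
for (a, b), v in values.items():
    data[a][b] = data[b][a] = v  # keep the array symmetric

trial_group = [1, 4, 5, 7]  # compounds 2, 5, 6 and candidate compound 8
avg = group_pair_average(data, trial_group)
```

- Four compounds give six unordered pairs, which is why the summation in the example is divided by six. With the made-up values above the six terms sum to about 3.6, so AVG comes out at roughly 0.6.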
- the group compounds process 650 then checks 639 whether AVG is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR. If AVG is not 640 greater than or equal to AVECORR, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, AVG is 625 greater than or equal to AVECORR, the group compounds process 650 checks 629 whether the current correlation value column compound is already a member of group G.
- the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, the current correlation value column compound is not 633 already a member of group G, the group compounds process 650 adds 646 the column compound for the current correlation value to group G. The group compounds process 650 then loops 601 to the next correlation value in the array DATA, to look for another potential add-in compound for group G.
- group G currently comprises compounds 2, 5 and 6 and the current correlation value column compound is compound 8.
- the group compounds process 650 checks 605 whether the row compound of the current correlation value, i.e., DATA[row compound][column compound], is already a member of group G. If it is not 621, the group compounds process 650 checks 606 whether the column compound of the current correlation value is already a member of group G. Thus, the group compounds process 650 checks 606 whether the column compound of the current correlation value is already a member of group G, but not the column compound Y comprising the initial pair of compounds of group G. If there is only the initial pair of compounds in group G at this time, i.e., compounds X and Y, then the group compounds process 650 checks 606 whether the column compound of the current correlation value is equal to X.
- exemplary group G is comprised of initial compound pair compound 1 - compound 3, where compound 1 is the X compound and compound 3 is the Y compound of group G.
- the group compounds process 650 checks 606 whether the current correlation value is in the column of compound 1, i.e., DATA[x][0].
- exemplary group G is comprised of initial compound pair compound 2 - compound 4, where compound 2 is the X compound and compound 4 is the Y compound of group G. Further, exemplary group G is also comprised of add-in compound 5.
- the group compounds process 650 checks 606 whether the current correlation value is in the column of compound 2, i.e., DATA[x][1], or in the column of compound 5, i.e., DATA[x][4]. If the column compound for the current correlation value is not 623 equal to a compound already in group G, other than the Y compound of group G, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G.
- the group compounds process 650 loops 601 to the next correlation value in the array DATA because compound 7, the current correlation value column compound, is not compound 2 or compound 5, the compounds in group G other than the Y compound.
- the group compounds process 650 checks 607 whether the current correlation value is greater than or equal to the group selection parameter value for adding a new compound to a group, e.g., MAXCORR.
- the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G. If the current correlation value is 615 greater than or equal to MAXCORR, the group compounds process 650 sets 617 a flag, e.g., FLAG, to indicate that the row compound of the current correlation value may possibly be added to the group G, e.g., FLAG is set to TRUE.
- the group compounds process 650 then loops 608 through all the compounds already in group G. For each compound in group G, the group compounds process 650 checks 610 whether the correlation value of the current correlation value row compound - group G compound pair, e.g., DATA[current correlation value row compound][compound in group G], is less than the minimum correlation group limit parameter, e.g., MINCORR.
- exemplary group G comprises compounds 2, 5, and 6.
- the current correlation value is the correlation value for the row compound 1 - column compound 6 compound pair.
- the group compounds process 650 checks 610 whether any of the correlation values DATA[0][1], for respective compound 1 (row compound of current correlation value) - compound 2 of group G pair, DATA[0][4], for respective compound 1 (row compound of current correlation value) - compound 5 of group G pair, or DATA[0][5], for respective compound 1 (row compound of current correlation value) - compound 6 of group G pair, is less than MINCORR.
- the group compounds process 650 sets 612 a flag, e.g., FLAG, to indicate that the row compound of the current correlation value is not to be added to group G, e.g., FLAG is set to FALSE.
- the group compounds process 650 sets 612 FLAG to FALSE.
- the group compounds process 650 checks 635 whether FLAG is set to TRUE. If no 636, the current correlation value row compound is not to be added to the current group G, and the group compounds process 650 loops 601 to the next correlation value in the DATA array, to try and locate a new potential add-in compound for group G.
- the group compounds process 650 checks whether, if the current correlation value row compound was included in group G, the average correlation value of all the compound pairs in group G is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR.
- the group compounds process 650 assumes that the current correlation value row compound is a compound of group G and sums 644 the correlation values for all the compound pairs in group G. The group compounds process 650 then determines the average 638, e.g., AVG, correlation value for all the compound pairs in group G, by dividing the summation of the correlation values for all the compound pairs in group G by the number of compound pairs in group G.
- group G is comprised of compounds 2, 5 and 6 and the current correlation value row compound is compound 1.
- the group compounds process 650 sums 644 the correlation values DATA[1][4], for group G compound 2 - group G compound 5 compound pair, DATA[1][5], for group G compound 2 - group G compound 6 compound pair, DATA[4][5], for group G compound 5 - group G compound 6 compound pair, DATA[0][1], for current correlation value row compound 1 - group G compound 2 compound pair, DATA[0][4], for current correlation value row compound 1 - group G compound 5 compound pair, and DATA[0][5], for current correlation value row compound 1 - group G compound 6 compound pair.
- the group compounds process 650 determines the group G compound pair correlation value average 638, e.g., AVG, by dividing the summation of the correlation values for all the group G compound pairs by the number of compound pairs in group G; in our example, the summation of the correlation values of all the compound pairs in group G is divided by six (6).
- the group compounds process 650 then checks 639 whether AVG is greater than or equal to the minimum average correlation group limit parameter, e.g., AVECORR. If AVG is not 640 greater than or equal to AVECORR, the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, AVG is 626 greater than or equal to AVECORR, the group compounds process 650 checks 630 whether the current correlation value row compound is already a member of group G.
- the group compounds process 650 loops 601 to the next correlation value in the array DATA, to try and locate a new potential add-in compound for group G. If, however, the current correlation value row compound is not 637 already a member of group G, the group compounds process 650 adds 647 the row compound for the current correlation value to group G. The group compounds process 650 then loops 601 to the next correlation value in the array DATA, to look for another potential add-in compound for group G.
- group G currently comprises compounds 2, 5 and 6, and the current correlation value row compound is compound 1.
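- Taken together, the add-in decision for a candidate compound (whether it is the row or the column compound of the current correlation value) can be sketched as one function. This is a simplified reading, not the flowchart itself: indices are zero-based, the three thresholds use the default values quoted above, the membership check is done first rather than last (the outcome is the same), and the function name, argument names and sample values are hypothetical.

```python
from itertools import combinations

MAXCORR = 0.6  # minimum current correlation value for considering an add-in
MINCORR = 0.4  # minimum correlation with every compound already in the group
AVECORR = 0.5  # minimum average over all pairs once the candidate is included


def may_add(data, group, member, candidate):
    """Return True if candidate passes the three add-in checks for group.

    member is the compound already in the group whose pairing with
    candidate is the current correlation value.
    """
    if candidate in group:
        return False  # already a member; nothing to add
    if data[member][candidate] < MAXCORR:
        return False  # current correlation value too weak
    if any(data[g][candidate] < MINCORR for g in group):
        return False  # FLAG would be set to FALSE
    pairs = list(combinations(list(group) + [candidate], 2))
    avg = sum(data[a][b] for a, b in pairs) / len(pairs)
    return avg >= AVECORR


# Hypothetical correlation values for compounds 2, 5, 6 and 8 (indices 1, 4, 5, 7).
values = {(1, 4): 0.7, (1, 5): 0.6, (4, 5): 0.8,
          (1, 7): 0.5, (4, 7): 0.6, (5, 7): 0.4}
data = [[0.0] * 8 for _ in range(8)]
for (a, b), v in values.items():
    data[a][b] = data[b][a] = v

accepted = may_add(data, group=[1, 4, 5], member=4, candidate=7)

# Lower one pairing below MINCORR and the same candidate is rejected.
data[5][7] = data[7][5] = 0.3
rejected = may_add(data, group=[1, 4, 5], member=4, candidate=7)
```

- In the first call compound 8 clears all three limits; in the second its correlation with compound 6 falls below MINCORR, so it is not added even though the current correlation value still meets MAXCORR.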
- the pattern recognition oriented cluster process 200 executes a determine overlapping groups process 235.
- the determine overlapping groups process 235 determines if there are any established groups whose compounds are all contained in at least one other group. For each group X of compounds, the determine overlapping groups process 235 checks whether the compounds in group X are subsumed within any other group of compounds.
- the determine overlapping groups process 235 loops X 750 times, once for each of the groups of compounds established in the group compounds process 225 of Figure 16, and/or the group compounds process 650 of Figures 19A-19C. When all the groups X of compounds are looped through, the determine overlapping groups process 235 ends 751.
- the determine overlapping groups process 235 loops Y 752 times, once for each of the groups of compounds. When all the groups Y of compounds are looped 752 through, the determine overlapping groups process 235 loops 750 to the next group X of compounds.
- the determine overlapping groups process 235 checks 753 whether group X is group Y. If so 754, the determine overlapping groups process 235 loops 752 to the next group Y. For each group Y, the determine overlapping groups process 235 also checks 753 whether group Y is already labeled an overlapping group, i.e., group Y is no longer to be processed. If group Y is 754 already labeled an overlapping group, the determine overlapping groups process 235 loops 752 to the next group Y.
- each of the compounds in group Y is checked, or compared, 760 to each of the compounds in group X.
- the determine overlapping groups process 235 next checks 756 whether all of the compounds in group Y are also in group X. If they are 758, then group Y is marked, or labeled, 759 as an overlapping group, i.e., it is no longer to be used. The determine overlapping groups process 235 then loops 752 to the next group Y. If, however, even one compound in group Y is not 757 in group X, group Y is not marked as an overlapping group at this time, and the determine overlapping groups process 235 loops 752 to the next group Y.
- if a group X comprises compounds 1, 4, 7 and 8 and a group Y comprises compounds 1, 4 and 7, group Y is marked as an overlapping group, as its compounds are completely subsumed within group X.
- group Y is not marked as an overlapping group, as it comprises a compound, compound 9, that is not also in group X.
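- The subsumption test can be sketched as a pair of nested loops over the groups. The function name is hypothetical, and the sketch adds one assumption that the text does not state: a group X that has itself been marked is skipped, so that two identical groups do not mark each other. The third sample group is also made up.

```python
def mark_subsumed(groups):
    """Return a list of flags, True where a group is contained in another group."""
    overlapping = [False] * len(groups)
    for x, group_x in enumerate(groups):
        if overlapping[x]:
            continue  # assumption: a marked group X is not used as a container
        for y, group_y in enumerate(groups):
            if x == y or overlapping[y]:
                continue  # same group, or group Y already marked
            if set(group_y) <= set(group_x):
                overlapping[y] = True  # every compound of Y is also in X
    return overlapping


groups = [[1, 4, 7, 8], [1, 4, 7], [1, 4, 9]]
flags = mark_subsumed(groups)
```

- Here the second group is marked because compounds 1, 4 and 7 all appear in the first group, while the third group is left unmarked because compound 9 does not.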
- the pattern recognition oriented cluster process 200 upon executing the determine overlapping groups process 235, executes a combine groups process 240.
- the combine groups process 240 combines one or more groups of compounds that have one or more compounds in common. For each group H of compounds, the combine groups process 240 checks whether one or more compounds in group H are also a member of another group K. If yes, the combine groups process 240 combines the group H and group K into one new group.
- the combine groups process 240 loops H 850 times, once for each of the groups of compounds established in the group compounds process 225 of Figure 16, or the group compounds process 650 of Figures 19A-19C.
- the combine groups process 240 checks 852 for each group H whether the group is marked as an overlapping group. As previously discussed, in a presently preferred embodiment, groups are marked, or labeled, as overlapping in the determine overlapping groups process 235. If the current group H is 853 marked as an overlapping group, i.e., it is not to be processed anymore, then the combine groups process 240 loops 850 to the next group H.
- the combine groups process 240 loops K 855 times, once for each group that has not already been processed as an H group. For example, if there are ten groups of compounds and current group H is the second group, then K is eight, and the combine groups process 240 loops 855 eight times, through the third through tenth groups.
- the combine groups process 240 checks 856 whether the group is marked as an overlapping group. If it is 858, the combine groups process 240 loops 855 to the next group K.
- the combine groups process 240 loops J 859 times, once for each compound in group K.
- the combine groups process 240 for each compound J in group K, loops I 860 times, once for each compound in the initial group H.
- the combine groups process 240 loops 859 to the next J compound in group K.
- the combine groups process 240 checks every compound in the H group against every compound in the K group.
- the combine groups process 240 keeps a running summation, or total, 861 of correlation values for each group H, compound I - group K, compound J pair. For example, if the current group H is comprised of compounds 2, 5 and 7 and the current group K is comprised of compounds 4 and 6, then the combine groups process 240 sums the correlation values DATA[1][3], for the group H compound 2 - group K compound 4 compound pair, DATA[4][3], for the group H compound 5 - group K compound 4 compound pair, DATA[6][3], for the group H compound 7 - group K compound 4 compound pair, DATA[1][5], for the group H compound 2 - group K compound 6 compound pair, DATA[4][5], for the group H compound 5 - group K compound 6 compound pair, and DATA[6][5], for the group H compound 7 - group K compound 6 compound pair.
- the combine groups process 240 also checks 862 if compound I of group H is the same as compound J of group K. If yes 863, the combine groups process 240 flags 865 compound I of group H as also a member of group K. In a presently preferred embodiment, the combine groups process 240 sets a flag 865, e.g., FLAG, to indicate that group H and group K overlap, e.g., FLAG is set equal to TRUE. The combine groups process 240 then loops 860 to the next compound I in group H. If compound I is not the same 864 as compound J of group K, the combine groups process 240 simply loops 860 to the next I compound in group H.
- when all compounds I for group H have been looped 860 through for all 859 compounds J of group K, the combine groups process 240 generates a group similarity score for the group H - group K group pair.
- a group similarity score for a group H - group K group pair is generated from at least the correlation value for a group H compound - group K compound pair.
- the correlation value for a compound from group H and a compound from group K comprises the group similarity score for the group H - group K group pair.
- a group similarity score is a mean cumulative distance value, or average correlation value, for a pair of groups.
- the combine groups process 240 generates 866 the mean cumulative distance value over all group H - group K compound pairs, and stores it in an array, e.g., AVG.
- the AVG[H][K] value is generated from the running summation 861 of correlation values for each group H, compound I - group K, compound J pair; it is the average of all group H, compound I - group K, compound J correlation values.
- the mean cumulative distance value for a pair of groups serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value for a pair of groups, the more similar the objects in the two groups generally are.
- group H is the second group and it is comprised of three compounds - compounds 2, 5 and 7.
- Group K is the fourth group and it is comprised of two compounds - compounds 4 and 6.
- AVG[1][3] is the average correlation value for all the compound pairs in [group H][group K].
- the value of AVG[1][3] is shown in Equation 8.
- AVG[1][3] = (DATA[1][3] + DATA[1][5] + DATA[4][3] + DATA[4][5] + DATA[6][3] + DATA[6][5]) / 6    (Equation 8)
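- The group similarity score of Equation 8 can be sketched as a double loop over the two groups. Indices are zero-based, so group H (compounds 2, 5 and 7) is [1, 4, 6] and group K (compounds 4 and 6) is [3, 5]; the function name and the six correlation values are hypothetical.

```python
def group_similarity(data, group_h, group_k):
    """Mean correlation over every (compound I in H, compound J in K) pair."""
    total = 0.0
    for j in group_k:          # loop J: each compound in group K
        for i in group_h:      # loop I: each compound in group H
            total += data[i][j]
    return total / (len(group_h) * len(group_k))


# Hypothetical correlation values, keyed by zero-based (row, column) index.
values = {(1, 3): 0.5, (1, 5): 0.7, (4, 3): 0.6,
          (4, 5): 0.8, (6, 3): 0.4, (6, 5): 0.6}
data = [[0.0] * 7 for _ in range(7)]
for (a, b), v in values.items():
    data[a][b] = data[b][a] = v

score = group_similarity(data, group_h=[1, 4, 6], group_k=[3, 5])
```

- Three compounds in group H and two in group K give the six terms of Equation 8; with the made-up values above they sum to about 3.6, so the score is roughly 0.6.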
- the combine groups process 240 checks 867 whether FLAG is set TRUE, i.e., whether there is at least one compound in group H that is also in group K. If no 871, the combine groups process 240 loops 855 to the next group K. If, however, FLAG is set TRUE 872, indicating there is at least one compound in group H that is also in group K, the combine groups process 240 writes, or otherwise stores or assigns, all of the compounds of group H that are not already a member of group K to group K. In a presently preferred embodiment, the combine groups process 240 loops I 868 times, once for each of the compounds in group H. For each compound I, the combine groups process 240 checks 869 whether it has been flagged as being in group K.
- the combine groups process 240 loops 868 to the next compound I in group H. If compound I in group H is not 874 flagged as being in group K, compound I is written, or stored, 870 to group K. The combine groups process 240 then loops 868 to the next compound I.
- the combine groups process 240 marks, or labels, 876 group H as an overlapping group, i.e., it is no longer to be used.
- the combine groups process 240 then loops 850 to the next group H, and begins the process anew.
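- The merging behaviour can be sketched as follows. This is one reading of the steps above, assuming that once group H has been written into a group K the process moves straight on to the next group H; the function name and the sample groups are hypothetical, and the similarity score bookkeeping is left out.

```python
def combine_groups(groups, overlapping):
    """Merge each unmarked group H into the first later unmarked group K sharing a compound."""
    groups = [list(g) for g in groups]   # work on copies
    overlapping = list(overlapping)
    for h in range(len(groups)):
        if overlapping[h]:
            continue  # group H is no longer to be processed
        for k in range(h + 1, len(groups)):
            if overlapping[k]:
                continue
            if set(groups[h]) & set(groups[k]):  # FLAG would be TRUE
                new = [c for c in groups[h] if c not in groups[k]]
                groups[k].extend(new)            # write H's other compounds to K
                overlapping[h] = True            # group H is no longer used
                break                            # assumption: go to the next H
    return groups, overlapping


groups = [[2, 5, 7], [4, 6], [7, 9]]
merged, flags = combine_groups(groups, [False, False, False])
```

- In this example the first and third groups share compound 7, so compounds 2 and 5 are written to the third group and the first group is marked as overlapping; the second group shares nothing and is left alone.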
- an optimally link compounds in group process 245 is executed.
- the optimally link compounds in group process 245 optimally orders the compounds in each established group of compounds, based on the respective compound pairs' object similarity score.
- the optimally link compounds in group process 245 optimally orders the compounds in each group of compounds based on the respective compound pairs' correlation value.
- the optimally link compounds in group process 245 ends 901.
- the optimally link compounds in group process 245 checks 902 for each group H whether the group is marked as an overlapping group, i.e., the group is not to be used.
- groups are marked as overlapping groups in the determine overlapping groups process 235, as described with reference to Figure 20, and/or in the combine groups process 240, as described with reference to Figures 21A-21B.
- the optimally link compounds in group process 245 loops 900 to the next group H. If, however, the current group H is not 904 marked as an overlapping group, the optimally link compounds in group process 245 loops through all the compounds in group H and locates the two unique compounds with the largest correlation value.
- exemplary group H comprises compounds C 926, G 927, J 928 and K 929.
- in exemplary correlation value array DATA 925, the largest correlation value for a unique compound pair for the compounds of exemplary group H is the correlation value 930 for compounds G 927 and J 928, which is equal to 0.9; i.e., DATA[6][9] is equal to 0.9.
- the other correlation values for the other unique compound pairs in exemplary group H are less than 0.9; i.e., DATA[2][6], for the compound C 926 - compound G 927 compound pair, is 0.6; DATA[2][9], for the compound C 926 - compound J 928 compound pair, is equal to 0.5; DATA[2][10], for the compound C 926 - compound K 929 compound pair, is equal to 0.5; DATA[6][10], for the compound G 927 - compound K 929 compound pair, is equal to 0.6; and, DATA[9][10], for the compound J 928 - compound K 929 compound pair, is equal to 0.8.
- the optimally link compounds in group process 245 sets 905 a first variable, e.g., MAX1, to the compound row of the largest correlation value.
- MAX1 is set to six, which is the row compound index of the DATA array 925 corresponding to compound G 927.
- the optimally link compounds in group process 245 also sets 905 a second variable, e.g., MAX2, to the compound column of the largest correlation value.
- MAX2 is set to nine, which is the column compound index of the DATA array 925 corresponding to compound J 928.
- the optimally link compounds in group process 245 also flags 906 the compound equal to MAX1 and compound equal to MAX2 as already linked for group H.
- compound G 927 and compound J 928 of group H are flagged as linked for group H.
- the optimally link compounds in group process 245 also sets 906 the MAX1 compound as the current head of the link of compounds for group H, and the MAX2 compound as the current tail of the link of compounds for group H.
- compound G 927 is set as the current head of the link
- compound J 928 is set as the current tail of the link.
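- Locating the initial link can be sketched with the correlation values quoted for exemplary group H, using the DATA array indices from the example (C is 2, G is 6, J is 9, K is 10). The function name and the way the array is built are assumptions for illustration only.

```python
from itertools import combinations


def strongest_pair(data, members):
    """Return the (row, column) indices of the largest correlation among members."""
    best = None
    for a, b in combinations(sorted(members), 2):
        if best is None or data[a][b] > data[best[0]][best[1]]:
            best = (a, b)
    return best


# Correlation values quoted in the example for compounds C, G, J and K.
values = {(2, 6): 0.6, (2, 9): 0.5, (2, 10): 0.5,
          (6, 9): 0.9, (6, 10): 0.6, (9, 10): 0.8}
data = [[0.0] * 11 for _ in range(11)]
for (a, b), v in values.items():
    data[a][b] = data[b][a] = v

max1, max2 = strongest_pair(data, members=[2, 6, 9, 10])
chain = [max1, max2]  # head of the link, tail of the link
```

- The search returns MAX1 = 6 and MAX2 = 9, i.e., compound G as the current head and compound J as the current tail.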
- the optimally link compounds in group process 245 then checks 907 whether all the compounds in the current group H are flagged as linked for group H. If yes 908, the optimally link compounds in group process 245 loops 900 to the next group H.
- the optimally link compounds in group process 245 loops I 910, once for each compound in group H that is not already linked, and locates the largest correlation value of the MAX1 - non-linked compounds in group H pairs.
- the largest correlation value for a MAX1 - non-linked compound in group H pair is set to a variable, e.g., MAXCORR.
- as the optimally link compounds in group process 245 loops 910 through all compounds in group H that are not already linked, it also locates the largest correlation value of the MAX2 - non-linked compounds in group H pairs.
- the largest correlation value for a MAX2 - non-linked compound in group H pair is set to a variable, e.g., MINCORR.
- MAX1 is set to the DATA array index for compound G 927 and MAX2 is set to the DATA array index for compound J 928.
- Compounds C 926 and K 929 of group H are not linked for the group yet.
- the optimally link compounds in group process 245 finds the largest correlation value between the MAX1 - compound C 926 and MAX1 - compound K 929 compound pairs, and stores it in MAXCORR.
- the optimally link compounds in group process 245 checks the correlation value DATA[6][2], i.e., the correlation value for the MAX1 compound G 927 - compound C 926 pair, and the correlation value DATA[6][10], i.e., the correlation value for MAX1 compound G 927 - compound K 929 pair.
- the optimally link compounds in group process 245 sets MAXCORR to the correlation value for the first MAX1 - non-linked compound pair.
- MAXCORR is set to DATA[6][2], i.e., 0.6, and the non-linked compound associated with MAXCORR is compound C 926.
- the correlation value DATA[6][6], i.e., the correlation value for the MAX1 compound G 927 - MAX1 compound G 927 pair, and the correlation value DATA[6][9], i.e., the correlation value for the MAX1 compound G 927 - MAX2 compound J 928 pair, are not checked as both compounds G 927 and J 928 are already linked for group H.
- MAX1 is set to the DATA array index for compound G 927 and MAX2 is set to the DATA array index for compound J 928.
- Compounds C 926 and K 929 of group H are not linked for the group yet.
- the optimally link compounds in group process 245 also finds the largest correlation value between the MAX2 - compound C 926 and MAX2 - compound K 929 compound pairs, and stores it in MINCORR.
- the optimally link compounds in group process 245 checks the correlation value DATA[9][2], i.e., the correlation value for the MAX2 compound J 928 - compound C 926 pair, and the correlation value DATA[9][10], i.e., the correlation value for the MAX2 compound J 928 - compound K 929 pair.
- as DATA[9][10], i.e., 0.8, is greater than DATA[9][2], i.e., 0.5, MINCORR is set to DATA[9][10] and the non-linked compound associated with MINCORR is compound K 929.
- the correlation value DATA[9][6], i.e., the correlation value for the MAX2 compound J 928 - MAX1 compound G 927 pair, and the correlation value DATA[9][9], i.e., the correlation value for the MAX2 compound J 928 - MAX2 compound J 928 pair, are not checked as both compounds G 927 and J 928 are already linked for group H.
- the optimally link compounds in group process 245 locates 910 the maximum correlation value, e.g., MAXCORR, for the MAX1 compound - non-linked compound pairs.
- the optimally link compounds in group process 245 also locates 910 the maximum correlation value, e.g., MINCORR, for the MAX2 compound - non-linked compound pairs. Once both MAXCORR and MINCORR are located for the respective MAX1 and MAX2 compound rows of the correlation value array, e.g., DATA, the optimally link compounds in group process 245 checks 911 whether MAXCORR is greater than MINCORR. If yes 914, the non-linked compound in group H associated with MAXCORR has a stronger correlation, or similarity, with the current head of the link of group H compounds than the non-linked compound in group H associated with MINCORR has with the current tail of the link of group H compounds.
- the optimally link compounds in group H process 245 links 912 the non-linked compound associated with MAXCORR as the new head of the link of group H compounds.
- the optimally link compounds in group process 245 also flags 916 the new link head compound as linked for group H.
- the variable MAX1 is also set 917 to the DATA array index for the new link head compound.
- the optimally link compounds in group process 245 then checks 907 whether all the compounds in group H are now flagged as linked for group H.
- the non-linked compound corresponding to the correlation value MAXCORR is compound C 926. If MAXCORR had been greater than MINCORR, compound C 926 would be linked to the preceding link head for group H, compound G 927. Compound C 926 would then be the new link head for group H and MAX1 would be set to the DATA array index for compound C 926. Further, compound C 926 would be flagged as linked for group H. However, in exemplary DATA array 925 of Figure 24, MAXCORR, i.e., 0.6, is not greater than MINCORR, i.e., 0.8.
- the optimally link compounds in group H process 245 links 913 the non-linked compound associated with MINCORR as the new tail of the link of group H compounds.
- the optimally link compounds in group process 245 also flags 918 the new link tail compound as linked for group H.
- the variable MAX2 is also set 919 to the DATA array index for the new link tail compound.
- the optimally link compounds in group process 245 then checks 907 whether all the compounds in group H are now flagged as linked for group H.
- the non-linked compound corresponding to the correlation value MINCORR is compound K 929. As MAXCORR is not greater than MINCORR, compound K 929 is linked to the current link tail for group H, compound J 928.
- Compound K 929 is the new link tail for group H and MAX2 is set to the DATA array index for compound K 929. Further, compound K 929 is flagged as linked for group H.
- the link for exemplary group H is shown in Equation 9.
- compound G - compound J - compound K (Equation 9)
- In Equation 9, compound G 927 is the head of the link and compound K 929 is the tail of the link for group H.
- the optimally link compounds in group process 245 locates 910 the maximum correlation value, e.g., MAXCORR, for the MAX1 compound - non-linked compounds in group H pairs.
- the optimally link compounds in group process 245 also locates 910 the maximum correlation value, e.g., MINCORR, for the MAX2 compound - non-linked compounds in group H pairs.
- the current MAX1 compound is compound G 927 and the current MAX2 compound is compound K 929.
- the only non-linked compound remaining in group H is compound C 926.
- the optimally link compounds in group process 245 checks 910 the correlation value DATA[6][2], i.e., the correlation value for the MAX1 compound G 927 -compound C 926 pair.
- as the correlation value DATA[6][2] is 0.6, and there are no other MAX1 compound - non-linked compound correlation values to check, as there are no other compounds in group H to link, MAXCORR is set to 0.6.
- the optimally link compounds in group process 245 also checks 910 the correlation value DATA[10][2], i.e., the correlation value for the MAX2 compound K 929 - compound C 926 pair. As the correlation value DATA[10][2] is 0.5, and there are no other MAX2 compound - non-linked compound correlation values to check, as there are no other compounds in group H to link, MINCORR is set to 0.5
- the optimally link compounds in group process 245 checks 911 whether MAXCORR is greater than MINCORR. It is 914, as MAXCORR, i.e., 0.6, is greater than MINCORR, i.e., 0.5.
- the optimally link compounds in group process 245 links 912 the non-linked compound corresponding to MAXCORR, i.e., compound C 926, to the preceding link head for group H, compound G 927.
- Compound C 926 is the new link head for group H and compound C 926 is flagged 916 as linked for group H.
- MAX1 is set 917 to the DATA array index for compound C 926.
- compound C - compound G - compound J - compound K (Equation 10)
- In Equation 10, compound C 926 is the head of the link and compound K 929 is the tail of the link for group H.
- the optimally link compounds in group process 245 loops 900 to the next group H. As previously described, when all groups H have been looped 900 through, the optimally link compounds in group process 245 is ended 901.
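The head/tail linking walked through above for group H can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `link_chain`, the `CORR` table and the helper `corr` are hypothetical, the chain is assumed to be seeded with the most strongly correlated pair, and the two correlation values marked "assumed" are not stated in the text.

```python
# Pairwise correlation values for the compounds of exemplary group H.
# Values marked "assumed" do not appear in the text and are illustrative only.
CORR = {
    frozenset("GJ"): 0.9,  # assumed: the seed pair must have the largest value
    frozenset("GC"): 0.6,  # DATA[6][2]
    frozenset("GK"): 0.4,  # assumed: DATA[6][10] must not exceed DATA[6][2]
    frozenset("JC"): 0.5,  # DATA[9][2]
    frozenset("JK"): 0.8,  # DATA[9][10]
    frozenset("KC"): 0.5,  # DATA[10][2]
}


def corr(a, b):
    """Symmetric lookup of the correlation value for a compound pair."""
    return CORR[frozenset((a, b))]


def link_chain(members, corr):
    """Greedily order `members` into a chain by growing its head or its tail."""
    # Seed the chain with the most strongly correlated pair (MAX1 = head, MAX2 = tail).
    pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    head, tail = max(pairs, key=lambda p: corr(*p))
    chain = [head, tail]
    unlinked = [m for m in members if m not in chain]
    while unlinked:
        best_head = max(unlinked, key=lambda m: corr(chain[0], m))
        best_tail = max(unlinked, key=lambda m: corr(chain[-1], m))
        maxcorr = corr(chain[0], best_head)   # MAXCORR: best match for the head
        mincorr = corr(chain[-1], best_tail)  # MINCORR: best match for the tail
        if maxcorr > mincorr:
            chain.insert(0, best_head)        # new link head
            unlinked.remove(best_head)
        else:
            chain.append(best_tail)           # new link tail
            unlinked.remove(best_tail)
    return chain


chain = link_chain(["C", "G", "J", "K"], corr)
print(" - ".join(chain))
```

With the values above, compound K is appended at the tail first (0.6 is not greater than 0.8) and compound C then becomes the new head (0.6 is greater than 0.5), matching the order described for Equations 9 and 10.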
- an optimally link groups process 250 is executed.
- the optimally link groups process 250 optimally orders the groups of compounds, based on the group similarity scores.
- the optimally link groups process 250 optimally orders the groups of compounds based on the mean cumulative distance values of the respective pairs of groups.
- an average correlation value, or mean cumulative distance value e.g., AVG[group1][group2] is generated for each group pair, i.e., each two groups of the groups formed of the objects to be clustered, in the combine groups process 240.
- the mean cumulative distance value for a pair of groups serves as a similarity measure of the objects in the two groups; the higher the mean cumulative distance value for a pair of groups, the more similar the objects in the two groups generally are.
- the combine groups process 240 only generates mean cumulative distance values for the top half 1052 of the array of average correlation values for all group pairs.
- the optimally link groups process 250 first writes, or otherwise copies or stores, 1041 the mirror image of the top half of the average correlation value array, e.g., AVG, to the bottom half of the array.
- the optimally link groups process 250 writes 1041 all the values in the top half 1052 of the AVG array 1050 to the respective entries in the bottom half 1051 of the AVG array 1050.
- the average correlation value in AVG[0][1] is written to AVG[1][0].
- the average correlation value in AVG[0][2] is written to AVG[2][0];
- the average correlation value in AVG[0][3] is written to AVG[3][0]; and so on.
- the diagonal 1053 of the AVE array 1050 is generally not relevant, as it represents the average correlation value for one group, and not for a pair of groups.
- the combine groups process 240 after generating the top half of the average correlation value array, e.g., AVG, writes, or otherwise copies, the mirror image of the top half of the AVG array, to the bottom half of the AVG array.
- the combine groups process 240 after generating a mean cumulative distance value for a pair of groups, writes the value to both the top half and the bottom half of the AVG array.
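Copying the top half of the average correlation value array to the bottom half amounts to mirroring an upper triangle across the diagonal. A minimal sketch, assuming the array is held as a square list of lists; the function name `mirror_upper_triangle` and the sample values are hypothetical:

```python
def mirror_upper_triangle(avg):
    """Copy each AVG[i][j] with i < j into AVG[j][i]; the diagonal is left untouched."""
    n = len(avg)
    for i in range(n):
        for j in range(i + 1, n):
            avg[j][i] = avg[i][j]
    return avg


# Only the top half is filled in; None marks entries not yet generated.
avg = [
    [None, 0.7, 0.4],
    [None, None, 0.8],
    [None, None, None],
]
mirror_upper_triangle(avg)
print(avg[1][0], avg[2][0], avg[2][1])
```

After mirroring, a lookup such as AVG[tail][J] works regardless of which of the two group indices is larger.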
- after the optimally link groups process 250 copies, or otherwise writes or stores, 1041 the top half of the array of average correlation values to the bottom half, it locates 1025, among all groups that have not been previously labeled as overlapping, the two groups, e.g., group1 and group2, with the largest mean cumulative distance value.
- overlapping groups are identified and labeled in the determine overlapping groups process 235, as described with reference to Figure 20, and/or in the combine groups process 240, as described with reference to Figures 21A-21B. Groups marked as overlapping are not to be further processed.
- the optimally link groups process 250 sets 1026 the head of the link of groups to group1 and the tail of the link of groups to group2.
- the optimally link groups process 250 also flags 1026 group1 and group2 as linked groups.
- the optimally link groups process 250 then loops 1027 through all groups that are not already linked and are not flagged as overlapping. For each loop 1027, the optimally link groups process 250 locates 1028 the non-linked group I, that with the current head of the link group, or simply head group, has the largest mean cumulative distance value. In a presently preferred embodiment, the optimally link groups process 250 sets a variable, e.g., MAXCORR, to this largest mean cumulative distance value, e.g., AVE[head][I].
- for each loop 1027, the optimally link groups process 250 also locates 1029 the non-linked group J, that with the current tail of the link group, or simply tail group, has the largest mean cumulative distance value. In a presently preferred embodiment, the optimally link groups process 250 sets a variable, e.g., MINCORR, to this largest mean cumulative distance value, e.g., AVE[tail][J].
- the optimally link groups process 250 after setting MAXCORR and MINCORR, checks 1030 whether MAXCORR is greater than MINCORR. If yes 1033, the group I, that with the head group has the largest average correlation value, is more similar to the head group than the group J, that with the tail group has the largest average correlation value, is similar to the tail group.
- the optimally link groups process 250 sets 1031 the head of the link of groups to group I, i.e., the current head group is set equal to group I. Group I is the new current head group, and it is linked to the previous head group, group1.
- the optimally link groups process 250 also flags 1031 group I as a linked group.
- if the optimally link groups process 250 checks 1030 whether MAXCORR is greater than MINCORR and it is not 1034, the group J, that with the tail group has the largest average correlation value, is more similar to the tail group than the group I, that with the head group has the largest average correlation value, is similar to the head group.
- the optimally link groups process 250 sets 1032 the tail of the link of groups to group J, i.e., the current tail group is set equal to group J.
- Group J is the new current tail group, and the previous tail group, group2, is linked to group J.
- the optimally link groups process 250 also flags 1032 group J as a linked group.
- the optimally link groups process 250 then checks 1035 whether all non-overlapping groups are linked. If yes 1036, the optimally link groups process 250 optimally links 1040 the compounds in all the linked groups, as further described below. The optimally link groups process 250 is then ended 1038. If, however, there are more groups to link 1037, the optimally link groups process 250 loops 1027 again through all the non-overlapping groups that are not already linked.
- Exemplary AVE array 1050 of average correlation values, or mean cumulative distance values, for groups, as shown in Figure 26, stores the mean cumulative distance values for five groups of compounds.
- the bottom half 1051 of the AVE array 1050 is the mirror image of the top half 1052 of the array 1050.
- the diagonal 1053 of the AVE array 1050 is generally not relevant, as it represents the average correlation value for one group, and not a group pair. In the present example, none of the five groups represented in the AVE array 1050 are flagged as overlapping, i.e., non-usable, groups at this time.
- the group A - group D pair has the largest mean cumulative distance value in AVE array 1050, i.e., AVE[0][3] is 0.9.
- the optimally link groups process 250 sets 1026 group A as the head of the link and group D as the tail of the link.
- Group A and group D are also flagged 1026 as linked.
- the optimally link groups process 250 then loops 1027 through all non-overlapping groups that are not already linked, i.e., groups B, C and E.
- the optimally link groups process 250 locates 1028 the non-linked group that with the current head group has the largest mean cumulative distance value.
- the non-linked group B with the head group A has the largest mean cumulative distance value, i.e., AVE[0][1] equals 0.7.
- the optimally link groups process 250 sets 1028 the variable MAXCORR equal to AVE[0][1], i.e., 0.7.
- the optimally link groups process 250 also locates 1029 the group that with the current tail group has the largest mean cumulative distance value.
- the non-linked group C with the tail group D has the largest mean cumulative distance value, i.e., AVE[3][2] equals 0.6.
- the optimally link groups process 250 sets 1029 the variable MINCORR equal to AVE[3][2], i.e., 0.6.
- the optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is 1033, as MAXCORR is now equal to 0.7 and MINCORR is equal to 0.6, so the optimally link groups process 250 links 1031 the non-linked group B associated with MAXCORR as the new head group.
- the original head group, i.e., group A, is linked to the new head group B. At this time, the link of groups is as shown in Equation 12.
- group B - group A - group D (Equation 12)
- In Equation 12, group B is the head of the link and group D remains the tail of the link.
- Group B, the new head group, is also flagged 1031 as linked.
- the optimally link groups process 250 then checks 1035 whether all groups have been linked. They have not 1037; groups C and E remain to be linked.
- the optimally link groups process 250 once again loops 1027 through all non-overlapping, non-linked groups, i.e., groups C and E.
- the optimally link groups process 250 locates 1028 the group that with the current head group, i.e., group B, has the largest mean cumulative distance value.
- the non-linked group C with the head group B has the largest mean cumulative distance value, i.e., AVE[1][2] equals 0.8.
- the optimally link groups process 250 sets 1028 the variable MAXCORR to 0.8.
- the optimally link groups process 250 also locates 1029 the group that with the current tail group, i.e., group D, has the largest mean cumulative distance value.
- the non-linked group C with the tail group D has the largest mean cumulative distance value, i.e., AVE[3][2] equals 0.6.
- the optimally link groups process 250 sets 1029 the variable MINCORR to 0.6.
- the optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is 1033, as MAXCORR is now equal to 0.8 and MINCORR is equal to 0.6, so the optimally link groups process 250 links 1031 the non-linked group C associated with MAXCORR as the new head group.
- the original head group, i.e., group B, is linked to the new head group C. At this time, the link of groups is as shown in Equation 13.
- group C - group B - group A - group D (Equation 13)
- In Equation 13, group C is the head of the link and group D remains the tail of the link.
- Group C is also flagged 1031 as linked.
- the optimally link groups process 250 then checks 1035 whether all groups have been linked. They have not 1037; group E still remains to be linked. Thus, the optimally link groups process 250 once again loops 1027 through all non-linked, non-overlapping groups, i.e., group E.
- variable MAXCORR is set to the average correlation value for the head group C - group E pair, i.e., AVE[2][4] equal to 0.5.
- variable MINCORR is set to the average correlation value for the tail group D - group E pair, i.e., AVE[3][4] equal to 0.5.
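- the optimally link groups process 250 then checks 1030 whether MAXCORR is greater than MINCORR. It is not 1034, as MAXCORR and MINCORR are both equal to 0.5, so group E is linked 1032 as the new tail group.
- Group E is also flagged 1032 as linked.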
- the optimally link groups process 250 then checks 1035 whether all groups have been linked. They are 1036, so the optimally link groups process 250 optimally links 1040 the compounds in all the linked groups, as further described below. The optimally link groups process 250 then ends 1038.
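The group-level walk-through above can be reproduced with a short Python sketch. It is a hedged illustration rather than the actual process 250: the names `build_ave` and `link_groups` are hypothetical, and the four upper-triangle values marked "assumed" are not given in the text (they are chosen only so that they do not contradict the values that are given).

```python
GROUPS = ["A", "B", "C", "D", "E"]

# Top half of the exemplary AVE array. Values marked "assumed" are illustrative.
UPPER = {
    (0, 1): 0.7,  # AVE[0][1]
    (0, 2): 0.4,  # assumed
    (0, 3): 0.9,  # AVE[0][3]
    (0, 4): 0.3,  # assumed
    (1, 2): 0.8,  # AVE[1][2]
    (1, 3): 0.5,  # assumed
    (1, 4): 0.4,  # assumed
    (2, 3): 0.6,  # mirrored as AVE[3][2]
    (2, 4): 0.5,  # AVE[2][4]
    (3, 4): 0.5,  # AVE[3][4]
}


def build_ave(upper, n):
    """Build a full symmetric array from the top half (diagonal left at 0.0)."""
    ave = [[0.0] * n for _ in range(n)]
    for (i, j), value in upper.items():
        ave[i][j] = ave[j][i] = value
    return ave


def link_groups(ave, overlapping=frozenset()):
    """Order the non-overlapping groups by repeatedly extending the head or the tail."""
    usable = [g for g in range(len(ave)) if g not in overlapping]
    # Seed with the pair of groups having the largest mean cumulative distance value.
    head, tail = max(
        ((i, j) for i in usable for j in usable if i < j),
        key=lambda pair: ave[pair[0]][pair[1]],
    )
    link = [head, tail]
    unlinked = [g for g in usable if g not in link]
    while unlinked:
        i = max(unlinked, key=lambda g: ave[link[0]][g])   # best match for the head
        j = max(unlinked, key=lambda g: ave[link[-1]][g])  # best match for the tail
        maxcorr, mincorr = ave[link[0]][i], ave[link[-1]][j]
        if maxcorr > mincorr:
            link.insert(0, i)   # group I becomes the new head group
            unlinked.remove(i)
        else:
            link.append(j)      # group J becomes the new tail group
            unlinked.remove(j)
    return link


order = [GROUPS[g] for g in link_groups(build_ave(UPPER, len(GROUPS)))]
print(" - ".join(order))
```

Under these assumptions the sketch links group B and then group C at the head, and group E at the tail, because a tie between MAXCORR and MINCORR falls through to the tail branch.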
- an optimally link groups process 2020 first writes 1079, or otherwise copies, the mirror image of the top half of the average correlation, or mean cumulative distance, value array, e.g., AVG, to the bottom half of the array.
- the optimally link groups process 2020 then loops 1080 through all non-overlapping compound groups, i.e., all groups that are not marked as overlapping, and locates the two groups with the largest average correlation value, or mean cumulative distance value.
- the optimally link groups process 2020 sets 1080 a variable, e.g., TOK1, to group I and sets 1080 a second variable, e.g., TOK2, to group J.
- the TOK1 group is also set 1080 as the original head of the link of groups and the TOK2 group is set 1080 as the original tail of the link of groups.
- the optimally link groups process 2020 also flags 1081 the TOK1 group and the TOK2 group as linked.
- the optimally link groups process 2020 then loops 1082 through all non-overlapping, non-linked groups I.
- the optimally link groups process 2020 checks 1083 whether all groups have been linked. If no 1086, the optimally link groups process 2020 locates 1085 the TOK1 head group - non-linked group I pair with the largest mean cumulative distance value, e.g., the largest value of AVE[TOK1][I].
- the optimally link groups process 2020 sets 1085 a variable, e.g., MAXCORR, to the largest value of AVE[TOK1][I].
- the optimally link groups process 2020 uses the first I group as the group corresponding to MAXCORR. For example, if the mean cumulative distance value for the TOK1 - group four pair is 0.9 and the mean cumulative distance value for the TOK1 - group seven pair is also 0.9, and 0.9 is the largest mean cumulative distance value for all TOK1 group - non-linked group I pairs, then the optimally link groups process 2020 sets MAXCORR equal to 0.9 and associates the non-linked fourth group with MAXCORR.
- the optimally link groups process 2020 also locates 1087 the TOK2 tail group - non-linked group I pair with the largest mean cumulative distance value, e.g., the largest value of AVE[TOK2][I].
- the optimally link groups process 2020 sets 1087 a variable, e.g., MINCORR, to the largest value of AVE[TOK2][I].
- the optimally link groups process 2020 uses, or otherwise associates, the first I group as the group corresponding to MINCORR. Once the optimally link groups process 2020 locates a current MAXCORR and a current MINCORR, it checks 1090 whether MAXCORR is greater than MINCORR. If yes 1088, the optimally link groups process 2020 sets 1091 the non-linked group I associated with MAXCORR as the new head group. The previous head group, TOK1, is linked to the new head group I. The new head group I is also flagged 1093 as linked. The variable TOK1 is also set 1095 to the new head group I. The optimally link groups process 2020 then loops 1082 once again through all the non-overlapping, non-linked groups.
- the optimally link groups process 2020 sets 1092 the non-linked group I associated with MINCORR as the new tail group.
- the previous tail group, TOK2, is linked to the new tail group I.
- the new tail group I is also flagged 1094 as linked.
- the variable TOK2 is also set 1096 to the new tail group I.
- the optimally link groups process 2020 then loops 1082 once again through all the non-overlapping, non-linked groups.
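The tie-breaking rule described above, in which the first group I with the largest value is associated with MAXCORR or MINCORR, can be illustrated with a small Python sketch. The helper name `best_partner` and the nested-dictionary layout of `ave` are hypothetical; the sketch relies on the documented behavior of Python's built-in `max`, which returns the first maximal item it encounters.

```python
def best_partner(anchor, unlinked, ave):
    """Return (group, value) for the non-linked group most similar to `anchor`.

    `unlinked` is scanned in order, and max() keeps the first maximal item,
    so the earliest group wins when several groups share the largest value.
    """
    group = max(unlinked, key=lambda g: ave[anchor][g])
    return group, ave[anchor][group]


# Hypothetical values: the TOK1 group (index 0) ties at 0.9 with groups four and seven.
ave = {0: {4: 0.9, 5: 0.3, 7: 0.9}}
group, maxcorr = best_partner(0, [4, 5, 7], ave)
print(group, maxcorr)
```

Here MAXCORR is 0.9 and the fourth group, not the seventh, is associated with it, as in the example given in the text.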
- the optimally link groups process 2020 loops H 1097 times, once for each group in the link of groups.
- the optimally link groups process 2020 loops 1097 from the head group to the tail group of the link of groups. Once all the groups H have been looped 1097 through, the optimally link groups process 2020 is ended 2021.
- the optimally link groups process 2020 first checks 1098 whether the current group H is the head group of the link of groups. If it is 2000, the optimally link groups process 2020 sets 2001 a first variable, e.g., A, to the correlation value of the first compound in the head group and the first compound in the second group in the link of groups. With DATA as the array of correlation values for all compound pairs to be clustered, A is set to DATA[1st compound in Head group][1st compound in 2nd group]. The optimally link groups process 2020 also sets 2001 a second variable, e.g., B, to the correlation value of the first compound in the head group and the last compound in the second group in the link of groups. Thus, B is set to DATA[1st compound in Head group][Last compound in 2nd group].
- the optimally link groups process 2020 also sets 2001 a third variable, e.g., C, to the correlation value of the last compound in the head group and the first compound in the second group in the link of groups.
- C is set to DATA[Last compound in Head group][1st compound in 2nd group].
- the optimally link groups process 2020 also sets 2001 a fourth variable, e.g., D, to the correlation value of the last compound in the head group and the last compound in the second group in the link of groups.
- D is set to DATA[Last compound in Head group][Last compound in 2nd group].
- the optimally link groups process 2020 then checks 2002 whether C or D is greater than or equal to A and B. Thus, the optimally link groups process 2020 checks 2002 whether either of the correlation values of the head group tail compound - second linked group head and tail compound pairs are greater than or equal to both the correlation values of the head group head compound - second group head and tail compound pairs. If the head group tail compound generates the same or larger correlation value 2003, the compounds in the head group are stored in a table, or list, e.g., LIST, of compounds, in head to tail order. This is because the head group tail compound is more similar to, and, thus, should be linked closest to, the compounds in the second group.
- otherwise, the compounds in the head group are stored in LIST in tail to head order. This is because the head group head compound is more similar to, and, thus, should be linked closest to, the compounds in the second group.
- the optimally link groups process 2020 loops 2005 through all the compounds in the head group, storing 2007 the compounds in LIST, in head to tail order. Once all the compounds in the head group are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
- the optimally link groups process 2020 loops 2006 through all the compounds in the head group, storing 2008 the compounds in LIST, in tail to head order. Once all the compounds in the head group are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
- for each subsequent group H, the optimally link groups process 2020 checks 2009 whether the correlation value of the last compound in LIST, i.e., the last compound in the previous group H stored in LIST, and the head compound in the current group H is greater than or equal to the correlation value of the last compound in LIST and the tail compound in the current group H.
- the optimally link groups process 2020 checks whether DATA[Last compound in List][Head compound in current group H] is greater than or equal to DATA[Last compound in List][Tail compound in current group H].
- if the head compound in the current group H has the same or larger correlation value 2011, the compounds in group H are stored in LIST in head to tail order. This is because the head compound in group H is more similar to, and, thus, should be linked closest to, the last compound stored in LIST.
- the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
- the optimally link groups process 2020 loops 2012 through all the compounds in current group H, storing 2014 the compounds in LIST in tail to head order. Once all the compounds in the current group H are stored in LIST, the optimally link groups process 2020 loops 1097 to the next group in the link of groups.
- the optimally link groups process 2020 ends 2021.
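The orientation rules above, which decide whether each group's chain of compounds is stored in LIST in head to tail or tail to head order, can be sketched as follows. This is an illustrative sketch under stated assumptions: `flatten_groups`, `CORR` and `corr` are hypothetical names, the correlation values are invented for the example, and the handling of a link containing a single group is an assumption because the text does not cover that case.

```python
def flatten_groups(groups, corr):
    """Concatenate the compound chains of linked groups into one ordered list.

    `groups` holds each group's chain (head compound first), in link order
    from the head group to the tail group.
    """
    ordered = []
    for position, chain in enumerate(groups):
        if position == 0:
            if len(groups) == 1:
                ordered.extend(chain)  # assumption: a lone group keeps its order
                continue
            second = groups[1]
            a = corr(chain[0], second[0])    # head compound vs. 2nd group head
            b = corr(chain[0], second[-1])   # head compound vs. 2nd group tail
            c = corr(chain[-1], second[0])   # tail compound vs. 2nd group head
            d = corr(chain[-1], second[-1])  # tail compound vs. 2nd group tail
            tail_is_closer = (c >= a and c >= b) or (d >= a and d >= b)
            ordered.extend(chain if tail_is_closer else reversed(chain))
        else:
            last = ordered[-1]
            head_is_closer = corr(last, chain[0]) >= corr(last, chain[-1])
            ordered.extend(chain if head_is_closer else reversed(chain))
    return ordered


# Hypothetical correlation values for two groups of two compounds each.
CORR = {
    frozenset("ac"): 0.9, frozenset("ad"): 0.2,
    frozenset("bc"): 0.3, frozenset("bd"): 0.1,
}


def corr(x, y):
    return CORR[frozenset((x, y))]


result = flatten_groups([["a", "b"], ["c", "d"]], corr)
print(result)
```

In this example the head compound "a" of the first group is the one most similar to the second group, so the first group is stored tail to head, placing "a" next to "c".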
- the pattern recognition oriented cluster process 200 executes a generate pattern matrix process 255.
- the generate pattern matrix process 255 generates and outputs, e.g., to an output file or a computer terminal screen, a pattern matrix of the grouped compounds of the set of compounds to be clustered.
- the generate pattern matrix process 255 creates a two-dimensional color-shaded cluster graph representative of the correlation values of the clustered, i.e., grouped, compounds.
- in the generate pattern matrix process 255, as shown in Figure 29, the titles "class", "compound" and "Code" are printed 2025 in a first output file text row, or line.
- the generate pattern matrix process 255 then loops X 2026 times, once for each of the set of compounds to be clustered.
- the compounds X are looped through from the first compound stored in the table LIST to the last compound stored in the table LIST.
- compounds are stored in the table LIST in optimal order in the optimally link groups process 2020.
- the generate pattern matrix process 255 prints 2027 the respective compound code value in a unique column in the first output file text row, after the "Code" title.
- the generate pattern matrix process 255 also prints 2028 the respective compound class description, compound name and compound code value for compound X, under the respective titles.
- One compound class description, name and code value is printed per output file text row, or line, beginning with a second output file text line.
- the respective compound class description for compound X was previously stored in table DESC[X] when the input file for the set of compounds to be clustered was processed by the input data process 350.
- the respective compound name for compound X was previously stored in table NAME[X] when the input file for the set of compounds to be clustered was processed by the input data process 350.
- the respective compound code for compound X was previously stored in table CODE[X] when the input file for the set of compounds to be clustered was processed by the input data process 350.
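The layout of the first output rows described above can be sketched in Python. This is a rough sketch only: the function name `header_and_rows`, the tab separator and the sample table contents are assumptions, while the titles and the tables DESC, NAME, CODE and LIST come from the description.

```python
def header_and_rows(order, desc, name, code, sep="\t"):
    """Build the title row (titles plus one code per column) and one row per compound."""
    header = sep.join(["class", "compound", "Code"] + [code[x] for x in order])
    rows = [sep.join([desc[x], name[x], code[x]]) for x in order]
    return [header] + rows


# Hypothetical table contents for two compounds, with LIST giving the optimal order.
DESC = {0: "kinase inhibitor", 1: "antibiotic"}
NAME = {0: "compound one", 1: "compound two"}
CODE = {0: "C1", 1: "C2"}
LIST = [1, 0]

lines = header_and_rows(LIST, DESC, NAME, CODE)
print("\n".join(lines))
```

Because both the column headings and the rows follow the order stored in LIST, each later matrix cell lines up with its row compound and column compound.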
- the generate pattern matrix process 255 only outputs the title "NAME" to an output file, in a first text line.
- the generate pattern matrix process 255 for each compound in the set of compounds to be clustered, prints only the respective compound name to the output file, one name per output file text line, or row, beginning with a second output file text line. Each compound name is correspondingly printed in a respective column in a first output text line, or row, of the output file.
- the generate pattern matrix process 255 prints the titles "mode", "structure", "compound" and "code" in a first text line, or row, in an output file.
- the generate pattern matrix process 255 prints the respective compound code values for each compound in the set of compounds to be clustered, one per column, in the first text line in the output file.
- the generate pattern matrix process 255 also prints the respective compound mode, if there is one, compound structure, if there is one, and compound name for each compound in the set of compounds to be clustered, under the appropriate titles.
- One compound mode, structure and name information for one compound is printed per text line, or row, of the output file, beginning with a second text line.
- Exemplary cluster graph 2055, shown in Figure 30, is an example of a cluster graph generated in this alternative embodiment.
- the generate pattern matrix process 255 loops I times 2029, once for each compound in the set of compounds to be clustered.
- the generate pattern matrix process 255 then loops J times 2030, once for each compound in the set of compounds to be clustered.
- the generate pattern matrix process 255 loops J 2030 times equal to the number of compounds in the set of compounds to be clustered.
- the generate pattern matrix process 255 determines 2031 which color group, e.g., COLOR[X], the correlation value for the compound I - compound J pair, e.g., DATA[I][J], is within.
- COLOR[X] and their respective correlation value ranges were previously established during the set grouping parameters process 220.
- a presently preferred embodiment default number of color groups is six and a presently preferred embodiment default correlation value range for each color group is shown in Table 2.
- once the generate pattern matrix process 255 determines which COLOR[X] group the DATA[I][J] correlation value is within, it prints 2032 to an output file a block of color corresponding to the respective COLOR[X] group.
- the generate pattern matrix process 255 also prints 2032 to an output file, in the respective block of color, the value of COLOR[X], i.e., the lower value of the range of correlation values for respective COLOR[X].
- the block of color corresponding to COLOR[X] and the respective value of COLOR[X] are printed in the row compound I - column compound J of the output file.
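Mapping each correlation value to its color group can be sketched as a lookup against the lower bounds of the ranges. This is a sketch under assumptions: the six lower bounds below are hypothetical stand-ins (the default ranges of Table 2 are not reproduced here), and the names `color_group` and `pattern_matrix` are illustrative.

```python
from bisect import bisect_right

# Hypothetical lower bounds for six color groups, in ascending order.
COLOR = [-1.0, 0.0, 0.2, 0.4, 0.6, 0.8]


def color_group(value, lower_bounds=COLOR):
    """Return the index X of the color group whose range contains `value`."""
    # Values below the first lower bound are clamped into group 0.
    return max(bisect_right(lower_bounds, value) - 1, 0)


def pattern_matrix(order, data, lower_bounds=COLOR):
    """For each row compound I and column compound J, give the lower bound of its color group."""
    return [
        [lower_bounds[color_group(data[i][j], lower_bounds)] for j in order]
        for i in order
    ]


# Hypothetical correlation values for two compounds, displayed in LIST order [1, 0].
DATA = [[1.0, 0.65], [0.65, 1.0]]
grid = pattern_matrix([1, 0], DATA)
print(grid)
```

Each entry of `grid` is the value that would be printed inside the corresponding block of color; rendering the colored blocks themselves depends on the output format and is not shown.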
- the generate pattern matrix process 255 then loops X 2034 times, once for each color group COLOR. For each COLOR[X] group, the generate pattern matrix process 255 prints 2036 to the output file a block of color corresponding to the respective COLOR[X] group. In a presently preferred embodiment, the generate pattern matrix process 255 also prints 2036 to the output file, in the respective block of color, the value of COLOR[X], i.e., the lower value of the range of correlation values for respective COLOR[X]. The block of color and respective value of COLOR[X] are printed under the title "COLOR_CODE".
- the generate pattern matrix process 255 also prints 2037 to the output file the correlation range corresponding to the respective COLOR[X] group.
- the correlation range for the COLOR[X] group is printed under the title "CORRELATION".
- the generate pattern matrix process 255 generates 2038 a carriage return before looping 2034 to the next COLOR[X] group.
- the generate pattern matrix process 255 is ended 2035.
- the pattern recognition oriented cluster process 200 is also ended 260.
- the methods and apparatus of the present invention provide versatile tools for evaluating sets of random objects and clustering them into groups based on predefined characteristic(s) of the objects.
- the random objects are chemical compounds and the predefined characteristic is similarity of biological activity.
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU25902/00A AU2590200A (en) | 1998-12-18 | 1999-12-17 | Pattern recognition oriented cluster analysis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11278698P | 1998-12-18 | 1998-12-18 | |
US60/112,786 | 1998-12-18 |
Publications (3)
Publication Number | Publication Date |
---|---|
WO2000036489A2 (en) | 2000-06-22 |
WO2000036489A3 (en) | 2000-11-30 |
WO2000036489A9 (en) | 2001-05-10 |
Family
ID=22345844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1999/030175 WO2000036489A2 (en) | 1998-12-18 | 1999-12-17 | Pattern recognition oriented cluster analysis |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2590200A (en) |
WO (1) | WO2000036489A2 (en) |
1999
- 1999-12-17: WO application PCT/US1999/030175, published as WO2000036489A2 (status: active, Application Filing)
- 1999-12-17: AU application AU25902/00A, published as AU2590200A (status: not active, Abandoned)
Also Published As
Publication number | Publication date |
---|---|
WO2000036489A2 (en) | 2000-06-22 |
AU2590200A (en) | 2000-07-03 |
WO2000036489A3 (en) | 2000-11-30 |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A2; Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW |
AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
AK | Designated states | Kind code of ref document: A3; Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW |
AL | Designated countries for regional patents | Kind code of ref document: A3; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
AK | Designated states | Kind code of ref document: C2; Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW |
AL | Designated countries for regional patents | Kind code of ref document: C2; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
COP | Corrected version of pamphlet | Free format text: PAGES 1/38-38/38, DRAWINGS, REPLACED BY NEW PAGES 1/34-34/34; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
122 | Ep: pct application non-entry in european phase | |