WO2001008039A2 - Analysis and pattern recognition in large, multidimensional data sets using low-resolution data grouping - Google Patents

Analysis and pattern recognition in large, multidimensional data sets using low-resolution data grouping Download PDF

Info

Publication number
WO2001008039A2
WO2001008039A2 PCT/US2000/020401 US0020401W WO0108039A2 WO 2001008039 A2 WO2001008039 A2 WO 2001008039A2 US 0020401 W US0020401 W US 0020401W WO 0108039 A2 WO0108039 A2 WO 0108039A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
user
grouping
scoring
rule
Prior art date
Application number
PCT/US2000/020401
Other languages
French (fr)
Other versions
WO2001008039A3 (en
WO2001008039A9 (en
Inventor
Charles L. Lerman
Original Assignee
Zeneca Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeneca Limited filed Critical Zeneca Limited
Priority to EP00950750A priority Critical patent/EP1210670A2/en
Priority to JP2001513058A priority patent/JP2003505790A/en
Priority to AU63809/00A priority patent/AU6380900A/en
Publication of WO2001008039A2 publication Critical patent/WO2001008039A2/en
Publication of WO2001008039A3 publication Critical patent/WO2001008039A3/en
Publication of WO2001008039A9 publication Critical patent/WO2001008039A9/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • This invention relates to analysis and pattern recognition of data. More
  • this invention relates to methods, systems and devices and
  • HTS high-throughput screening
  • HTS uses automated, relatively low-cost techniques to obtain various items of
  • the goal of using HTS is to obtain
  • a central concept on which this invention is based is grouping data into a
  • the grouping is visualized by assigning colors to data groups, e.g., in spreadsheets.
  • Grouping of data potentially changes the precision of the data.
  • this invention provides mechanisms to
  • this invention has reduced the time to evaluate real data sets
  • the system of this invention includes a new special command menu, a set
  • spreadsheet-coloring tool is presently implemented with a flexible, powerful, and
  • this invention provides methods, systems and devices for
  • the method of this invention provides at least one user-
  • the data may be
  • the grouping rule applies to at least one user-selectable
  • the grouping rule defines breakpoints
  • the method may
  • the grouping rules associate colors with groups
  • grouping rule specifies at least one breakpoint and a corresponding color for each
  • the breakpoints may be numeric or textual values. In some embodiments,
  • the breakpoint is determined automatically based on the data.
  • the number of groups may be fewer than a number of possible data values.
  • this invention is a method of operating on data by
  • the grouping rule is applied to the data to generate
  • At least one user-defined scoring rule is used to score grouped data
  • the scoring rule is applied to the grouped data
  • this invention is a method of operating on data, in
  • the rule for grouping the data into a user-definable number of groups are scored by applying to the grouped data at least one user-defined scoring
  • the scoring rule comprises a scoring function of user-
  • the scoring applies the function to the data to obtain a
  • the method includes sorting the scored cases by
  • the clustering column data, known as a clustering column.
  • integrated clusters are treated by averaging the properties of all the cases which
  • data is potentially changed (implemented, e.g., by grouping the data) and then the
  • data are presented for visualization (implemented, e.g., by coloring the data).
  • FIGURE 1 shows a typical computer system on which the present invention
  • FIGURE 2 shows an overview of the functionality of the present invention
  • FIGURES 3A-3B depict a display of data in a spreadsheet
  • FIGURES 4A-4B show a color control rules worksheet according to one
  • FIGURES 5A-5B show data coloring rules
  • FIGURES 6A-6C show a data coloring control panel and a flow chart of the
  • FIGURES 7A-8B show the worksheet of FIGURE 3 A and 3B after various
  • FIGURES 9A, 9B, lOA, and 10B depict displays of data in spreadsheets
  • FIGURES 11A and 11B show the form of the cluster control worksheet
  • FIGURES 11C-11D shows control panels from the cluster control
  • FIGURE 12 shows the enlarging of the cluster starts mechanism according
  • FIGURES 13A-13D show the application of vertical display re-scaling
  • FIGURES 14A-14D and 15A-15B show the application of the scoring
  • FIGURES 16A-16N, 16P and 16Q show aspects of the application of the
  • FIGURES 17A-17B show the application of the sheet statistics tool
  • FIGURES 18A-18D show the application of the scoring and sorting of
  • FIGURES 19-24 show examples of the application of this invention to
  • FIGURES 25 and 26 show application of an aspect of this invention.
  • FIGURE 1 shows a typical computer system 100 on which the present
  • the computer system 100 includes a processor (CPU) 102
  • the computer system also includes a memory system 104 and a display 106.
  • the computer system also includes a memory system 104 and a display 106.
  • the computer system also includes a display 106.
  • keyboard 108 includes various input devices including a keyboard 108 and a mouse 110 or other
  • Internal storage 112 e.g., a hard disk, a CD ROM and the like.
  • external storage 114 (such as a floppy disk, CD ROM and the like) are also examples.
  • the programs 116 can reside in the internal storage 112, the external storage 114, and/or in the memory 104.
  • the software programs 116 operate on data 118 which
  • the software programs 116 operate
  • the user can create and
  • the computer system 100 is capable of running
  • EXCEL the spreadsheet program EXCELTM 95 (hereinafter "EXCEL") from Microsoft
  • the package self-
  • this invention works entirely within the
  • EXCEL structures data files as workbook files which
  • the programs 116 of this invention consist of special EXCEL
  • control sheets on which input data is written by the user into
  • control sheets are part of the same EXCEL workbook
  • control sheets also contain action buttons to execute the
  • the rules 120 are formed by
  • control sheets stored on the control sheets along with the data, and they can be modified and/or
  • the system according to the present invention operates, in one aspect, in
  • the rules 120 can be created before or after the data 118, rules can
  • the user can then perform group/color-mediated data mining (at 128).
  • FIGURES 3A-3B show views of the program of this invention in operation
  • the data on the "DEMO 1 " sheet 300 consists of eight columns of data for each of a number of compounds.
  • FIGURES 3 A and 3B has seven other worksheets, denoted "DEMO 2" 304;
  • DEMO DEMO 308 correspond to data 118 (FIGURE 1) and the controls or rules in the
  • Control correspond to the rules 120 (FIGURE 1).
  • FIGURES 4A-4B show a color control rules worksheet (312, denoted "color
  • a typical data coloring rule 130 is shown in
  • FIGURE 5A The rule 130 has already been set up and operates on the appropriate
  • FIGURE 5A Actually, as explained below, the number of breakpoints is one less
  • colors 142 namely light green, yellow, orange and red.
  • the coloring is implemented by showing a
  • the actual text representing the data is shown in the appropriate color.
  • the font color is only changed in cases where
  • font colors dark (black) and light (pale gray), are used in the presently preferred
  • break 2 are colored yellow (color 2); values in the range 5 to 10 (break 2 to
  • break 3 are colored orange (color 3); and values greater than 10 (break 3) are
  • Another typical data coloring rule 130-1 from the color control sheet 312 is Another typical data coloring rule 130-1 from the color control sheet 312
  • FIGURE 5B The rule 130-1 is set up to operate on columns "C" and
  • FIGURES 7A-7B are shown in FIGURES 7A-7B. As can be seen from
  • FIGURES 7A-7B after application of the rule 130-1, all of the data in columns C
  • FIGURES 4A-4B to the data in sheet "DEMO 1" are shown in FIGURES 8A-8B.
  • FIGURES 7A-7B or they can be all be applied at the same time. In order to apply
  • each rule can be individually selected or
  • FIGURE 6B is a diagrammatic representation of FIGURE 6B
  • A:D,F means columns A,B,C,D, and F.
  • the user can right-click on the cell containing
  • the cell labeled "# of colors" the cell labeled "# of colors".
  • breakpoints i.e., the boundaries between the
  • breakpoint 1 is colored with
  • This mode reports information about the breakpoints it determines, and thus could also be used to explore the distribution of numerical values
  • Text data enter the strings to be matched and colored, in preferred embodiments, up to five (5) in
  • breakpoint values. For example, if a rule says that
  • the system colors the cell according to the first condition satisfied on the
  • the sequence of conditions can be considered as a series of filters, through which only the as-yet-uncolored cells fall through to the next decision.
  • (6) Enter the names of the colors to use (at 610), in the order
  • a user can combine two or more panels to create a particular
  • the data coloring mechanism operates as follows, with reference to the
  • FIGURE 6A is a diagrammatic representation of FIGURE 6A.
  • the system can identify which button was clicked using the Visual Basic ("VB")
  • the system retrieves the parameters (sheet name, column specification,
  • numeric v. text or manual v. automatic and, if automatic, which of value, log or
  • Each coloring rule is provided in a coloring control
  • FIGURE 6A panel that has the general form of a coloring rule as shown in FIGURE 6A.
  • each coloring control panel 144 is a double-outlined unit
  • the coloring control panel 144 to set the sheet and column(s) on which the rule is
  • the sheet is set by entering its name into the cell 146
  • selecting the cell 150 causes the value in the cell to cycle
  • CallColorColumn function extracts the button name of the cell 152 and then calls a second function ( ⁇ CallColorColumnSubroutine") with that button name as one
  • the function CallColorColumnSubroutine takes three parameters, namely ButtonName, StartColoringRow, and FinishColoringRow.
  • StartColoringRow The two parameters StartColoringRow, and FinishColoringRow are optional.
  • the values are all in fixed positions relative to the selected button cell 152 that initiated the call to the function CallColorColumn, the values can be determined
  • the column(s) to be colored are specified by:
  • ncolors The number of colors is specified by the variable ncolors, where:
  • ncolors Cells (headingrow + 3, datacol) .Value
  • break and color which are used to store the breakpoints and colors
  • the first breakpoint is set as follows:
  • the value of the first breakpoint is used to determine the break mode
  • break mode is set to "TEXT".
  • the function determines whether multiple columns were specified, in
  • rescale_all_string Cells (headingrow + 15, datacol) .Value
  • numeric breakpoints are not in increasing order, the user is notified.
  • the CallColorColumnSubroutine maintains an array, colorname, which
  • the program next associates the user
  • variable ⁇ ranges from 1 to ncolors.
  • the designated column is colored from the row corresponding to
  • breakmode is "VALUE” or "LOG” and the value of rescale all is set to “True” Or the value of the first breakpoint (break(l)) is set to "VALUE” or "LOG", the program calls the function AutoValueBreakpoints as follows:
  • breakmode is set to "COUNT” and the value of rescale all is set to "True” or the first breakpoint (break(l)) is set to "COUNT"
  • AutoCountBreakpoints (colnum, colletter, ncolors, break, Color, breakmode, rescale_all, sheetname) .
  • breakmode is any breakpoints specified by the user. Specifically, when the breakmode is any
  • invention provides five mechanisms (each discussed below) for color-mediated
  • FIGURES 11A-11B are shown in FIGURES 11A-11B.
  • cluster in clustered data by enlarging the font of the cell containing the cluster number or label, thus enabling size reduction of the spreadsheet for the user to focus on the color patterns.
  • cluster numbers or textual labels Any column can be designated as the cluster labels to be processed. Operation of the mechanism is as follows:
  • FIGURE 12 which shows the application of a rule (shown in the control panel
  • FIGURE 11C from the cluster control worksheet in FIGURE 11B to the data of
  • the vertical re-scaling mechanism operates by taking a user-provided scale
  • FIGURES 13A-13D show the application of the vertical display re-scale
  • FIGURES 13A-13B show some of
  • FIGURE 13B shows the remaining elements of that
  • the worksheet "DEMO 3" 306 has three hundred and twenty eight (328) data entries (in rows 2-329). The user can vertically scale the display by selecting "Re-scale Vertical" from the system's
  • FIGURE 13C which asks the user to enter a scaling factor relative to the current size.
  • the user enters a scaling factor to enlarge or reduce or restore the display.
  • the user enters a scaling factor of 0.1
  • FIGURE 13D the height of each row (except the heading rows) has been scaled by
  • Scoring and sorting clusters assign numerical scores to the color patterns
  • the scoring and sorting mechanism accepts user's designations of colors
  • Any column can be designated as the cluster labels to be processed.
  • Cluster Control worksheet 314 User input is taken from a Cluster Control worksheet 314 (see
  • FIGURES 11 A and 11B which stores any number of parameter sets, each one
  • the input data is automatically sorted by cluster label before starting, in
  • the first derived sheet is for the numerical scores
  • the second is like the original, but has the clusters sorted into descending score order, so that the "best" are at the top, removing the need to visually scan a long colored worksheet.
  • the derived output sheets have names that indicate their source data
  • the system reversibly hides the un-scored columns in the cluster-sorted output sheet,
  • cluster labels then the system compares single compounds rather than clusters.
  • the mechanism operates as follows, with reference to FIGURES 11A-11C.
  • panel 1100 to use (a panel which is empty or one containing non- needed inputs).
  • a single control panel extends vertically through the black, blue, red, and green sections, and provides input information for several tools.
  • the scores are arbitrary and relative; they will be normalized by the system as necessary. However, a user should be sure always to assign higher point scores to colors which denote favorable values, and
  • the cells with entries need not be colored, and need not be in score order,
  • uncolored cells which are most likely blank, i.e., unknown data
  • the user may decide that some of the colored groups
  • weights are arbitrary and relative; they will be scaled by the system
  • the two modes are:
  • the "SORTED" sheet contains a
  • FIGURE 11D which is taken from the cluster control worksheet shown in
  • FIGURE 11 A As shown in FIGURE 11 A, the parameters are stored with the name
  • FIGURE 11D give the color red a score of "-1", orange has a score of "0", yellow
  • the first added sheet contains
  • the second added sheet has clusters sorted according to the one auto mode chosen when the tool ran.
  • the user can selectively unhide certain columns by using the "Edi GoTo" menu
  • FIGURES 8A-8B the cluster control worksheet shown in FIGURE 11 A, and the
  • control panel "Cmpd" of FIGURE 11D by selecting "Score
  • the system first presents a dialog box (1402 shown in
  • FIGURE 14A asking the user how un-colored cells should be scored for sorting.
  • un-colored cells can be scored explicitly by user entries
  • the system provides a
  • FIGURES 14C-14D show the data in the newly created workbook.
  • FIGURES 10A-10B the cluster control worksheet shown in FIGURE 11 A, and the
  • FIGURES 15A-15B SCORES by acids
  • HTS lab is currently able to provide dose-response data on some subset of the
  • present invention includes a mechanisms/algorithms for analyzing percent-of-
  • the mechanism recognizes two properties of the dose-response data for
  • this invention includes an algorithm to assign numerical scores for dose-
  • estimates of IC 50 for the compounds can be generated by interpolating between the markers in the ordered list of composite
  • Cluster Control worksheet e.g., as shown in
  • FIGURES 11A-11B shows a control panel from the cluster control
  • worksheet "DEMO 1" for scoring dose-response.
  • the system detects uncolored data, notifies the user, and asks whether to continue. If yes, the system skips the row containing the uncolored data.
  • the system offers to regenerate existing table of Sheet Statistics to correct it for newly added score columns. Further, the system offers to sort the
  • the system also offers to carry out quantitative
  • the columns should be ordered the opposite way (lowest concentration left, highest right).
  • control panel 1100 to use.
  • a single control panel extends vertically through the black, blue, red, and green sections, and provides input information for several tools.
  • dose-response data (using the same syntax as for Data Coloring).
  • the data columns are ordered left to right, by decreasing concentration.
  • the scoring algorithm awards positive score points for each dilution step across the data that actually shows a decrease in the activity data group (i.e., the color), and to penalize every step that does not.
  • the algorithm uses the following
  • C number of colors used, i.e., number of data value groups
  • P number of points measured, i.e., number of different concentrations (doses) tested
  • FIGURE 16F shows artificial data and processing for twenty seven (27)
  • FIG.URE 16G the complete set in order of decreasing dose-responsiveness.
  • the second property of interest to be extracted from the data is the overall
  • the data value groups' ordinal index numbers are used as single-point
  • chemists also want a single index of compound quality derived from the dose-
  • the activity weight WA varies with the activity score SA in such a way as
  • FIGURE 161 shows the variation for a few values of k.
  • FIGURE 16J shows all three scores for the example complete set, now sorted by
  • the set included thirty one (31) compounds, with potencies evenly spaced
  • test set includes both "very active” and “very inactive” compounds relative
  • Pinhib 100 • (ligand) / [K + (ligand)]
  • the method uses calibration marker compounds.
  • Protein chemists solve the molecular weight problem by running marker
  • the system asks the user to input the concentrations
  • the marker data are then colored by the same rule used for the user's
  • potencies, and the potencies of the other compounds can be estimated by interpolating between the markers, using the composite dose-response scores.
  • FIGURES 16N and 16P show that the estimates are clearly quite good
  • TABLE 7 shows that the method successfully estimates the potencies within about a factor of two, even with high noise levels.
  • ORIGIN is a data analysis program from Microcal Software, Inc. of Northampton, Massachusetts. ORIGIN is used in this instance for non-linear least-squares

Abstract

Methods, systems and devices for operating on data provide at least one user-defined grouping rule for grouping the data into a user-definable number of groups; and apply at least one of the grouping rules to the data. The data may be in a table, wherein the at least one grouping rule applies to at least one user-selectable column of the table. The grouping rule defines breakpoints corresponding to the user-definable number of groups, and application of the at least one rule to the data divides the data into groups based on the breakpoints. The grouped data is presented in a manner that visually distinguishes the groups, sometimes by coloring an aspect of the data according to the rules.

Description

ANALYSIS AND PATTERN RECOGNITION IN LARGE, MULTIDIMENSIONAL DATA SETS USING LOW-RESOLUTION DATA GROUPING
A portion of the disclosure of this patent document contains material
which is subject to copyright protection. The copyright owner has no objection to
the facsimile reproduction by anyone of the patent document or the patent
disclosure, as it appears in the Patent and Trademark Office patent file or records,
but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
This invention relates to analysis and pattern recognition of data. More
particularly, this invention relates to methods, systems and devices and
combinations thereof for analysis and pattern recognition in large sets of
multidimensional data using low-resolution data grouping.
2. Background
With the advent of computerization and the low cost of data storage and
acquisition, people in many endeavors are now accumulating very large sets of
data. For example, scientists in drug and chemical companies now use automation
to perform so-called high-throughput screening ("HTS") of chemical compounds.
HTS uses automated, relatively low-cost techniques to obtain various items of
information about chemical compounds. The goal of using HTS is to obtain
information about a very large number of compounds in a quick and relatively
low-cost manner. Having accumulated a very large HTS data set, it is necessary
to evaluate the data in order to determine which, if any, of the analyzed compounds warrants further investigation. However, the results of such HTS tend to be very large sets of multidimensional data, on the order of thousands of rows
and dozens of columns, and so it is very difficult to make decisions just by
looking at the data. In addition to the very large amounts of data produced by HTS, difficulties in existing data handling and analysis methods include the following:
• Data comes from very diverse sources, including HTS laboratories, physical measurements, bio-scientists' laboratories, various computational software programs, etc., and the different sources tend to have very diverse kinds of output including numbers, text, mixed data types, error notations,
blank data, replicate data (more than one value per compound), etc.
• Not all sources produce data on the same list of compounds, or in the same order.
• Some data values are misleadingly too precise, i.e., have high relative experimental errors or noise, and can easily be over-interpreted.
• Medicinal chemists have to weigh very different kinds of factors (for
example, molecular weight vs. dose-responsiveness vs. ClogP vs.
secondary biology vs. selectivity across screens) in trying to determine
which are the best compounds or clusters of compounds to which to
devote further work. SUMMARY OF THE INVENTION
This invention solves the above and other problems by providing
automated tools to help with and speed up these data handling and analysis
processes. These tools embody some assumptions about how the data should be
treated by internalizing the most generally acceptable assumptions, but leaving
more idiosyncratic decisions to individual users.
A central concept on which this invention is based is grouping data into a
relatively small number of categories using low-resolution data grouping. The grouping is visualized by assigning colors to data groups, e.g., in spreadsheets.
Grouping of data potentially changes the precision of the data.
This categorization of data has several major benefits, including:
• creating a visual means of finding data patterns;
• beneficially blurring small variations in numerical data that are, in
practice, excessively fine distinctions, possibly due to experimental
error; and
• providing, in the colors themselves, a means or "common currency" to
evaluate candidates across a wide range of data types.
Accordingly, in one aspect, this invention provides mechanisms to
expedite pattern recognition in large sets of multidimensional data, such as those
that chemists assemble when evaluating hits from high-throughput screening
(HTS) and deciding which ones will get priority for further investigation. In
controlled trials, this invention has reduced the time to evaluate real data sets,
from days of intense human effort, which is vulnerable to errors due to volume or
fatigue, to a few minutes of automation with graphical presentation of results. It quickly becomes obvious upon using the system that the tools also have
value in data-handling areas other than HTS. Examples include selection and
management of any kind of tabulated data, e.g., portfolio management for any
kind of rated portfolios, selection of drug candidate compounds, selection and
management of proteins that are candidates for targets for drugs, selection and
management of research projects competing for resources, and evaluating
employee performance or job candidates.
The system of this invention includes a new special command menu, a set
of graphical user interface worksheets, and action buttons to facilitate the coloring
and color analysis processes for the user. While the central process is the data
grouping and coloring, there are also new tools for the upstream, or pre-grouping
and coloring processes of importing, assembling, regularizing, and characterizing
data in a spreadsheet, and for the downstream processes of visualizing, scoring,
comparing, and sorting large amounts of color-coded data. The data-grouping and
spreadsheet-coloring tool is presently implemented with a flexible, powerful, and
convenient user interface that does not require knowledge of spreadsheet macros
or of the Visual Basic language (used for the system's implementation).
Accordingly, this invention provides methods, systems and devices for
operating on data.
In one aspect, the method of this invention provides at least one user-
defined grouping rule for grouping the data into a user-definable number of
groups. At least one of the grouping rules is applied to the data. The data may be
provided in a table and the grouping rule applies to at least one user-selectable
column of the table. In some embodiments, the grouping rule defines breakpoints
corresponding to the user-definable number of groups. Application of the rule the data divides the data into groups based on the breakpoints. The method may
include presenting the grouped data in a manner that visually distinguishes the
groups. In some embodiments, the grouping rules associate colors with groups
and the grouped data is presented with an aspect of the data colored according to
the rules.
Sometimes the data are in labeled columns in a spreadsheet, and the
grouping rule specifies at least one breakpoint and a corresponding color for each
range defined by the breakpoint. The grouped data are presented by coloring each
data item in one labeled column of the data based on the breakpoint and the
corresponding color of the breakpoint.
The breakpoints may be numeric or textual values. In some embodiments,
the breakpoint is determined automatically based on the data.
Sometimes the data are provided in a table, and backgrounds of table cells
are colored according to the rules.
The number of groups may be fewer than a number of possible data values.
In another aspect, this invention is a method of operating on data by
providing at least one user-defined grouping rule for grouping the data into a user-
definable number of groups. The grouping rule is applied to the data to generate
grouped data. At least one user-defined scoring rule is used to score grouped data
according to user-defined scores. The scoring rule is applied to the grouped data
to score the grouped data.
In yet another aspect, this invention is a method of operating on data, in
which data are grouped by applying to the data at least one user-defined grouping
rule for grouping the data into a user-definable number of groups. The grouped data are scored by applying to the grouped data at least one user-defined scoring
rule for scoring the grouped data according to user-defined scores.
In some embodiments the data can be a number of parameters for each of a
number of cases and the scoring rule comprises a scoring function of user-
selectable parameters and user-defined weights for the selected parameters to be
used in scoring the cases. The scoring applies the function to the data to obtain a
score for each case. Sometimes the method includes sorting the scored cases by
score, individually or by cluster, as described below.
The notion of clustering is that subsets of the various cases may be
associated into clusters by having identical entries in any user-selected column of
data, known as a clustering column. In some embodiments of the invention, the
integrated clusters are treated by averaging the properties of all the cases which
comprise each cluster.
Thus, according to aspects of this invention, in order to facilitate analysis
and pattern recognition in large, multidimensional data sets, the precision of the
data is potentially changed (implemented, e.g., by grouping the data) and then the
data are presented for visualization (implemented, e.g., by coloring the data).
BRIEF DESCRIPTION OF THE DRAWINGS
This file contains at least one drawing executed in color. Copies of this
patent with color drawings will be provided by the United States Patent and
Trademark Office upon request and payment of the necessary fee.
The above and other objects and advantages of the invention will be
apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which the reference characters
refer to like parts throughout and in which:
FIGURE 1 shows a typical computer system on which the present invention
operates;
FIGURE 2 shows an overview of the functionality of the present invention;
FIGURES 3A-3B depict a display of data in a spreadsheet;
FIGURES 4A-4B show a color control rules worksheet according to one
embodiment of the present invention;
FIGURES 5A-5B show data coloring rules;
FIGURES 6A-6C show a data coloring control panel and a flow chart of the
data coloring process, respectively;
FIGURES 7A-8B show the worksheet of FIGURE 3 A and 3B after various
coloring rules in FIGURE 4A have been applied;
FIGURES 9A, 9B, lOA, and 10B depict displays of data in spreadsheets;
FIGURES 11A and 11B show the form of the cluster control worksheet
according to one embodiment of the present invention;
FIGURES 11C-11D shows control panels from the cluster control
worksheet of FIGURES 11A-11B;
FIGURE 12 shows the enlarging of the cluster starts mechanism according
to one embodiment of the present invention;
FIGURES 13A-13D show the application of vertical display re-scaling
according to one embodiment of the present invention;
FIGURES 14A-14D and 15A-15B show the application of the scoring and
sorting of clusters according to one embodiment of the present invention; FIGURES 16A-16N, 16P and 16Q show aspects of the application of the
dose-response scoring and estimation of potencies according to one embodiment
of the present invention;
FIGURES 17A-17B show the application of the sheet statistics tool
according to one embodiment of the present invention;
FIGURES 18A-18D show the application of the scoring and sorting of
clusters for the purpose of project prioritization and management according to one
embodiment of the present invention;
FIGURES 19-24 show examples of the application of this invention to
various types of data; and
FIGURES 25 and 26 show application of an aspect of this invention.
DETAILED DESCRIPTION OF THE PRESENTLY PREFERRED EXEMPLARY
EMBODIMENTS
Overview
FIGURE 1 shows a typical computer system 100 on which the present
invention operates. The computer system 100 includes a processor (CPU) 102
connected to a memory system 104 and a display 106. The computer system also
includes various input devices including a keyboard 108 and a mouse 110 or other
pointing device. Internal storage 112 (e.g., a hard disk, a CD ROM and the like)
and external storage 114 (such as a floppy disk, CD ROM and the like) are also
provided.
Various aspects of this invention are implemented as computer software
programs or algorithms 116 which run on the computer system 100. The software
programs 116 can reside in the internal storage 112, the external storage 114, and/or in the memory 104. The software programs 116 operate on data 118 which
is provided, e.g., on the external storage 114. The software programs 116 operate
in a standard and known manner by being executed on the processor 102 of the
computer system 100.
In some embodiments of the present invention, the user can create and
modify various executable rules 120 which can operate on the data 118. For the
sake only of explanation, the rules 120 are depicted separately from the data in the
figures. However, as explained in more detail below, some or all of the rules 120
can be part of the data 118.
In preferred embodiments, the computer system 100 is capable of running
the spreadsheet program EXCEL™ 95 (hereinafter "EXCEL") from Microsoft
Corporation, and the software computer programs 116 are written in Microsoft
Corporation's Visual Basic (hereinafter "VB") and are provided as an add-in to
EXCEL. A single copy of software thus serves all data files on a particular
machine. To conserve EXCEL resources, in some embodiments, the package self-
installs the add-in when the user opens a data file, and un-installs the add-in when
the last data file in memory is closed.
In a preferred embodiment, this invention works entirely within the
environment of EXCEL. EXCEL structures data files as workbook files which
contain worksheets. The programs 116 of this invention consist of special EXCEL
worksheets, called control sheets, on which input data is written by the user into
designated labeled cells. The control sheets are part of the same EXCEL workbook
file as the data. The control sheets also contain action buttons to execute the
various procedures associated with this invention. The rules 120 are formed by
setting various parameters in the control sheets. When the workbook file is saved, the parameters (for the rules 120) are
stored on the control sheets along with the data, and they can be modified and/or
re-executed at any time without having to re-enter anything. The results of
operations are automatically written as worksheets in the same workbook file,
providing a convenient, integrated data environment in a single file.
The system according to the present invention operates, in one aspect, in
accordance with FIGURE 2. Recall that the user's aim is to perform analysis and
pattern recognition in large, multidimensional data sets using (potentially low
resolution) data grouping. To this end, the user and/or the system will create rules
for coloring and presenting the data. First (at 122) a user creates and organizes the
data 118. Various tools (discussed below) are provided to aid in the creation and
organization of the data. Then (at 124) the user creates rules 120 for operating on
the data 118. The rules 120 can be created before or after the data 118, rules can
be reused for different sets of data and multiple rules can apply to the same data.
The creation and operation of rules are discussed in greater detail below. Once the
data 118 and the rules 120 are created, the user then selects some (or all) of the
rules to apply to the data (at 126). Specifically, the user groups and thereby colors
the data according to selected rules. With the data grouped and colored according
to the rules, the user can then perform group/color-mediated data mining (at 128).
FIGURES 3A-3B show views of the program of this invention in operation
with a sample EXCEL sheet 300, denoted "DEMO 1" (302) containing data (not all
the data in the sheet is visible). The views of EXCEL worksheets shown in the
various figures and examples that follow are the views that are presented on the
display 106 of the computer system 100. Sheets in an EXCEL workbook are
labeled with tabs at the bottom of the worksheet. The data on the "DEMO 1 " sheet 300 consists of eight columns of data for each of a number of compounds. The
compounds are denoted "Cmpdxx", where "JCJC" ranges from "01" to the number of
compounds. In FIGURE 3B, the last compound visible on the data sheet is
"Cmpd58". The eight columns are headed:
1. "Cmpd" (column A);
2. "Series" (column B);
3. "Testl" (column C);
4. "Test2" (column D);
5. "Test3" (column E);
6. "HTS SPA Dose-Resp % Inhib @3xlO-6M" (column F);
7. "HTS SPA Dose-Resp % Inhib @ Ixl0-6M" (column G);
8. "HTS SPA Dose-Resp % Inhib @ 3xlO-7M" (column H); and
9. "HTS SPA Dose-Resp % Inhib @ Ixl0-7M" (column I).
In addition to the "DEMO 1" worksheet 300, the EXCEL workbook shown
in FIGURES 3 A and 3B has seven other worksheets, denoted "DEMO 2" 304;
"DEMO 3" 306; "clusterinfo DEMO" 308; "Append Control" 310; "Color
Control" 312 and "Cluster Control" 314. The last three worksheets, denoted
respectively "Append Control"; "Color Control" and "Cluster Control," contain
various rules and controls (to be discussed below). The data in worksheets
denoted "DEMO 1 " 302; "DEMO 2" 304; "DEMO 3" 306; and "clusterinfo
DEMO" 308 correspond to data 118 (FIGURE 1) and the controls or rules in the
worksheets denoted "Append Control" 310; "Color Control" 312, and "Cluster
Control" correspond to the rules 120 (FIGURE 1).
FIGURES 4A-4B show a color control rules worksheet (312, denoted "color
control") according to the present invention, as displayed on display 106 of the computer system 100. The color control worksheet 312 is shown with some rules
already in place, i.e., having values set, and other rules left blank. These rules are
shown as examples only, and, as with any of the other types of rules, any or all of
the rules can be set by the user. A typical data coloring rule 130 is shown in
FIGURE 5A. The rule 130 has already been set up and operates on the appropriate
data when selected by a user (using mouse 110, FIGURE 1 or some other pointing
device) in the area 132 marked "Click here to run these". The rule 130 (as with all
of the preferred color control rules) has four parts, namely the name of the sheet
134 containing the data on which the rule is to operate ("DEMO 1" in the example
of FIGURE 5A); the columns 136 of data of the sheet on which the rule is to
operate ("E" in the example of FIGURE 5A); the number of colors 138 to be used
by the rule 130; a number of breakpoints 140 (denoted "break 1" to "break 4" in
the example of FIGURE 5A); and a corresponding number of colors 142 for each
range defined by the breakpoints (denoted "color 1" to color 4" in the example of
FIGURE 5A). Actually, as explained below, the number of breakpoints is one less
than the number of colors.. In the specific example shown in FIGURE 5 A, the rule
has three breakpoints of 1, 5 and 10, defining four ranges with four corresponding
colors 142, namely light green, yellow, orange and red. Preferably the named
colors are also depicted in the actual colors, so that, in this example, the
background of the word "lightgreen" is shown in light green, the background of
the word "yellow" is shown in yellow and so on.
In this invention it is preferable to show data and meta-data (headings etc.)
in color. In some embodiments, the coloring is implemented by showing a
background area of the text representing the data in the appropriate color.
Sometimes the actual text representing the data is shown in the appropriate color. In presently preferred embodiments, the font color is only changed in cases where
necessary to improve contrast with the background color for readability. Only two
font colors, dark (black) and light (pale gray), are used in the presently preferred
embodiment. Combinations of both approaches can be used. For example, the
background section of the word "yellow" is preferably shown in the color yellow.
It is also possible to show the word itself, i.e., the font, in the color yellow, as long
as that color is distinguishable from the background.
The particular rule 130 shown in FIGURES 4A and 5A, operates as follows,
when selected:
In sheet "DEMO 1" 302, in column E, values less than or equal to 1 (break
1) are colored light green (color 1); values in the range 1 to 5 (between break 1
and break 2) are colored yellow (color 2); values in the range 5 to 10 (break 2 to
break 3) are colored orange (color 3); and values greater than 10 (break 3) are
colored red (color 4).
Another typical data coloring rule 130-1 from the color control sheet 312
is shown in FIGURE 5B. The rule 130-1 is set up to operate on columns "C" and
"D" of sheet "DEMO 1". The rule 130-1 uses three (3) breakpoints (breakl=0.1,
break2=l and break3=5) defining four ranges with four (4) corresponding colors
("lightgreen", "yellow", "orange", and "red").
The results of applying the rule 130-1 of FIGURE 5B to the data in sheet
"DEMO 1" (302, FIGURE 3) are shown in FIGURES 7A-7B. As can be seen from
FIGURES 7A-7B, after application of the rule 130-1, all of the data in columns C
and D of the sheet labeled "DEMO 1" has been colored according to the rule.
Specifically, data having a value less than or equal to break 1 (0.1) have been
colored light green; data values in the range between break 1 and break 2 (0.1 to 1) have been colored yellow; data values in the range between break 2 and break 3
(1 to 5) have been colored orange; and data values greater than break 3 (5) have
been colored red.
The results of applying all of the other color control rules shown in
FIGURES 4A-4B to the data in sheet "DEMO 1" are shown in FIGURES 8A-8B.
The rules can be applied individually (as shown above with respect to
FIGURES 7A-7B), or they can be all be applied at the same time. In order to apply
all rules to a particular data set (sheet), each rule can be individually selected or
the area labeled "RE-RUN ALL RULES FOR SHEET NAMED DEMO 1" (on the right
side of FIGURE 4 A) can be selected. Note that if two rules apply to the same
column of the same sheet, the second rule run on that column will override the
first rule run on that column.
To create a coloring rule a user performs the following (with reference to
FIGURE 6B):
(1) Select the "COLOR CONTROL" sheet 312 and pick a control panel on
that sheet to use (an empty panel or one containing a rule no longer
needed) (at 600). All control panels on a sheet can be cleared by
clicking the button labeled "CLEAR ALL ENTRIES ON THIS SHEET"
(318 in FIGURE 4 A).
(2) In the selected control panel, enter the name of the sheet to be
colored (at 602).
(3) In the selected control panel, enter a column or columns (at 604).
For multiple columns, either list them separated by commas, or use
a colon or hyphen to denote ranges, or some combination. For
example, "A:D,F" means columns A,B,C,D, and F. To aid in choosing columns, the user can right-click on the cell containing
the name of the data sheet, and pick "Open Twin Screen" from the
shortcut menu that appears, to create a special dual display. This
also creates a "Close Twin Screen" button to go back.
(4) Choose a number of colors to use (at 606), either by entering the
number of colors or by repeatedly clicking the gray button adjacent
the cell labeled "# of colors". In preferred embodiments, the
system allows for five breakpoints and six colors per rule.
Accordingly, the numbers will cycle from 1 to 6, and various cells
below them will be blacked out accordingly.
(5) Enter the breakpoints that define the color groups (at 608), in any
of three modes:
a) Numeric data, manual mode: enter numbers to form
the breakpoints, i.e., the boundaries between the
color groups, one less than the number of colors, in
increasing numerical order. Cells whose values
exactly equal a breakpoint value will be colored
with the lower group (breakpoint 1 is colored with
color 1, etc.)
b) Numeric data, automatic mode: enter either "value",
"log", or "count" as the first breakpoint. If multiple
columns have been chosen, the user must also enter
"yes" or "no" opposite "Re-scale all?" at the bottom
of the panel, to indicate whether each column
should get its own auto-breakpoints, or whether the auto-breakpoints of the first column (first in list in
the rule, not first on the data sheet) should be used
for all.
This mode reports information about the breakpoints it determines, and thus could also be used to explore the distribution of numerical values
in a column prior to a final manual breakpoint
selection. c) Text data: enter the strings to be matched and colored, in preferred embodiments, up to five (5) in
number. Matching is case-insensitive unless the
string is enclosed in double quotes (" and "); otherwise, no quotation marks are necessary.
Several special text strings act as operators if entered as the first word in a rule cell:
Figure imgf000017_0001
Figure imgf000018_0001
Using quotes to force matching to be case-sensitive also
works with strings that follow an operator.
It is possible to construct a text-coloring rule in which certain cells may satisfy more than one of the
"breakpoint" values. For example, if a rule says that
"active" is colored green and "contains act" is colored red, then the word "active" in a cell would
satisfy both. In such cases, the system colors the cell according to the first condition satisfied on the
list of breakpoints. This dependence on the
order can be used advantageously to achieve complex coloring conditions. The sequence of conditions can be considered as a series of filters, through which only the as-yet-uncolored cells fall through to the next decision. (6) Enter the names of the colors to use (at 610), in the order
corresponding to the breakpoints. A display of color samples is
provided at the right side of the Color Control sheet 312. A user
need only enter the name, and the appropriate cell will become
colored when the tool is executed. If the user wants the color to
display immediately, he can copy and paste the sample cell into the
rule's color cell. A special pseudo-color named "SKIP" is used to
tell the system not to color the cells whose data falls in this group.
(7) When the rule has been created, the user executes the rule by
selecting the rule' s "CLICK HERE TO RUN THESE" button on the
panel filled in (at 612, FIGURE 6C).
(8) To create different coloring rules for other columns, repeat the
above in additional control panels. If the user runs out of control
panels, he can create more control panels by copying an existing
one and pasting it onto a blank section of the color control sheet.
To the extent that a single panel cannot hold all the requirements for a
particular rule, a user can combine two or more panels to create a particular
rule. For example, if a user needs ten (10) breakpoints, two panels can be
used.
With reference to the coloring rule is shown in FIGURE 6A, once the
rule has been set, a number of parameters are stored in the system. The
parameters are "sheet name" ("DEMO 3" in FIGURE 6A), column
specification, number of colors, array of breakpoints, array of colors, and
multicolumn scaling mode. The data coloring mechanism operates as follows, with reference to the
flowchart of FIGURE 6C:
1. The user enters the parameters into a rule panel on a "COLOR
CONTROL" worksheet 312, e.g., as described above with reference to the panel of
FIGURE 6A.
2. The user selects (clicks) the activation button (labeled "Click here
to run these") on that panel (at 612). This causes the system to:
(A) Read and interpret the parameters from the panel (at 614).
The system can identify which button was clicked using the Visual Basic ("VB")
"caller" property, described in more detail below. The parameters are then read
based on the identity of the cell location of the button using the VB "TopLeftCell"
property. The system retrieves the parameters (sheet name, column specification,
number of colors, array of breakpoints, array of colors, and multicolumn scaling
mode) from cells in this panel by relative reference to the button cell.
(B) Next, determine the mode of the coloring rule (at 616) (i.e.,
numeric v. text or manual v. automatic, and, if automatic, which of value, log or
count). This uses the analysis of the first breakpoint entry.
(C) Compile a list of the columns specified in the "column
specification" parameter (at 618). This is done by scanning the various areas
contained in the selection, as follows:
For Each singlearea In Selection. Areas For Each c In singlearea. Columns If Not CountEmpty Then lr = LastRowInColumn (c. Column)
End If If Not CountEmpty And lr = 0 Then 'skip this empty column ncols = ncols - 1 Else ' add this column to the list icol = icol + 1 colnumarray (icol) = c . Column End If Next c Next singlearea
(D) If an auto-breakpoint mode is being used (determined at
620), analyze the data values to determine the breakpoints (at 622). This is done
by:
(i) Collecting statistics on the data distribution in each
specified column; and
(ii) Calculating the automatic break points for the appropriate mode. For example, the auto-value breakpoints
are determined as follows:
If breakmode = "VALUE" Then interval = (maxvalue - minvalue) / ncolors break (0) = minvalue
For ibreak = 1 To ncolors - 1 break (ibreak) = break (ibreak - 1) + interval Next ibreak
(iii) Displaying the results for user approval or
cancellation.
(C) Loop through the cells in the chosen columns on the chosen
worksheet (at 624). (D) Compare each cell's value to the list of breakpoints (at
626). If the coloring rule is in text mode, use the meanings of the special breakpoint operators ("contains", "blank", asterisk "*"; or quotation
marks).
(E) When a match is found, apply the appropriate color (at
628).
The code below illustrates the processes (D) and (E) for numeric
breakpoints:
For Each cell In Range (Cells (StartColoringRow, colnum),
Cells (FinishColoringRow, colnum) ) cvalue = cell.Value colored = False If IsNumeric (cell .Value) Then
If Not IsEmpty (cell) Then
' (have to test both because EMPTY is numeric) For ibreak = 1 To ncolors - 1
If cvalue <= break (ibreak) Then If Color (ibreak) <> "SKIP" cell . Interior. Colorlndex =
Color (ibreak) Call TextContrast (cell) End If colored = True
Exit For End If Next ibreak If colored = False Then ' Not hi t yet? Must be top category, so:
If Color (ncolors) <> "SKIP" cell . Interior .Colorlndex = Color (ncolors) Call TextContrast (cell) End If colored = True End If End If
Else ' not numeric - just don ' t color i t End If Next cell
The operation of the data coloring tool of this invention will now be
described in greater detail. Each coloring rule is provided in a coloring control
panel that has the general form of a coloring rule as shown in FIGURE 6A. In one
preferred embodiment, each coloring control panel 144 is a double-outlined unit,
sixteen (16) cells high by two (2) cells wide. As noted above, a user is provided
with coloring control panels on the color control worksheet 312. A user can use
the coloring control panel 144 to set the sheet and column(s) on which the rule is
to operate, the number of colors, the various break points and the colors associated
with those breakpoints. The sheet is set by entering its name into the cell 146
adjacent the cell labeled "sheet". The column (or columns) on which the rule is to
operate is (are) set by entering its (their) name in the cell 148 adjacent the cell
labeled "column(s)". The number of colors is set by the user by selecting the cell
150 adjacent to the cell labeled "# of colors". Each time the cell 150 is selected it
increases the number of colors, up to a maximum of six (6), i.e., rotating through
the values 1 to 6. I.e., when the cell 150 shows a "6" and is selected, it reverts
back to "1". That is, selecting the cell 150 causes the value in the cell to cycle
from "1" to "6" and then back to "1".
A Visual Basic ("VB") macro function ("CallColorColumri''') is associated
with the top cell 152 of the control panel 144. When the cell 152 is selected by the user (with the mouse 110 or the like), the function associated with that cell is
executed by the computer (CPU 102). In the presently preferred embodiments, the
CallColorColumn function extracts the button name of the cell 152 and then calls a second function (^CallColorColumnSubroutine") with that button name as one
of the parameters. The function CallColorColumnSubroutine takes three parameters, namely ButtonName, StartColoringRow, and FinishColoringRow.
The two parameters StartColoringRow, and FinishColoringRow are optional.
First, the function CallColorColumnSubroutine determines what specific
values to use for the coloring by reading them from the control panel 144. Since
the values are all in fixed positions relative to the selected button cell 152 that initiated the call to the function CallColorColumn, the values can be determined
once the location of that button cell 152 has been determined. This is done using
the following Visual Basic code:
Sheets ("Color Control") .Activate headingrow =
ActiveShee .Buttons (ButtonName) .TopLeftCell .Row headingcol =
ActiveSheet .Buttons (ButtonName) .TopLeftCell . Column
Note that if the function CallColorColumnSubroutine was called from another sheet (not "Color Control") then this method will not find it.
The various parameter values are then read as follows:
Sheet name: datacol = headingcol + 1 sheetname = Trim (Cells (headingrow + 1, datacol) .Value) If there is no sheet named "sheetname" an error function is called.
Generally, in preferred embodiments, a great deal of error checking takes place to
ensure that the user is provided with a friendly and useable interface to the program. Most of the error checking is not mentioned in this description,
however, one skilled in the art would know what kinds of error checking to implement in order to provide a user-friendly working environment.
The column(s) to be colored are specified by:
Cells (headingrow + 2, datacol) .Value
The number of colors is specified by the variable ncolors, where:
ncolors = Cells (headingrow + 3, datacol) .Value
Within the function CallColorColumnSubroutine there are two arrays,
named break and color, which are used to store the breakpoints and colors,
respectively. The first breakpoint is set as follows:
break(l) = Cells (headingrow + 4, datacol) .Value
The value of the first breakpoint is used to determine the break mode
("NUMERIC", "VALUE", "LOG", or "COUNT"), libreak(l) (as determined
above) is numeric, then the mode is set to "NUMERIC", otherwise, if break(l) is
one of "VALUE", "LOG", or "COUNT", then the break mode is set to that mode,
otherwise the break mode is set to "TEXT". Next, the function determines whether multiple columns were specified, in
which case it determines whether the user selected to re-scale all the columns.
The user's re-scale selection is determined by: rescale_all_string = Cells (headingrow + 15, datacol) .Value
Now the rest of the breakpoints (if any) are read. If the break-mode is
"AUTO" then the breakpoints are set as follows:
For i = 2 To lastbreaknum break(i) = Cells (headingrow + 3 + i, datacol) .Value
Various possible errors are checked for. E.g., if any breakpoints are
missing (i.e., ή ' break(I) is empty, the user is notified. Also, if the break mode is
"NUMERIC" and non-numeric breakpoints are set, the user is notified. If
numeric breakpoints are not in increasing order, the user is notified. As noted
above, generally, in preferred embodiments of the present invention, a great deal
of error checking is performed on all user inputs to ensure that the values are
correct and consistent. Most error checking is omitted from this description.
The CallColorColumnSubroutine maintains an array, colorname, which
maps integers to colors. In preferred embodiments, there are fifty six (56) colors
available. To use the higher numbered colors, the computer's video card must be
set appropriately. Using the colorname array, the program next associates the user
provided color names with index numbers. Specifically, for each of the (up to six
in a preferred embodiment) colors specified, the user specifies an actual color
name, denoted cname. This name is determined for each color by:
For j = 1 To ncolors cname = Cells (headingrow + 8 + j, datacol) .Value The interior of each color-specifying cell is then colored by the appropriate
(selected) color by setting the color property (Interior. Colorlndex) of the cell:
Cells (headingrow + 8 + j, datacol) . Interior .Colorlndex = Color (j ) ,
where the value of the variable^ ranges from 1 to ncolors.
Then the cell is further processed by a function TextContrast.
Call TextContrast (Cells (headingrow + 8 + j , datacol ) ) With the parameters read and checked, the system is ready to process and
color the selected sheet (specified at cell 146 in FIGURE 6A). The selected
columns (specified in cell 148 in FIGURE 6A) in the selected sheet are processed
one-by-one by the following program code:
Call Parselnput ( InString, inspecif ier) For Each singlearea In Range ( inspecifier) . Areas
For Each c In singlearea . Columns colnum = c . Column
Call ProcessOneColumn ( colnum, ncolors , break ,
Color, breakmode , rescale_all , sheetname ,
StartColoringRow , FinishColoringRow) Next c Next singlearea
The processing performed by the function ProcessOneColumn is as
follows: The column designated by colnum on sheet sheetname is to be colored
according to the breakpoints in the array break and the colors in the array colors.
The designated column is colored from the row corresponding to
StartColoringRow to the row corresponding to FinishColoringRow. Note that the function ProcessOneColumn is also provided with the break mode and the
variable rescale _all
Function ProcessOneColumn first calculates the automatic breakpoints, if necessary. Note that automatic breakpoints are determined from the whole column, even if this call says to color only a limited range of rows. If the value of
breakmode is "VALUE" or "LOG" and the value of rescale all is set to "True" Or the value of the first breakpoint (break(l)) is set to "VALUE" or "LOG", the program calls the function AutoValueBreakpoints as follows:
Call AutoValueBreakpoints (colnum, colletter, ncolors, break, Color, breakmode, rescale_all) .
Otherwise, if the breakmode is set to "COUNT" and the value of rescale all is set to "True" or the first breakpoint (break(l)) is set to "COUNT",
then the program calls the function AutoCountBreakpoints, as follows: Call AutoCountBreakpoints (colnum, colletter, ncolors, break, Color, breakmode, rescale_all, sheetname) .
With the breakpoints calculated, the columns are colored according to the
type of breakpoints specified by the user. Specifically, when the breakmode is any
one of "VALUE", "COUNT", "LOG", or "NUMERIC", the system executes a
function ColorNumericColumn. On the other hand, when the breakmode is
"TEXT", the system executes a function ColorNumericColumn. The VB code for
this is as follows:
Select Case breakmode
Case "VALUE", "COUNT", "LOG", "NUMERIC" Call ColorNumericColumn (colletter, ncolors, break, Color, StartColoringRow, FinishColoringRow) Case "TEXT"
Call ColorTextColumn (colletter, ncolors, break, Color, StartColoringRow, FinishColoringRow) End Select
Then, when the rule in control panel 144 is selected for execution, the rule
is applied to the selected column(s) (denoted in cell 148) of the named sheet (in
cell 146). For each column in the named sheet, the value in each cell is compared
to the various breakpoints and the cell is colored corresponding to the appropriate
breakpoint.
Examples of the application of various coloring rules in the "COLOR
CONTROL" worksheet of FIGURES 4A-4B, are shown with reference to the data in
worksheet "DEMO 2" (depicted in FIGURES 9A, 9B, 10A and 10B).
Color-Mediated Data Mining
As noted above with reference to FIGURE 2, once the data have been
colored according to the user-selected rules (at 126), the user can then perform
color-mediated data mining (at 128). The presently preferred embodiment of this
invention provides five mechanisms (each discussed below) for color-mediated
data mining, namely mechanisms to:
1. enlarge/shrink cluster starts;
2. vertically re-scale the display;
3. score and sort clusters; and
4. score and sort dose-response data. The following discussion refers to the cluster control worksheet which is
shown in FIGURES 11A-11B.
1. Enlarge/Shrink Cluster Starts
The "Enlarge Cluster Starts" mechanism highlights the first row of each
cluster in clustered data by enlarging the font of the cell containing the cluster number or label, thus enabling size reduction of the spreadsheet for the user to focus on the color patterns. When the cell height is dramatically reduced in order
to see more cells on a screen or printed page, this enlargement allows the user to still read the label at the beginning of each cluster. The mechanism takes user
input from a Cluster Control worksheet. A corresponding mechanism ("SHRINK
CLUSTER STARTS") allows for undoing the enlarging. This mechanism handles
cluster numbers or textual labels. Any column can be designated as the cluster labels to be processed. Operation of the mechanism is as follows:
(1) From the "CLUSTER CONTROL" sheet 314 pick a control panel to
use (one which is empty or one containing inputs no longer
needed). On this sheet, a single control panel extends vertically
through the black, blue, red, and green sections, and provides input
information for several tools.
(2) In the blue section, enter a sheet name and the column to be considered as the cluster labels.
(3) Click either the blue-text "Enlarge Cluster Starts" or "Shrink Cluster Starts" button. The program code accomplishes this by scanning the column of cluster labels, identifying any entries that are different from the one immediately above,
and enlarging them. Code that carries out this function is shown below:
For Each c In Range (Cells (3 , colnum), Cells (lastrow, colnum) ) irow = c.Row - 1 icol = c. Column
If c.Value <> Cells(irow, icol) .Value Then c. Font. Size = bigfontsize
' Rows (Irow + 1) . RowHeight = bigrowheight End If Next c
Example
An example of the application of the enlarge cluster mechanism is shown
in FIGURE 12 which shows the application of a rule (shown in the control panel
FIGURE 11C) from the cluster control worksheet in FIGURE 11B to the data of
worksheet "DEMO 2" as shown after coloring in FIGURES 10A-10B. As shown
in FIGURE 11C, the rule is to be applied to column B of sheet "DEMO 2".
2. Vertical Display Re-Scaling
The vertical re-scaling mechanism operates by taking a user-provided scale
factor and then changing height of data rows to facilitate visualization of large-
scale color patterns. The mechanism leaves column heading heights and column
widths unchanged. This makes headings remain readable and greatly simplifies
examining long columns of data for color patterns. FIGURES 13A-13D show the application of the vertical display re-scale
mechanism according to the present invention. FIGURES 13A-13B show some of
the data in the worksheet labeled "DEMO 3" 306 (FIGURE 13 A shows the first
thirty eight or so elements and FIGURE 13B shows the remaining elements of that
worksheet). As can be seen from the figures, the worksheet "DEMO 3" 306 has three hundred and twenty eight (328) data entries (in rows 2-329). The user can vertically scale the display by selecting "Re-scale Vertical" from the system's
special menu or by pressing a particular control key sequence (e.g., "CNTL- SHIFT-V" in a preferred embodiment). This presents the user with a dialog box
318, as shown in FIGURE 13C, which asks the user to enter a scaling factor relative to the current size. The user enters a scaling factor to enlarge or reduce or restore the display. In the example shown, the user enters a scaling factor of 0.1
which produces the vertically scaled display shown in FIGURE 13D.
Vertical scaling allows a user to get an overview of the data, based on the
coloring.
The portion of the program code presented below carries out the central function of the vertical display rescaling mechanism:
rowspec = "2:" & lastrow ' leaves the headings unchanged, i . e. , readable
If factor = -1 Then
Rows (rowspec) .Rows . AutoFit Else
For irow = 2 To lastrow Rows (irow) .RowHeight =
Rows (irow) .RowHeight * factor Next irow End If After execution of the rescaling mechanism, as can be seen in
FIGURE 13D, the height of each row (except the heading rows) has been scaled by
factor, 0.1 in the example shown. In this manner, all rows of the data are made
visible on a single page, thereby facilitating data analysis.
3. Scoring and Sorting Clusters
Scoring and sorting clusters assign numerical scores to the color patterns
of individual rows or clusters of rows, thereby enabling comparison and sorting of
the clusters by score.
The scoring and sorting mechanism accepts user's designations of colors
and corresponding relative scores. It handles cluster numbers or textual labels.
Any column can be designated as the cluster labels to be processed. The
mechanism scores a user-selected list of columns of data, with user-assigned
relative weights, which need not be equal for all columns.
User input is taken from a Cluster Control worksheet 314 (see
FIGURES 11 A and 11B), which stores any number of parameter sets, each one
with a user-specified name.
The input data is automatically sorted by cluster label before starting, in
order to group the clusters together in case the user has previously sorted the data
by some other criterion. Then scores are normalized to remove the effects of
cluster size, absolute magnitude of scoring points chosen, and absolute size of
weights chosen. The results are written to two new worksheets without altering
the original data sheet. The first derived sheet is for the numerical scores; the
second is like the original, but has the clusters sorted into descending score order, so that the "best" are at the top, removing the need to visually scan a long colored worksheet. The derived output sheets have names that indicate their source data
sheet and the name of the parameter set used for scoring. At the user's option, the system reversibly hides the un-scored columns in the cluster-sorted output sheet,
focusing attention on the data that were used in scoring.
In preferred embodiments, the system detects uncolored cells in the data and offers the user two programmed modes of dealing with them, (uncolored =
entry on user's list of scores or uncolored = "average of other colors in row"), or
the option of stopping to color them manually.
If the user designates a column of individual compound labels as the
"cluster labels," then the system compares single compounds rather than clusters.
The mechanism operates as follows, with reference to FIGURES 11A-11C.
(1) On the "Cluster Control" sheet 314 the user picks a control panel (e.g.,
panel 1100) to use (a panel which is empty or one containing non- needed inputs). On this sheet, a single control panel extends vertically through the black, blue, red, and green sections, and provides input information for several tools.
(2) In the top black section 1102 of the selected control panel 1100, the
user gives this new parameter set a name if not already done. The
name will be used to label the outputs.
(3) In the blue section, the user enters a sheet name (in 1104) and the column (in 1106) to be considered as the cluster labels. Note: To score each compound separately rather than in clusters, enter a column with individual compound labels as the "Cluster Col." (4) The red section of the control panel is divided into two parts, with its action button 1108, with red text "Score and Sort Clusters", in the
middle. Above the button, enter the names of the colors 1110 to be
assigned point scores, along with their corresponding point scores
1112. The scores are arbitrary and relative; they will be normalized by the system as necessary. However, a user should be sure always to assign higher point scores to colors which denote favorable values, and
lower point scores to colors which denote unfavorable values. The cells with entries need not be colored, and need not be in score order,
because the system will color and sort these cells when run.
When assigning point values, a user should be aware that
uncolored cells (which are most likely blank, i.e., unknown data) may have quality values above or below those that contain grouped and colored data.. The user may decide that some of the colored groups
are "better" or "worse" than data being unknown, and can assign a
score to the color "NONE" accordingly.
(5) Below the "Score and Sort Cluster" button 1108, the user enters the
columns 1114 to use for scoring (using the same syntax as for the Data
Coloring) and their corresponding relative weights 1116. The numbers
for weights are arbitrary and relative; they will be scaled by the system
as necessary. Note that a line with multiple columns will assign the
entered weight to each of the columns.
(6) The user the selects (clicks) the red-text "Score and Sort Clusters" button 1108. (7) When the scoring and sorting tool runs (on the system 100), if the
system detects uncolored cells in the data, the user will be offered two modes of dealing with them automatically, or a third manual option of stopping to color them. The two modes are:
• "Use score for the color "none" on my list"
(RECOMMENDED)
• "uncolored = average of other colors in row".
(8) The program then scans the chosen columns in each row and adds up
the chosen column's color scores for that row. These scores are then averaged for each cluster of rows, as defined by the user-selected "cluster column." The VB program code which accomplishes this is as follows:
For icol = 1 To ncols colorcode = Cells (irow, colnum(icol) ) . Interior .Colorlndex colorfound = False ' Add up the weighted scores For j = 1 To ncolors
If (icolor(j) = colorcode) Then j score = j colorfound = True Exit For End If
Next j
If colorcode = xlNone And treatblanks = "AVERAGED" Then colorfound = True
If Not colorfound Then Cells(irow, colnum (icol) ) .Select
If colorcode = xlNone Then thiscname = "none" Else
Call ColorNameToIndex (thiscname, colorcode, True) End If addscore = score (jscore) * colweight (icol) cmpdscore = cmpdscore + addscore ' Next IF-THEN-ELSE block is
' special calculations for the "averaged" mode If colorcode = xlNone Then lostweight = lostweight + colweight (icol) minscore = Application.Min (minscore, 0) maxscore = Application.Max (maxscore, 0) Else cmpdscore2 = cmpdscore2 + addscore End If
Next icol
( 9 ) The scores are then normalized for the various cluster sizes (number
of rows per cluster), and scaled to a value of one hundred (100) for a
row which is colored entirely with the user's highest-scoring color and
a value of zero for a row that is colored entirely with any color to
which the user has assigned a score of zero.
If clusterscore (icluster) = 0 Then ' do nothing
Elself clusterscore (icluster) > 0 Then clusterscore (icluster) =
100 * clusterscore (icluster) / (nrows * maxscore) Elself clusterscore (icluster) < 0 Then clusterscore (icluster) =
100 * clusterscore (icluster) / (nrows * (-minscore)) End If (10)The results are presented as two newly inserted worksheets. The first
is named by appending the word "SCORES" to the name of the
original data sheet, and contains a list of the clusters with their sizes
and scores.
(1 l)The second new sheet is named by appending the word "SORTED" to
the name of the original data sheet. The "SORTED" sheet contains a
copy of all the original data and coloring, but with the rows re-ordered
to place the highest-scoring clusters at the top, and all the clusters in
descending score order from there down.
(12)The user has two additional options regarding the appearance of the
"SORTED" sheet: (a) a column containing the numerical scores can be
added; and (b) the columns that were not used in the scoring can be
hidden, so that only the ones actually used remain visible.
An example of user provided data is shown in the control panel in
FIGURE 11D which is taken from the cluster control worksheet shown in
FIGURE 11 A. As shown in FIGURE 11 A, the parameters are stored with the name
"Cmpd" 1102. The scoring a sorting parameters in the control panel 1100 of
FIGURE 11D give the color red a score of "-1", orange has a score of "0", yellow
has a score of "1" and light green has a score of "2". Columns C and D have
relative weights of "1", as does column E.
Note on the output of score and sort clusters: The system inserts two new
sheets after the data (see, e.g., FIGURES 14C-14D). The first added sheet contains
two score columns: the scores generated by both of the auto modes (uncolored =
zero and uncolored = average), but the one not selected will be gray. The scores are on a scale of "-100" to "+100", where a score of "-100" means that all cells
had the maximally negative score available, and a score of "+100" means that all cells had the maximally positive score available. The second added sheet has clusters sorted according to the one auto mode chosen when the tool ran. The
routine offers to hide all columns that were not used in the scoring and sorting.
The user can selectively unhide certain columns by using the "Edi GoTo" menu
option (or typing "CTRL-G"), enter the columns in the "Reference" box (for example, C:F), then pick the "Format: ColummUnhide" menu option.
If the user wants to see a color-score-sorted list of compounds within a particular cluster (such as the best cluster), the user should do the following:
1. Sort by clusters to find the ID of the cluster wanted.
2. With a second rule, sort by compounds.
3. Go to the "SORTED by Compound" results sheet and turn on EXCEL' s
"Data:Filter:AutoFilter" feature for the column that specified the
clustering in the first sort. The user can then choose to view only the
compounds in one particular cluster, and they will be in compound-
sorted order.
Example
With reference to the already-colored worksheet "DEMO 1" shown in
FIGURES 8A-8B, the cluster control worksheet shown in FIGURE 11 A, and the
control panel shown in FIGURE 11D, application of the scoring and sorting of
clusters is described. As noted above, in the control panel of FIGURE 11D, the
parameters are stored with the name "Cmpd" 1102. The color red has a score of "-1", orange has a score of "0", yellow has a score of "1" and lightgreen has a
score of "2". Columns C and D have relative weights of "1", as does column E.
Application of control panel "Cmpd" of FIGURE 11D, by selecting "Score
and Sort Clusters", produces the worksheets shown in FIGURES 14A-14B. When
the user selects the "Score and Sort Clusters" button 1108 for the "Cmpd" control
panel of FIGURE 11D, the system first presents a dialog box (1402 shown in
FIGURE 14A) asking the user how un-colored cells should be scored for sorting.
As noted above, un-colored cells can be scored explicitly by user entries
(recommended) or as the average of the colors in the same row. Once the user
makes a selection and clicks on the "OK" button, the system scores and sorts the
data, producing the display screen shown in FIGURE 14B. The system provides a
summary of what was done, including the information about the two new sheets
("DEMO 1 SCORES by Cmpd" and "DEMO 1 SORTED by Cmpd un=ze")
which are added to the workbook. FIGURES 14C-14D show the data in the newly
created worksheet "DEMO 1 SCORES by Cmpd".
Example
With reference to the already-colored worksheet "DEMO 2" shown in
FIGURES 10A-10B, the cluster control worksheet shown in FIGURE 11 A, and the
control panel shown in FIGURE 11C, application of the scoring and sorting of
clusters is described. In the control panel of FIGURE 11C, the parameters are
stored with the name "acids" (1102). The color red has a score of "0", orange has
a score of "1", yellow has a score of "2" and light green has a score of "3".
Column D has a relative weight of "1". The application of the parameters or rules in the "acids" control panel
produces two new worksheets ("DEMO 2 SORTED by acids" and "DEMO 2
SCORES by acids") shown in FIGURES 15A-15B.
4. Score and Sort Dose-Response Data.
Data grouping and visualized by color coding has also been found to
enable an automated solution to another vexing pattern recognition problem. An
HTS lab is currently able to provide dose-response data on some subset of the
whole collection of compounds originally tested. Sometimes, logistical
constraints (time and/or cost) dictate that only a few concentration points can be
run on each compound, and the high-throughput nature of the process generates
somewhat noisy data. A similar situation sometimes exists in other biological
laboratories where assays are very time-consuming. Dose-response curves with
few, noisy points are difficult to analyze by traditional curve-fitting methods. The
present invention includes a mechanisms/algorithms for analyzing percent-of-
maximal-effect data and accurately ordering the compounds by potency, even
when faced with few points and high noise.
The mechanism recognizes two properties of the dose-response data for
each compound:
1. "Dose-responsiveness," the drop-off of activity with dilution, is taken
as a sign that the compound has some reasonable pharmacological
mechanism of action. 2. The activity measurements at the various concentrations also provide a
confirmation of the general level of each compound's activity that was
indicated by the original single-poke HTS hit.
These two properties are somewhat independent, as illustrated by the
example of a compound that is 95% active at all tested concentrations. It
demonstrates very poor (i.e., no) dose-responsiveness over the range of
concentrations tested, but is so active that it should not be ignored, because it
might reveal a dose response if tested at even lower concentrations.
By using the data groupings and color codes of the dose-dependent activity
data columns, which help to smooth out excessively fine distinctions in the
numbers, this invention includes an algorithm to assign numerical scores for dose-
responsiveness and overall activity in the dose-response data. Moreover, the
algorithm also calculates a smart composite of these two scores, in such a way that
a highly active compound will get a high composite score even if its dose-
responsiveness is poor. This composite score is capable of extracting useful
information, even from very noisy data, and has been validated to correctly order a
list of test compounds. The system of this invention adds data columns that report
all three scores for each compound, and these columns can themselves be color
coded, and thus used in further comparison to other types of data by compound or
cluster scoring and sorting as described above.
Moreover, within certain limits, the invention's dose-response scoring
algorithm can also be used to make quantitative estimates of IC50 values of
compounds, even in the presence of large amounts of experimental error. This is
accomplished by adding a set of hypothetical marker compounds with known
potencies and theoretically calculated activities at the test concentrations. Since the ordering algorithm is reliable, these markers will be ordered into their
appropriate place, and can be used to calibrate the ordering scores in terms of
actual IC50's. In other words, estimates of IC50 for the compounds can be generated by interpolating between the markers in the ordered list of composite
scores.
Scoring and sorting dose-response data according to the present invention
processes several columns of colored dose-response data (activity vs. concentration) to assign three numerical scores that can later also be color coded,
and thus used by the "Score and Sort Clusters" mechanism (described above) to compare compounds or clusters of compounds. The three scores are:
(a) degree of dose-responsiveness over the concentration
range tested;
(b) overall activity level demonstrated in the dose-response
data; and
(c) a variably weighted composite of (a) and (b), designed
to give high scores for high activity even when dose-
responsiveness is poor (e.g., a compound that is highly
active at all concentrations).
The scoring and sorting dose-response data according to the present
invention bases its scoring on colors rather than absolute activity numbers. The
mechanism takes user input from a Cluster Control worksheet, e.g., as shown in
FIGURES 11A-11B. FIGURE 11B shows a control panel from the cluster control
worksheet shown in FIGURE 11 A, wherein the user has selected columns F to I of
worksheet "DEMO 1" for scoring dose-response. The system detects uncolored data, notifies the user, and asks whether to continue. If yes, the system skips the row containing the uncolored data. The
system inserts three new columns on the original spreadsheet to contain the new
scores, the new columns immediately following the columns of dose-response data. The column headings show the name of the parameter set used for scoring.
Preferably, the system offers to regenerate existing table of Sheet Statistics to correct it for newly added score columns. Further, the system offers to sort the
data rows by decreasing score. The system also offers to carry out quantitative
estimates of IC50 values for the user's compounds, by adding artificial calculated calibration marker compounds .
In order to score and sort dose-response data:
(1) Ensure that the dose-dependent activity data columns are ordered with highest concentration at the left and lowest concentration at
the right. To ensure this, the system will remind the user of this
requirement and ask him to confirm it when this tool is run. Note:
If the data are for an undesired effect such as toxicity, the columns should be ordered the opposite way (lowest concentration left, highest right).
(2) Use the Data Coloring (described above) to color the dose-response
data columns.
(3) Go to the Cluster Control sheet 314 (FIGURES 11A-11D) and pick
a control panel 1100 to use. On this sheet, a single control panel extends vertically through the black, blue, red, and green sections, and provides input information for several tools. (4) In the top black section, give this new parameter set a name (1102)
if not already done. The name will be used to label the outputs.
(5) In the blue section, enter a sheet name (1104).
(6) In the red section (1110), enter the colors used to color the dose-
dependent data, and relative point scores (1112) to be assigned to
these colors.
(7) In the green section (1118), enter the columns which contain the
dose-response data (using the same syntax as for Data Coloring).
(8) Click the green-text "Score Dose-Response" button (1120).
(9) If the data are expressed as "percent of maximal effect," the user
can follow the prompts to add calibration markers and make
quantitative estimates of IC5o's.
Note on the output of score and sort dose-response: the system inserts
three score columns after the dose-dependent data. The three scores are all scaled
to a 0-100 range, and have meanings as follows:
(a) degree of dose-responsiveness over the concentration range
tested:
100 = smoothly decreasing with dilution, spanning the
entire range of color groups;
75 = flat dose-response; and
<75 = even more poorly behaved
(b) overall activity level demonstrated in the dose-response data
100 = highest activity color group at all concentrations. (c) a variably weighted composite of (a) and (b), designed to give
high scores for high activity even when dose-responsiveness is poor (e.g. a compound that is highly active at all concentrations).
The Dose-Responsiveness Scoring Algorithm
The data columns are ordered left to right, by decreasing concentration.
The scoring algorithm awards positive score points for each dilution step across the data that actually shows a decrease in the activity data group (i.e., the color), and to penalize every step that does not. The algorithm uses the following
scoring:
• +1 point (awarded) when a dilution step moves to a lower activity group
• 0 points when a dilution step leaves the activity group unchanged
• -3 points (penalty) when a dilution step moves to a higher activity group
The maximum score would go to a compound that shows all the possible
color group steps in the right direction, and has no reversals. The minimum score
would go to a compound with all the possible reversals, and no correct steps. The
program then scales the extremes to 100 and 0, in order to present a consistent
interface to the user.
The relative magnitudes of the scoring parameters were empirically arrived
at by testing "complete sets" of color patterns. This is possible because of the data
simplification afforded by the value grouping. If we define the following
numerical parameters:
C ≡ number of colors used, i.e., number of data value groups P ≡ number of points measured, i.e., number of different concentrations (doses) tested,
then the entire "universe" of possible color patterns includes (Cp) different cases. For some typical values that might be encountered in real HTS data, this total
number of cases is manageable in EXCEL, as shown by TABLE 1, below.
TABLE 1. Total Number of Color Patterns
P = C = total number
# of cone. # of color groups of
Points possible cases
3 3 27
3 4 64
4 3 81
4 4 256
5 3 243
5 4 1024
6 3 729
6 4 4096
7 3 2187
7 4 16384*
* For a spreadsheet with a heading row, this exceeds EXCEL' s current limit (for EXCEL 95) by one. This value should not exceed the limit for Excel 97. Scoring was done on several of these complete sets. In each set, the results were sorted by decreasing score and compared to "intuition" for general
correctness of ordering of dose-responsiveness, and scanned for cases deemed to be clearly out of order. The (+1, -3) score set was found to produce satisfying ordering, while lesser penalties led to poorly ordered results. More objective tests of ordering (described below) were then used to further validate the algorithm
The case of P=3 and C=3 is presented below in its entirety for illustration.
TABLE 2 (FIGURE 16F) shows artificial data and processing for twenty seven (27)
hypothetical compounds. The "percent inhibition" columns represent assay "data." If one defines three groups by breakpoints at 33% and 66%, each cell is assigned to a data group as shown in the middle set of three columns. Here it is
clear that the order of compounds in this table is systematic (111, 112, etc.), to
illustrate that the complete set is present. The third set of three columns shows
color coding, with the darkest being least active and the lightest being most active.
Then the data set was processed by the system to yield dose-
responsiveness scores, and the results sorted by this score, giving TABLE 3
(FIGURE 16G), the complete set in order of decreasing dose-responsiveness.
TABLE 3 also shows the intermediate step-scoring and unsealed score points, to
aid in following and understanding the algorithm. These points are not displayed
by the system itself.
The Overall-Activity Scoring Algorithm
The second property of interest to be extracted from the data is the overall
activity level exhibited by each compound. As explained above, this is largely
independent of the dose-responsiveness.
The data value groups' ordinal index numbers are used as single-point
activity measures instead of the original data numbers. Extra weight is given to
activity shown at lower concentrations by the simple algorithm of weighting each
data column by its serial position, again ignoring the actual concentration values.
The scores are then scaled to the range 0 to 100. The results of this scoring on the
same complete set are shown in TABLE 4 (FIGURE 16H) which has been re-sorted
by decreasing overall activity. The Composite Scoring Algorithm
Comparison of Tables 3 and 4 (FIGURES 16G & 16H) shows clearly that
the compound ordering by dose-responsiveness is quite different from the ordering
by overall activity. The user (a chemist) could now color-code the new score
columns and use them as independent factors in a larger scoring. However,
chemists also want a single index of compound quality derived from the dose-
dependent data. Moreover, a composite index would further help to alleviate the
effects of noise on data interpretation, by incorporating more information into the
ordering process. This is an "information-based smoothing" of the data.
Therefore, a procedure to calculate a third, "smart composite" score from the other
two scores was devised.
The general idea is that when selecting good compounds from dose-
response data, compounds showing overall high activity should not be discarded
for lack of responsiveness. Therefore, the smart composite score should give
more weight to the overall activity when the overall activity is high, but lower
weight when it is low. A generalized weighted average is written as
composite score = (activity weight) (activity score) +
(responsiveness weight) (responsiveness score)
or, defining corresponding symbols:
SC = (WA)(SA) + (WR)(SR)
If the weights are normalized to sum to unity, then this becomes
SC = (WA)(SA) + (1- WA)(SR)
The activity weight WA varies with the activity score SA in such a way as
to achieve the desired result. The functional form of this variation was the subject of empirical testing.
It was decided that the limits would be that WA would approach 0.5 (activity and
responsiveness equally weighted) in the limit of low activity, and that WA would
approach 1.0 (responsiveness ignored) in the limit of high activity. The actual
variation was encoded as an exponential increase in order to have rather sharp
onset of the activity bias at higher activities:
WA = (C1) exp [(k)(SA)] + C2
The value of the coefficient k=0.06, for which the activity bias starts to
become substantial around an activity score of eighty (80), was chosen for
implementation in a preferred embodiment of this invention, according to
empirical results. FIGURE 161 shows the variation for a few values of k. TABLE 5
(FIGURE 16J) shows all three scores for the example complete set, now sorted by
decreasing composite score.
The details of the scoring algorithms were arrived at largely by comparing
results to intuitive ordering of the test cases in the complete sets. Because the sets
were complete, no really new results can be generated by further test sets.
However, one can generate test activity data sets from compounds of known
potencies, whose real rank ordering is thus known, in order to see more
objectively how well the scoring algorithms rank the results.
To this end, a set of pseudo-ligands was hypothesized, with dissociation
constants from a fictitious receptor ranging from nanomolar to millimolar (pK = 9
to 3). The set included thirty one (31) compounds, with potencies evenly spaced
by 0.2 log units (9.0, 8.8, 8.6, ... , 3.4, 3.2, 3.0). A "pseudo-screen" was created
which "tested" binding of these ligands at five concentration points in the usual range: 10"5 M, 3xl0"6 M, 10"6 M, 3xlO"7 M, and 10"7 M. Note that the span of
potencies exceeds the span of concentrations tested by two log units on each end,
so the test set includes both "very active" and "very inactive" compounds relative
to the screening concentrations.
Then artificial binding data were created by calculation as follows.
Assuming a simple binding equilibrium of the ligand to a receptor, the "percent
inhibition" at a given ligand concentration is equal to the fraction of receptor sites
which are occupied by ligand, given by simple equilibrium equations as
Pinhib = 100 • (ligand) / [K + (ligand)]
For a more realistic simulation, artificial random noise was then added to
the calculated numbers. The first experiment reported below used noise randomly
distributed over the range of ±10 inhibition percentage points, and the second
with noise up to ±30 inhibition percentage points. Note that this means ten or
thirty percentage points of absolute error, not 10% or 30% of the value.
The artificial data were then color-coded according to the mechanisms of
this invention (described above) into four color groups, using the simple
breakpoints at 25, 50 and 75 percent inhibition. Note that in assigning these
breakpoints, no consideration was given to the actual data values. Then the
scoring algorithms of this invention were run, and the compounds sorted by the
composite score. Rank order numbers were assigned to the compounds, with 1
being the most potent and 31 the least. In cases of ties in the composite score,
equal rank numbers were assigned, with a value equal to the average of the rank
numbers spanned by the tied group of compounds (e.g., a tie for 2nd and 3rd would result in each compound being ranked "2.5"). For each experiment, the
final rankings were plotted against the "real" rankings by known potency, to test
how well the scoring algorithms ordered the compounds. These plots are shown
in FIGURE 16K (for noise=10) and FIGURE 16L (for noise=30).
For the experiment with noise up to 10 inhibition percentage points, shown
in FIGURE 16K, the ranking of the composite scores is "perfect" (in the sense of
having no inversions) over the range of tested concentrations (pK = 5 to 7). The
pseudo-screen is unable to distinguish the potencies of compounds above or below
this range.
When the noise is much higher (30 percentage points), the ranking of
individual compounds is not as precise, but one can identify three cleanly divided
"good-medium-bad" groups, as indicated by the dashed boxes on FIGURE 16L.
Thus, even with this rather extreme noise level, the invention's scoring still
successfully prioritizes the compounds into groups. The range where
discrimination is effective is still roughly the range of the test concentrations (pK
= 5 to 7), but has been reduced somewhat by the higher noise. Note that the
ranking within this range (the middle boxed group) is still mostly correct, with
only one inversion, even for single compounds.
Quantitative Estimation of Potencies
With confidence established that the algorithms provide reliable rankings
of compounds by potency, it is possible to proceed to making quantitative
estimates. The method uses calibration marker compounds.
To understand this method, it is helpful to realize that the concept is
analogous to the quantitative use of SDS polyacrylamide electrophoresis gels to measure protein molecular weights. The proteins are known to migrate through
the gel with speeds directly dependent on molecular weight, but it is difficult to
calculate the absolute migration rates for a particular experiment. In dose-
response scoring, the compounds are known to be properly ordered, but it is not
clear how to calculate a potency (e.g., KdiSS or IC50) directly from the score.
Protein chemists solve the molecular weight problem by running marker
proteins, with know molecular weights, in the same gel, then using their band
positions as calibration for the unknowns. Analogously, this invention's
quantitative estimation method uses hypothetical marker compounds of know
potency to internally calibrate the dose-response composite scores for the user's
choice of a coloring rule, then interpolates the potencies of the unknowns.
To create markers, the system asks the user to input the concentrations
used for each of the dose-dependent activity data columns. The system then picks
a set of calibration concentrations, at intervals of 0.5 log units, to span the tested
range. For each of these calibration concentrations, a marker compound is created
and added to the user's compound list, and artificial data is calculated for each
column, from the same simple equilibrium binding equation used above in the
validation study (this time with no "noise"):
Pinhib = 100 • (ligand) / [K + (ligand)]
The marker data are then colored by the same rule used for the user's
compounds, and the scoring and sorting algorithm is re-run.
The result is that the markers are sorted into the list according to their
potencies, and the potencies of the other compounds can be estimated by interpolating between the markers, using the composite dose-response scores. To
illustrate, a typical section of a sorted list is shown below in TABLE 6
(FIGURE 16M), using four colors.
Potencies for compounds that fall between two markers are calculated by
linear interpolation between the logarithms of the markers. Given the various uncertainties in the data values themselves and in the evaluation process, it was
found that linear interpolation between markers spaced at 0.5 log unit intervals was sufficiently precise, and no more complex curve fitting was necessary.
Validation of Quantitative Estimation
Validation of the quantitative estimation method followed a procedure very similar to that used to validate the scoring, and using the same sets of test data with various noise levels. As before, the testing concentrations were from 10"
5 to 10"7 M (negative log from 5 to 7). Marker compounds (no noise) were added
with pK's from 4.5 to 7.5, and Kdjss estimates for the noisy compounds were
carried out by the interpolation method. The results are shown below for the cases
of 10 and 30 inhibition percentage points of noise.
FIGURES 16N and 16P show that the estimates are clearly quite good
within the range of the testing concentrations (pK 5 to 7), but the quality of
estimation deteriorates quickly beyond those limits, and algorithm does not
reliably distinguish among compounds whose potencies are more than a half log
unit beyond the testing range. Therefore, it was decided that presently preferred
embodiments would not report any estimated values that fell outside the range of
concentrations used in the testing data columns. Thus, in the example in TABLE 6
(FIGURE 16M), the lowest testing concentration was 10"7 M (= 0.1 μM). For the first compound in TABLE 6, the system has estimated a potency with pIC50 > 7,
but it conservatively only reports "<0.1 μM."
TABLE 7 summarizes the statistics of the estimations within the testing
limits. TABLE 7 shows that the method successfully estimates the potencies within about a factor of two, even with high noise levels.
TABLE 7. STATISTICS OF ESTIMATION VALIDATIONS
Figure imgf000055_0001
Comparison to Other Methods of Quantitative Estimation
Further corroboration was obtained by treating some real data from T-cell
proliferation blockage assays. It is estimated that these data have at least as much noise as the artificial test set with 30 inhibition percentage points added. The standard treatment of this data in the past has been to fit a dose-response curve with a Hill coefficient of 1 , using a PC-based program ORIGIN. (ORIGIN is a data analysis program from Microcal Software, Inc. of Northampton, Massachusetts. ORIGIN is used in this instance for non-linear least-squares
fitting of dose-response curves to functional equations.)
The data used here were from testing in the concentration range from 1 to
0.03 μM (negative log from 6 to 7.5). The plot in FIGURE 16Q shows the
correlation of values estimated by this invention with values from ORIGIN fits. Two compounds that the present invention estimates to be beyond the testing
range, i.e., pIC 0 below 6, are included as open diamonds, for illustrative purposes
explained below. (As explained above, preferred embodiments of this invention
would normally not report these values.)
The results are consistent with the properties observed in the validation
study. The present invention estimates are quite good within the testing range (6
to 7.5). At the lower limit, this invention has made two estimations exactly at 10'
M (arrows) which do not correlate as well with the ORIGIN fits. Nevertheless,
because the whole plot spans only a relatively narrow range of potency, even these
discrepancies are not very large. For all twenty eight (28) estimates within the
testing range (including these two), the average logarithmic deviation between the
two estimation methods is 0.18, corresponding to a factor of only 1.5.
It is further noted that the calibration marker estimation method does not
uniformly "flatten out" beyond the testing concentration range. The two open
diamonds in FIGURE 16Q are estimations that presently preferred embodiments of
this invention would not normally report because they have pIC50 < 6, but they
agree well with ORIGIN fits
It is interesting to compare calibration-marker with curve-fitting results for
particularly badly behaved data, such as dose-response curves that are not
monotonic with respect to concentration. This is sometimes the type of data that
emerges from dose-dependent screening in a high-throughput mode. Studies of
this type have been initiated by adding artificial noise to the extent of fifty (50)
inhibition percentage points.
Finally, it should be pointed out that there is some mechanical advantage
to using the present invention relative to current practice of using ORIGIN. ORIGIN is used by manually filling in a template with data, then manually
executing a fit. Depending on the number of points and the degree of
customization of parameters, this can take one to ten minutes of the user's time.
The present invention, on the other hand, processes a whole spreadsheet at once
(i.e., up to 16,383 compounds), and goes at a rate of about 3,000-4,000
compounds per minute on a 200 MHz PC.
Examples
FIGURE 16A shows dose response data for twenty (20) compounds at four
concentrations. The data have been grouped and the cells colored by the rule
shown in FIGURE 16B. The result of the scoring and sorting process is shown in
FIGURE 16C, where the compounds are ordered by decreasing values of the
composite score (column H). Then, virtual "marker" compounds are added with
known potencies spaced by 0.5 log units, and they are shown in FIGURE 16D,
colored by the same rule and scored. The name of each marker compound
designates the logarithm of its potency, e.g., "marker_7.0" has a potency IC50 =
10"7 M. FIGURE 16E shows the result of sorting the list by decreasing composite
score after adding the markers. This process then enables estimation of IC50
values for the compounds by interpolating in the column (H) of ordered composite
scores, and these estimates appear in two forms in columns I and J.
6^ Summarize Spreadsheet Statistics Mechanism
This mechanism creates a table summarizing the entries in each column of
a data sheet, to aid the user in deciding how to color each column. The mechanism counts numeric, text, and data entries, and uses color to flag columns
that have mixed types. The mechanism also counts blanks, and specially flags
columns with "trailing blanks," i.e., columns which are shorter than the longest
one on the spreadsheet. For numeric data, the mechanism calculates minimum,
maximum, mean, and standard deviation, even in the presence of interspersed text
entries. For text data, the mechanism presents a list of the text strings used and
their occurrence counts. The mechanism creates a summary key of the column
letters and headings as a text box that can be copied to other sheets for convenient
reference.
FIGURE 17A shows a sample spreadsheet containing miscellaneous data
on twenty four (24) compounds. FIGURE 17B is the statistics sheet calculated
from it. Each row of the statistics sheet describes one column of the original data
sheet. First, the counts of numeric, text, date, and blank entries are listed,
followed by two columns describing the total length of the data sheet. Then the
minimum, maximum, mean, and standard deviation of any numeric data are
reported. Finally, the statistics sheet lists a summary of the text strings found in
each original data column. As examples, in FIGURE 17B, one can see that
original column A ("Cmpd") had twenty four (24) different text strings, that the
numeric data in original column C ("Testl") had a mean of 2.385, and, flagged by
the red coloring, that original data column E ("Test3") had a mixture often
numeric data and two text strings, both "N.A."
The details of how the program code accomplishes this are
straightforward, and one of ordinary skill in the art would know, from this
description, including the Figures, how to make and use this invention. The program loops through all the entries in the column, testing the data type of each,
and tallying the counts and numerical statistics.
Spreadsheet Creation and Organization
The operations of this invention require a considerable amount of user
input, e.g., to create well-structured spreadsheets, to define and apply diverse
coloring rules for large numbers of columns, and to use these colors and the user's
stated scientific priorities to create meaningfully ordered lists of compounds or
clusters.
The user interface of this invention has been designed to ease this process
and help the scientist focus on the tasks of formulating and recording clear
descriptions of the evaluation parameters. Accordingly, this invention provides a
number of tools and mechanisms to aid in the creation and organization of
spreadsheets. These tools and mechanisms include:
• Smart Append Column Mechanism
Merge Data Mechanism
Data Import Mechanism
Workbook Navigation Shortcuts
Conversion of "uM" to μM and "UU" to μ
• Delete Pictures Mechanism
Change Values in Column Mechanism
Concatenate Values across Columns Mechanism
Delete Leading Inequality Signs Mechanism
Delete Derived Sheets Mechanism Smart Append Column
This mechanism appends new columns of data onto an existing
spreadsheet, matching rows by labels (e.g., compound numbers). The mechanism
copies all data to a new sheet before doing its work, leaving the original sheets
unchanged. There is no need for the user to pre-sort any of the data. The
mechanism provides optional case-sensitive or case-insensitive label matching.
New rows are added at the bottom when new labels do not match any old
labels. Rows with missing labels are identified and the system offers to fill them
by copying previous label. Rows with repeated labels (i.e., replicate data) are also
identified and the system offers a choice from among several automated
processing rules, or manual fixing. A fast matching algorithm temporarily sorts
rows by label, then restores original order when finished. Several intermediate
stopping points are offered and extra data viewing options for conservative users
worried about errors.
Merge Data Mechanism
The merge data mechanism copies new data values from an appended
column into an older column. The mechanism copies all data to a new sheet
before doing its work, leaving the original sheets unchanged. The mechanism
detects cells where new data would overwrite old data that is different, flags them
with color, and alerts the user. Several intermediate stopping points are offered to the user, as are extra data viewing options are offered for conservative users
worried about errors.
Data Import
One-button (or one-menu-click) import of existing EXCEL spreadsheets into an integrated file, which contains both the data and the related control sheets.
The mechanism offers to search for and remove any leading or trailing spaces in
the imported data and offers to consolidate replicate data rows into unique ones, using user choices as to how to handle the replicate data. The mechanism also detects hidden rows and offers to unhide them and detects formulas and offers to convert them to values. This mechanism is also used to update to newer version
of the system.
Workbook Navigation Shortcuts
The system includes various workbook navigation shortcuts including:
• A special added drop-down menu which includes commands for jumping
directly to the various control sheets. These commands also have keyboard shortcuts assigned to them.
• From a cell on a control sheet that contains the name of a data sheet, a
special item on the right-click shortcut menu jumps directly to that data
sheet. Other special items on this menu enable a "Twin Screen" display to
see two sheets at once. • To aid in choosing columns to enter on control sheets, there is a special
"Twin Screen" display triggered by right-clicking any cell on a control
sheet that contains the name of a data sheet.
Convert "uM" to uM and "UU" to u
Preferred embodiments of the system of this invention require the data
spreadsheet to have one and only one row of column headings. The user can type
either of the encoded strings "uM" (lowercase u, uppercase M) or "UU" (both
uppercase) into any column heading, select the cell or whole row of headings, then
pick this command. Each "uM" in the selection will be converted to "μM", and
each "UU" will be converted to a "μ". The code recognizes the special exception
of the word "VACUUM" as long as it doesn't end with the cases "uM." This
conversion allows the user to avoid the confusing use of lowercase "u" or the
column-widening use of the full prefix "micro." This utility appears on the
system menu.
Delete Pictures
The system provides a mechanism for removing pictures containing
chemical structures, in order to reduce file size, processing time, and confusion
when they do not align properly after row sorting. Change Values in Column
This is a mechanism for regularizing data in a spreadsheet column. It
facilitates replacement of all occurrences of a given value by another. The mechanism creates backup copies of the original column, and updates any existing
data statistics for the edited sheet.
Concatenate Values across Columns
The system provides a mechanism for regularizing data in a spreadsheet
column. Some possible uses include: (a) construction of unique row labels: M-
number plus stroke number - "M123456/001"; and (b) reconstitution of
numerical inequalities from separate columns: ">" plus a number — » ">number". The user is provided with an option to include linking (delimiting) text strings between values and an option to include or skip blanks. The system retains the original columns and inserts a new one for the results.
Delete Leading Inequality Signs
Another mechanism for regularizing data in a spreadsheet column includes
the mechanism to delete leading inequality signs. This mechanism converts
entries like ">1000" to just the number "1000". This must be used with
considerable caution, because it is the equivalent of creating a false test result. It
is generally preferable to color the cells containing text strings with the data
coloring mechanism described above, rather than alter them. All later processing
is based on the colors, not the cell values. This mechanism also deletes inequality sign only if it is the first character in the cell. The mechanism creates backup
copies of the original columns.
Delete Derived Sheets
The system menu includes a command to delete all output sheets from the
current workbook, with separate user confirmation for each one. This is intended
as a cleanup mechanism for information that may be outdated and is easily
regenerated by subsequent system runs.
Initial experience with the coloring tool has revealed that color coding has
more subtle, but far-reaching usefulness. The colors themselves also can act as a
"currency of exchange," a medium for comparing the quality of one kind of result
to the quality of a very different kind of result. For example, an HTS activity of
"95% inhibition" may be considered desirable and color coded, e.g., green. In the
same list of compounds, a molecular weight between 400 and 600 may be consid¬
ered optimally desirable, and thus also color-coded green. If the user takes care
when assigning colors, "green" takes on a common meaning across the board.
This translation of data values into colors then opens up a cornucopia of
possibilities for processing the colors (as numerical color indices) and comparing
compounds, searching, in our example, for the ones that are the "most green."
Accordingly, the system includes tools to numerically score individual
compounds or clusters of compounds by the colors that appear in their various
data columns. The system can then create a new spreadsheet sorted by this score (either by single compounds, or cluster-by-cluster, the choice being the user's), in
which the "most green" compounds will then appear at the top.
Examples
Application to Portfolio Management
The system can equally well be applied to any set of data where the rows
are cases of a similar construct, with the columns being various properties of each
case. For example, a data spreadsheet can contain a list of competing projects or
investments for a company's portfolio, with the columns containing various
managers' ratings of each project or investment. FIGURE 18A shows an example
of twenty projects, each of which has been scored 1, 2, or 3 on two factors, one
more important than the other, by each of three managers. The sheet has been
colored by the rule shown in FIGURE 18B. Then the data were scored and sorted
by the sorting rule of FIGURE 18C, and the result is shown in FIGURE 18D.
Clearly, the projects that were given a "3" in the important factor come to the top,
and it can be seen that the less important factor does indeed matter less to the final
ordering. The colors also help to flag anomalies, such as a low score by one
manager on an otherwise high-ranking project.
In general, the data can be various sorts of data. Some examples are listed
below and illustrated in the referenced Figures.
FIGURE 19 shows a list of drug candidate compounds, scored and sorted
by a composite of ten parameters that describe their physical, chemical, and
biological properties. Green shades indicated desirable values; red shades are undesirable. The display is compressed vertically with the vertical re-scaling tool
to clearly display the difference in coloring patterns between the top eighty (80)
compounds and the bottom eighty (80) compounds (separated in the illustration by
a blank band).
FIGURE 20 shows a list of proteins that are candidates for targets for drugs,
chosen from a pool of candidate genes, scored and sorted by a composite of eleven
parameters that describe their suitability.
FIGURE 21 shows a list of research projects competing for resources. Each
project has been scored according to several evaluation factors, and the whole
array has been sorted by color groups. The same construct is useful for evaluating
employee performance or job candidates.
FIGURE 22 shows a list of pharmaceutical companies and their current
status with regard to discovering or marketing products in each of various disease
areas. Each company's line of products has been scored according to the maturity
of the offerings, and the whole array has been sorted by color groups.
FIGURE 23 shows the use of data-grouping (coloring) rules to visualize the
time courses of drug concentrations in blood. In this example, light colors were
chosen to represent high concentrations of drug in the blood, while dark colors
were chosen to represent low concentrations. The figure shows a wide range of
differing time courses.
FIGURE 24 shows the use of data-grouping (coloring) rules to visualize the
matrix of pairwise cross-correlations of the results of eight (80) drug screens. In
this example, light colors were chosen to represent low correlations, while dark
colors represent high correlations. Quantitation of the Similarity of Data Grouping in Two Variables
As part of the present invention, a mechanism is provided for assigning a
quantitative measure to the degree of similarity of grouping (visualized by color
coding) of data in each of two columns of an EXCEL spreadsheet. The mechanism
allows for a correlation-like analysis on a wide variety of data types, including
text, or mixed numbers and text.
In the data-exploration paradigm of the present invention, one of the first
steps a user takes is to divide the range of data values in each column into a small
number of groups for further analysis, thus effecting a reduction of precision
which has been found to be useful in various ways.
It is sometimes useful to explore whether the rows of the data matrix have
been divided into similar groupings in each of two different columns. For
example, a researcher might ask, "Do the high molecular weight compounds tend
to be the ones whose solubilities fall below the limits of measurement?" In other
words, this would mean to compare the groupings in the molecular weight column
with the groupings in the solubility column.
If the data were strictly quantitative, this would be called correlation of
variables, and there exist a number of perfectly good statistical measures of the
phenomenon. However, one of the unique capabilities of the present invention
lies in dealing with textual data and mixtures of numbers and text, and it would be
helpful if one could translate the visible color patterns of the present invention to
some kind of quantitative measure of correlation. In order to avoid confusion with
standard statistical correlation, the distinct term "color grouping similarity" is used
to describe the new measure. In preferred embodiments of this invention, the data grouping is stored in
the form of the colored backgrounds of data cells. At first glance, one might
simply seek to compare the colors of the first column with those of the second,
and count the number of rows with matching colors. However, a color grouping
similarity tool must be able to cope with the possibility that the colors are
different. This could happen because the user chooses completely different color
schemes for the two columns, or because the correlation is negative. As an
example of negative correlation, suppose column A contains random numbers
between 0 and 1, colored such that those above 0.5 are green and those below 0.5
are red. Then imagine a column B where each value is equal to one minus the
corresponding value in column A, i.e., the "one's complement." If the user colors
the second column with exactly the same coloring rule as the first, every row will
have a different color in column B than in column A. None of the colors will
match, though the groupings are perfectly correlated. To be successful, the tool
must deliver a high measure of correlation between such pairs of columns. The
algorithm described below was designed to perform in this way.
Algorithm for Measuring Data Grouping Similarity
The algorithm was derived from semi-quantitative reasoning, as follows.
It is based on the qualitative question, "For all rows that have one particular color
in the first column, to what degree do they have a uniform color in the second
column (not necessarily the same color as in the first column)?" The quantified
answer to this question is then averaged over the set of colors used. The details of the mechanism can be seen by example. First, to compare
the grouping in two columns A and B, a matrix of "ordered color pair counts"
(OCPC) is defined such that each matrix element OCPC,y is the count of spreadsheet rows where one finds color . in spreadsheet column A and color y in
spreadsheet column B. Then, the rows of the two spreadsheet columns are
scanned to count the number of occurrences of each ordered color pair and thus to determine the values of the matrix elements.
In the discussion below, carefully distinguish the rows and columns of the
user's spreadsheet of data from the rows and columns of the derived OCPC
matrix.
If the two spreadsheet columns had exactly the same coloring, the nonzero
elements of the OCPC matrix would all be on the diagonal. As a simplified example, consider four colors (green, yellow, orange, red) and a total of 16 data
rows. The diagonal matrix might be (zero elements left blank for emphasis)
color in column A color in column B — > green yellow orange red green 4 yellow 2 orange 7 red 3
In the case above, there are groups of 4 spreadsheet rows colored green (in
both columns), 2 rows colored yellow, 7 rows colored orange, and 3 rows colored
red.
In contrast, if the groupings were the same, but the coloring rule for the
second spreadsheet column used the same colors in a different order, the OCPC
matrix might look like the following, no longer diagonal: color in column A color in column B — » green yellow orange red green 4 yellow 2 orange 7 red 3
A simple extension applies if entirely different colors (cyan, blue, maroon, purple) are used in the second spreadsheet column. The OCPC matrix might then
be:
color in column A color in column B — > cyan blue maroon purple green 4 yellow 2 orange 7 red 3
In any of the three cases above, the groupings are identical, and the OCPC matrices have the property that each matrix row and matrix column has only one nonzero element. That lone element is necessarily equal to the sum of the row or
column. This situation should receive the highest similarity score.
One way to define the contrasting situation that would deserves the lowest similarity score would be that for each user-defined group of spreadsheet rows in one spreadsheet column, the other spreadsheet column has a "maximally non- unique" set of colors. In the corresponding OCPC matrix, this corresponds to
each matrix row or column having a broad distribution of values rather than a single non-zero, a uniform distribution has been chosen as the definition of this state: color in column A color in column B — > cyan blue maroon purple green 1 1 1 1 yellow 1 1 1 1 orange 1 1 1 1 red 1 1 1 1
This low-similarity state can be more precisely defined by saying that each
element in a given matrix row or column is the average of all the counts in that
matrix row or column.
With these concepts defined and the OCPC matrix filled, the scores can
then be derived. Each OCPC matrix row (corresponding to a color group in the
first spreadsheet column) is selected in turn for scoring. Each element in the
matrix row is given a score between zero and one, according to its linear
interpolation between: on the one extreme, the average of the nonzero elements in
the row, and on the other, the sum of the row or column (i.e., the maximum value
it could have if all the others were zero). The scores are then averaged over all the
rows of the OCPC matrix to generate a row- wise score component.
Next, the corresponding process is applied to the columns of the OCPC
matrix (each corresponding to a color group in the second spreadsheet column
rather than the first). The resulting column-wise score component is averaged
with the row- wise score component, then the average is scaled to a maximum of
100 to generate the final similarity score for the two spreadsheet columns.
Interpretation of the Similarity Scores
Although the scores are quantitative and well-defined, their interpretation
is best done in a partly subjective manner, based on experience. The behavior of the scores is best understood by example. In FIGURE 25, the leftmost ("base")
column has been compared to each of the others, and the scores are shown as well
as pictures of the grouping patterns. Comparison of the base with the next column
shows that the tool delivers a maximal score of 100 for identical grouping, even when the colors are completely different. Then, stepping across the figure toward
the right, it can be seen how the score decreases as the grouping pattern gradually becomes less similar to that of the base column. All the way down to a similarity score of 40, it is still basically true that the light colors are on top and the dark on
the bottom, with increasing "noise," but when the score falls to 20, the pattern
appears to have no correspondence to that of the base.
Implementation of the Data Grouping Similarity Tool
In practice, the tool allows the user to choose two sets of spreadsheet data columns. The program then automatically generates all pairs containing a column from the first set with a column from the second set, then writes the similarity scores onto a newly inserted spreadsheet in the user's workbook. The output takes the form of a table where the degree of similarity is itself color-coded to aid the
user in identifying significant cases. An example appears FIGURE 26.
While the invention has been described with reference to particular
mechanisms (algorithms, processes and functions) and architectures, one skilled in the art would realize that other mechanisms and/or architectures could be used while still achieving the invention.
While embodiments of the present invention have been described with
particular setup and initialization procedures, other setup and/or initialization
procedures can be used.
Further, while many of the operations have been shown as being performed in a particular order, one skilled in the art would realize that other
orders, including some parallelization of operations, are possible and are
considered to be within the scope of the invention.
While the present invention has been described with reference to analysis and pattern recognition in data sets relating to chemical compounds, the methods, systems and devices of this invention are considered to be general constructs covering other, non-chemical data sets.
Thus, are provided methods, systems and devices for analysis and pattern
recognition in large, multidimensional data sets using low-resolution data grouping. One skilled in the art will appreciate that the present invention can be practiced by other than the described embodiments, which are presented for purposes of illustration and not limitation, and the present invention is limited only by the claims that follow.

Claims

What is claimed is:
1. A method of operating on data, the method comprising: providing at least one user-defined grouping rule for grouping the data into
a user-definable number of groups; and applying at least one of the grouping rules to the data.
2. A method as in claim 1 wherein the data are provided in a table and
wherein the at least one grouping rule applies to at least one user-selectable
column of the table.
3. A method as in claim 1 wherein the at least one grouping rule
defines breakpoints corresponding to the user-definable number of groups, and wherein application of the at least one rule to the data divides the data into groups
based on the breakpoints.
4. A method as in claim 1 further comprising: presenting the grouped data in a manner that visually distinguishes the groups.
5. A method as in claim 4 wherein the grouping rules associate colors with groups and wherein the presenting of the grouped data further comprises: coloring an aspect of the data according to the rules.
6. A method as in claim 4, wherein the data are in labeled columns in
a spreadsheet, and wherein the at least one grouping rule specifies at least one breakpoint and a corresponding color for each at least one breakpoint, and wherein
the presenting of the grouped data comprises:
coloring each data item in the at least one labeled column of the data based on the at least one breakpoint and the corresponding color of the at least one
breakpoint.
7. A method as in any one of claims 3 and 6, wherein the breakpoints are selected from: (a) numeric values; and (b) textual values.
8. A method as in claim 3 wherein the at least one breakpoint is determined automatically based on the data.
9. A method as in claim 5 wherein the data are provided in a table, wherein the coloring of an aspect of the data comprises: coloring backgrounds of table cells according to the rules.
10. A method as in claim 1 wherein the number of groups is fewer than a number of possible data values.
11. A method of operating on data, the method comprising: providing at least one user-defined grouping rule for grouping the data into a user-definable number of groups; applying at least one of the grouping rules to the data to generate grouped
data; providing at least one user-defined scoring rule for scoring the grouped
data according to user-defined scores; and applying at least one of the scoring rules to the grouped data to score the
grouped data.
12. A method of operating on data, the method comprising:
generating grouped data by applying to the data at least one user-defined grouping rule for grouping the data into a user-definable number of groups; and scoring the grouped data by applying to the grouped data at least one user- defined scoring rule for scoring the grouped data according to user-defined scores.
13. A method according to claim 11 or 12 wherein the data comprises a number of parameters for each of a number of cases and the scoring rule comprises a scoring function of user-selectable parameters and user-defined
weights for the selected parameters to be used in scoring the cases, wherein the scoring of the grouped data comprises:
applying the function to the data to obtain a score for each case.
14. A method according to claim 13, further comprising: sorting the scored cases by score.
15. A method according to claim 14, wherein the scored cases are
sorted individually.
16. A method according to claim 14, wherein the scored cases are
sorted by cluster.
17. A system for operating on data, the system comprising: a mechanism constructed and adapted to provide at least one user-defined
grouping rule for grouping the data into a user-definable number of groups; and a mechanism constructed and adapted to apply at least one of the grouping
rules to the data.
18. A system as in claim 17 wherein the data are provided in a table
and wherein the at least one grouping rule applies to at least one user-selectable
column of the table.
19. A system as in claim 17, wherein the at least one grouping rule
defines breakpoints corresponding to the user-definable number of groups, and
wherein application of the at least one rule to the data divides the data into groups
based on the breakpoints.
20. A system as in claim 17, further comprising:
a mechanism constructed and adapted to present the grouped data in a
manner that visually distinguishes the groups.
21. A system as in claim 20, wherein the grouping rules associate colors with groups and wherein the mechanism constructed and adapted to present
the grouped data further comprises: a mechanism constructed and adapted to color an aspect of the data
according to the rules.
22. A system as in claim 20, wherein the data are in labeled columns in
a spreadsheet, and wherein the at least one grouping rule specifies at least one breakpoint and a corresponding color for each at least one breakpoint, and wherein
the mechanism constructed and adapted to present the grouped data comprises:
a mechanism constructed and adapted to color each data item in the at least one labeled column of the data based on the at least one breakpoint and the corresponding color of the at least one breakpoint.
23. A system as in any one of claims 19 and 22, wherein the
breakpoints are selected from: (a) numeric values; and (b) textual values.
24. A system as in claim 19 further comprising:
a mechanism constructed and adapted to determine at least one breakpoint
automatically, based on the data.
25. A system as in claim 21 wherein the data are provided in a table, wherein the mechanism constructed and adapted to color an aspect of the data comprises: a mechanism constructed and adapted to color backgrounds of table cells
according to the rules.
26. A system as in claim 17 wherein the number of groups is fewer
than a number of possible data values.
27. A system of operating on data, the system comprising:
a mechanism constructed and adapted to provide at least one user-defined
grouping rule for grouping the data into a user-definable number of groups;
a mechanism constructed and adapted to apply at least one of the grouping
rules to the data to generate grouped data;
a mechanism constructed and adapted to provide at least one user-defined
scoring rule for scoring the grouped data according to user-defined scores; and
a mechanism constructed and adapted to apply at least one of the scoring
rules to the grouped data to score the grouped data.
28. A system of operating on data, the system comprising:
a mechanism constructed and adapted to generate grouped data by
applying to the data at least one user-defined grouping rule for grouping the data
into a user-definable number of groups; and
a mechanism constructed and adapted to score the grouped data by
applying to the grouped data at least one user-defined scoring rule for scoring the
grouped data according to user-defined scores.
29. A system according to claim 27 or 28 wherein the data comprises a
number of parameters for each of a number of cases and the scoring rule
comprises a scoring function of user-selectable parameters and user-defined
weights for the selected parameters to be used in scoring the cases, wherein the
mechanism constructed and adapted to score of the grouped data comprises:
a mechanism constructed and adapted to apply the function to the data to
obtain a score for each case.
30. A system according to claim 29, further comprising:
a mechanism constructed and adapted to sort the scored cases by score.
31. A system according to claim 30, wherein the scored cases are
sorted individually.
32. A system according to claim 30, wherein the scored cases are
sorted by cluster.
33. A computer-readable memory medium encoded with program data
representing a computer program that can cause a computer to implement a
method of operating on data, the method comprising:
providing at least one user-defined grouping rule for grouping the data into
a user-definable number of groups; and
applying at least one of the grouping rules to the data.
34. A medium as in claim 33 wherein the data are provided in a table and wherein the at least one grouping rule applies to at least one user-selectable
column of the table.
35. A medium as in claim 33 wherein the at least one grouping rule
defines breakpoints corresponding to the user-definable number of groups, and
wherein application of the at least one rule to the data divides the data into groups
based on the breakpoints.
36. A medium as in claim 33, wherein the method further comprises:
presenting the grouped data in a manner that visually distinguishes the
groups.
37. A medium as in claim 36 wherein the grouping rules associate
colors with groups and wherein the presenting of the grouped data further
comprises: coloring an aspect of the data according to the rules.
38. A medium as in claim 36, wherein the data are in labeled columns
in a spreadsheet, and wherein the at least one grouping rule specifies at least one
breakpoint and a corresponding color for each at least one breakpoint, and wherein
the presenting of the grouped data comprises:
coloring each data item in the at least one labeled column of the data based
on the at least one breakpoint and the corresponding color of the at least one
breakpoint.
39. A medium as in any one of claims 35 and 38, wherein the
breakpoints are selected from: (a) numeric values; and (b) textual values.
40. A medium as in claim 35 wherein the at least one breakpoint is
determined automatically based on the data.
41. A medium as in claim 37 wherein the data are provided in a table, wherein the coloring of an aspect of the data comprises:
coloring backgrounds of table cells according to the rules.
42. A medium as in claim 33 wherein the number of groups is fewer
than a number of possible data values.
43. A computer-readable memory medium encoded with program data
representing a computer program that can cause a computer to implement a
method of operating on data, the method comprising:
providing at least one user-defined grouping rule for grouping the data into
a user-definable number of groups;
applying at least one of the grouping rules to the data to generate grouped
data;
providing at least one user-defined scoring rule for scoring the grouped
data according to user-defined scores; and
applying at least one of the scoring rules to the grouped data to score the
grouped data.
44. A computer-readable memory medium encoded with program data
representing a computer program that can cause a computer to implement a
method of operating on data, the method comprising:
generating grouped data by applying to the data at least one user-defined
grouping rule for grouping the data into a user-definable number of groups; and
scoring the grouped data by applying to the grouped data at least one user-
defined scoring rule for scoring the grouped data according to user-defined scores.
45. A medium according to claim 43 or 44, wherein the data comprises
a number of parameters for each of a number of cases and the scoring rule
comprises a scoring function of user-selectable parameters and user-defined
weights for the selected parameters to be used in scoring the cases, wherein the
scoring of the grouped data comprises:
applying the function to the data to obtain a score for each case.
46. A medium according to claim 44, the method further comprising:
sorting the scored cases by score.
47. A medium according to claim 46, wherein the scored cases are
sorted individually.
48. A medium according to claim 46, wherein the scored cases are sorted by cluster.
PCT/US2000/020401 1999-07-27 2000-07-27 Analysis and pattern recognition in large, multidimensional data sets using low-resolution data grouping WO2001008039A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP00950750A EP1210670A2 (en) 1999-07-27 2000-07-27 Analysis and pattern recognition in large, multidimensional data sets using low-resolution data grouping
JP2001513058A JP2003505790A (en) 1999-07-27 2000-07-27 Analysis and pattern recognition of a series of large volumes of multidimensional data using low-resolution data partitioning
AU63809/00A AU6380900A (en) 1999-07-27 2000-07-27 Analysis and pattern recognition in large, multidimensional data sets using low-resolution data grouping

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US36112299A 1999-07-27 1999-07-27
US09/361,122 1999-07-27

Publications (3)

Publication Number Publication Date
WO2001008039A2 true WO2001008039A2 (en) 2001-02-01
WO2001008039A3 WO2001008039A3 (en) 2001-03-22
WO2001008039A9 WO2001008039A9 (en) 2002-08-08

Family

ID=23420737

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/020401 WO2001008039A2 (en) 1999-07-27 2000-07-27 Analysis and pattern recognition in large, multidimensional data sets using low-resolution data grouping

Country Status (4)

Country Link
EP (1) EP1210670A2 (en)
JP (1) JP2003505790A (en)
AU (1) AU6380900A (en)
WO (1) WO2001008039A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004011939A1 (en) * 2002-07-30 2004-02-05 Amersham Biosiciences Uk Limited Perturbation effect recognition in a biological system
US7006994B1 (en) 1999-07-16 2006-02-28 American Management Systems, Inc. Automated receivables management system
US11308134B2 (en) 2017-06-09 2022-04-19 Microsoft Technology Licensing, Llc Automatic analysis of difference between multi-dimensional datasets

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Dynamic Layout Mechanism for the Massive-Node Server Status Monitor" IBM TECHNICAL DISCLOSURE BULLETIN, vol. 36, no. 5, 1 May 1993 (1993-05-01), pages 169-170, XP000408951 New York, US *
EISEN M B ET AL: "Cluster analysis and display of genome-wide expression patterns" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA,US,NATIONAL ACADEMY OF SCIENCE. WASHINGTON, vol. 95, December 1998 (1998-12), pages 14863-14868, XP002140966 ISSN: 0027-8424 *
STANTON D T ET AL: "Application of nearest-neighbor and cluster analyses in pharmaceutical lead discovery" JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, JAN.-FEB. 1999, ACS, USA, vol. 39, no. 1, pages 21-27, XP000971515 ISSN: 0095-2338 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7006994B1 (en) 1999-07-16 2006-02-28 American Management Systems, Inc. Automated receivables management system
WO2004011939A1 (en) * 2002-07-30 2004-02-05 Amersham Biosiciences Uk Limited Perturbation effect recognition in a biological system
GB2406195A (en) * 2002-07-30 2005-03-23 Amersham Biosciences Uk Ltd Perturbation effect recognition in a biological system
US11308134B2 (en) 2017-06-09 2022-04-19 Microsoft Technology Licensing, Llc Automatic analysis of difference between multi-dimensional datasets

Also Published As

Publication number Publication date
WO2001008039A3 (en) 2001-03-22
EP1210670A2 (en) 2002-06-05
JP2003505790A (en) 2003-02-12
WO2001008039A9 (en) 2002-08-08
AU6380900A (en) 2001-02-13

Similar Documents

Publication Publication Date Title
US6211887B1 (en) System and method for data visualization
Gardener Community ecology: analytical methods using R and Excel
Kincaid et al. Line graph explorer: scalable display of line graphs using focus+ context
Pirolli et al. Table lens as a tool for making sense of data
EP1524612A2 (en) Displaying and manipulating data
US20210272029A1 (en) Context aware recommendation of analytic components
US6868342B2 (en) Method and display for multivariate classification
US6199017B1 (en) Biochemical information processing apparatus, biochemical information processing method, and biochemical information recording medium
KR20030081416A (en) System for facilitating selection of investments
US6954908B2 (en) Circuit design point selection method and apparatus
US6927779B2 (en) Web-based well plate information retrieval and display system
EP1210670A2 (en) Analysis and pattern recognition in large, multidimensional data sets using low-resolution data grouping
JP5020491B2 (en) NMR data processing apparatus and method
JP5415476B2 (en) NMR data processing apparatus and method
US7930108B2 (en) Exploratory visualization of protein complexes by molecular weight
US6947937B1 (en) Data selection system and method therefor
Lawrence et al. explorase: Exploratory data analysis of systems biology data
Wills Visual exploration of large structured datasets
JPH07262172A (en) Data analyzer
WO2003079232A2 (en) Credit data visualisation system and method
JPH04343051A (en) Data analysis system of electron beam microanalyzer
JP7380870B2 (en) Masking device, masking method and masking program
WO2023209421A1 (en) Sediment analysis software along with scanning electron microscope toolbox for image processing
Bandi et al. Supporting Visual Comparison and Pattern Identification in Widescale Genomic Datasets
JMP et al. Essential Graphing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2000950750

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 2000950750

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 10048022

Country of ref document: US

AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/64-64/64, DRAWINGS, REPLACED BY NEW PAGES 1/64-64/64; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

WWW Wipo information: withdrawn in national office

Ref document number: 2000950750

Country of ref document: EP