MXPA00010727A - System, method, and computer program product for representing proximity data in a multi-dimensional space

Publication number
MXPA00010727A
Authority
MX
Mexico
Prior art keywords
objects
distance
correlations
pairs
correlation
Application number
MXPA/A/2000/010727A
Other languages
Spanish (es)
Inventor
Francis R Salemme
Dimitris K Agrafiotis
Victor S Lobanov
Original Assignee
3Dimensional Pharmaceuticals Inc
Application filed by 3Dimensional Pharmaceuticals Inc

Abstract

A system, method and computer program product for representing precise or imprecise measurements of similarity/dissimilarity (relationships) between objects as distances between points in a multi-dimensional space that represents the objects. Self-organizing principles are used to iteratively refine an initial (random or partially ordered) configuration of points using stochastic relationship/distance errors. The data can be complete or incomplete (i.e., some relationships between objects may not be known), exact or inexact (i.e., some or all of the relationships may be given in terms of allowed ranges or limits), symmetric or asymmetric (i.e., the relationship of object A to object B may not be the same as the relationship of B to A) and may contain systematic or stochastic errors. The relationships between objects may be derived directly from observation, measurement, a priori knowledge, or intuition, or may be determined indirectly using any suitable technique for deriving proximity (relationship) data. The present invention iteratively analyzes sub-sets of objects in order to represent them in a multi-dimensional space that represents the objects. In an exemplary embodiment, the present invention iteratively analyzes sub-sets of objects using conventional multi-dimensional scaling or non-linear mapping algorithms. In another exemplary embodiment, relationships are defined as pair-wise relationships or pair-wise similarities/dissimilarities between pairs of objects, and the present invention iteratively analyzes a pair of objects at a time. Preferably, sub-sets are evaluated pair-wise, as a double-nested loop.

Description

SYSTEM, METHOD AND COMPUTER PROGRAM PRODUCT FOR REPRESENTING PROXIMITY DATA IN A MULTI-DIMENSIONAL SPACE

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention is directed to data analysis and, more particularly, to the representation of proximity data in a multi-dimensional space.
Related Art

Multidimensional scaling (MDS) and non-linear mapping (NLM) are techniques for generating display maps, including non-linear maps, of objects, in which the distances between the objects represent the relationships between the objects. MDS and NLM were introduced by Torgerson, Psychometrika, 17:401 (1952); Kruskal, Psychometrika, 29:115 (1964); and Sammon, IEEE Trans. Comput., C-18:401 (1969) as a means to generate low-dimensional representations of psychological data. Multidimensional scaling and non-linear mapping are reviewed in Schiffman, Reynolds and Young, Introduction to Multidimensional Scaling, Academic Press, New York (1981); Young and Hamer, Multidimensional Scaling: History, Theory and Applications, Erlbaum Associates, Inc., Hillsdale, NJ (1987); and Cox and Cox, Multidimensional Scaling, Number 59 in Monographs in Statistics and Applied Probability, Chapman-Hall (1994). The contents of these publications are incorporated herein by reference in their entireties.

MDS and NLM (these are generally the same, and are hereafter referred to collectively as MDS) represent a collection of methods for visualizing the proximity relationships of objects by the distances between points in a low-dimensional Euclidean space. Proximity measures are reviewed in Hartigan, J. Am. Statist. Ass., 62:1140 (1967), which is incorporated herein by reference in its entirety.

In particular, given a finite set of vectorial or other samples $A = \{a_i,\ i = 1, \ldots, k\}$, a relationship function $r_{ij} = r(a_i, a_j)$, with $a_i, a_j \in A$, which measures the similarity or dissimilarity between the i-th and j-th objects in A, and a set of images $X = \{x_i,\ i = 1, \ldots, k;\ x_i \in \mathbb{R}^m\}$ of A on an m-dimensional display plane ($\mathbb{R}^m$ is the space of all m-dimensional vectors of real numbers), the objective is to place the $x_i$ on the display plane in such a way that their Euclidean distances $d_{ij} = \|x_i - x_j\|$ approximate the corresponding values $r_{ij}$ as closely as possible. This projection, which in many cases can only be made approximately, is carried out iteratively by minimizing an error function that measures the difference between the original distance matrix, $r_{ij}$, and the projected distance matrix, $d_{ij}$, of the original and projected vector sets.

Several such error functions have been proposed, most of which are of the least-squares type, including Kruskal's "stress" S (Eq. 1), Sammon's error criterion E (Eq. 2), and Lingoes' alienation coefficient K (Eq. 3), where $d_{ij} = \|x_i - x_j\|$ is the Euclidean distance between the images $x_i$ and $x_j$ on the display plane.

In general, the solution is found iteratively by: (1) computing or retrieving from a database the relationships $r_{ij}$; (2) initializing the images $x_i$; (3) computing the distances of the images $d_{ij}$ and the value of the error function (e.g., S, E or K in Eqs. 1-3 above); (4) computing a new configuration of the images $x_i$ using a gradient descent procedure, such as Kruskal's linear regression or Guttman's rank-image permutation; and (5) repeating steps 3 and 4 until the error is minimized within some prescribed tolerance.

For example, the Sammon algorithm minimizes Eq. 2 by iteratively updating the coordinates $x_i$ using Eq. 4:

$$x_{pq}(m+1) = x_{pq}(m) - \lambda \Delta_{pq}(m) \qquad \text{(Eq. 4)}$$

where m is the iteration number, $x_{pq}$ is the q-th coordinate of the p-th image $x_p$, $\lambda$ is the learning rate, and $\Delta_{pq}(m)$ is given by Eq. 5 in terms of the partial derivatives of the error function. The mapping is obtained by repeated evaluation of Eq. 2, followed by modification of the coordinates using Eqs. 4 and 5, until the error is minimized within a prescribed tolerance.
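Step (3) of the iterative procedure above can be illustrated with a short sketch that evaluates an error function for a given configuration. Because Eqs. 1-3 are not legible in this text, the sketch assumes the commonly cited textbook form of Sammon's error, $E = \sum_{i<j} (r_{ij} - d_{ij})^2 / r_{ij} \,/\, \sum_{i<j} r_{ij}$; that form and the names `sammon_error`, `r` and `x` are assumptions made for illustration only.

```python
import math


def sammon_error(r, x):
    """Textbook Sammon error for target relationships r and images x.

    r is a square matrix (list of lists) of target distances with r[i][j] > 0
    for i != j; x is a list of coordinate sequences, one per object.
    This normalization is an assumption, not taken from the source text.
    """
    n = len(x)
    numerator = 0.0
    denominator = 0.0
    # Double loop over all pairs i < j: the cost grows with the square of n.
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = math.dist(x[i], x[j])  # Euclidean distance on the map
            numerator += (r[i][j] - d_ij) ** 2 / r[i][j]
            denominator += r[i][j]
    return numerator / denominator
```

A configuration whose map distances reproduce every target exactly yields an error of zero. The nested loop over all pairs also makes the quadratic cost of a full evaluation visible.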
The general refinement paradigm above is suitable for relatively small data sets but has one important limitation that renders it impractical for large data sets. This limitation stems from the fact that the computational effort required to compute the gradients (i.e., step (4) above) scales with the square of the size of the data set. For relatively large data sets, this quadratic time complexity makes even a partial refinement intractable.

What is needed is a system, method and computer program product for representing proximity data in a multi-dimensional space, which scales favorably with the number of objects and which can be applied to both small and large data sets. In addition, what is needed is a system, method and computer program product that can be effective with missing data and/or data containing bounded or unbounded uncertainties, noise or errors.
Summary of the Invention

The present invention is a system, method and computer program product for representing precise or imprecise measurements of similarity/dissimilarity (relationships) between objects, preferably as distances between points in a multi-dimensional space that represents the objects. The algorithm uses self-organizing principles to iteratively refine an initial (random or partially ordered) configuration of points using stochastic relationship/distance errors. The data can be complete or incomplete (i.e., some relationships between objects may not be known), exact or inexact (i.e., some or all of the relationships may be given in terms of allowed ranges or limits), symmetric or asymmetric (i.e., the relationship of object A to object B may not be the same as the relationship of B to A) and may contain systematic or stochastic errors.
The relationships between objects may be derived directly from observation, measurement, a priori knowledge, or intuition, or may be determined directly or indirectly using any suitable technique for deriving proximity (relationship) data.

The present invention iteratively analyzes sub-sets of objects in order to represent them in a multi-dimensional space that represents the relationships between the objects. In an exemplary embodiment, the present invention iteratively analyzes sub-sets of objects using conventional multi-dimensional scaling or non-linear mapping algorithms. In another exemplary embodiment, relationships are defined as pair-wise relationships or pair-wise similarities/dissimilarities between pairs of objects, and the present invention iteratively analyzes a pair of objects at a time. Preferably, sub-sets are evaluated pair-wise, as a double-nested loop.

In the following discussion, the terms relationship, similarity and dissimilarity are used to denote a relationship between a pair of objects. The term display map is used to denote a collection of images on an n-dimensional space that represents the original objects. The term distance is used to denote a distance between images on a display map that correspond to the objects.

Examples of the present invention are provided herein, including examples implemented with chemical compound data and relationships. It is to be understood, however, that the present invention is not limited to the examples presented herein. The present invention can be implemented in a variety of applications.
For example, although the specific embodiment described herein uses distances between points to represent similarity/dissimilarity between objects, the invention is intended and adapted to use any display attribute to represent similarity/dissimilarity between objects, including but not limited to font, size, color, grey scale, italics, underlining, bold, outlining, borders, etc. For example, the similarity/dissimilarity between two objects may be represented by the relative sizes of the points that represent the objects.

Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.
Brief Description of the Figures

The file of this patent contains at least one drawing executed in color. Copies of this patent with color drawing(s) will be provided by the Patent and Trademark Office upon request and payment of the necessary fee.

The present invention will be described with reference to the accompanying drawings, wherein:

Figure 1 illustrates a block diagram of a computing environment according to an embodiment of the invention;
Figure 2 is a block diagram of a computer useful for implementing components of the invention;
Figure 3 is a flowchart representing the operation of the invention in visualizing and interactively processing display maps according to an embodiment of the invention;
Figure 4 is a flowchart representing the manner in which a display map is generated according to an embodiment of the invention;
Figure 5 conceptually illustrates relationships between objects, wherein the relationships are known within certain tolerances;
Figure 6 is a block diagram of a system for representing relationships between objects; and
Figure 7 is a process flowchart illustrating a method for representing relationships between objects.

In the drawings, like reference numbers indicate identical or functionally similar elements. Also, the leftmost digit(s) of a reference number identifies the drawing in which the associated element first appears.
Detailed Description of the Preferred Embodiments

Table of Contents

I. Overview of the Present Invention
II. Sub-Set Selection
III. Complete Pairwise Relationship Matrices without Uncertainties
IV. Sparse Pairwise Relationship Matrices without Uncertainties
V. Pairwise Relationship Matrices with Bounded Uncertainties
VI. Pairwise Relationship Matrices with Unbounded Uncertainties (Corrupt Data)
VII. Modifications of the Basic Algorithm
VIII. Evaluation Properties (Features), Relationships and Distance Measures
A. Evaluation Properties Having Continuous or Discrete Real Values
1. Relationships or Distance Measures Where the Values of the Evaluation Properties Are Continuous or Discrete Real Numbers
B. Evaluation Properties Having Binary Values
1. Distance Measures Where the Values of the Evaluation Properties Are Binary
C. Scaling of Evaluation Properties
IX. Implementation of the Invention
A. Generally
B. Implementation of the Invention in a Computer Program Product
C. Operation of the Present Invention
X. Example of the Invention
A. Operation of the Exemplary Embodiment
XI. Conclusions

I. Overview of the Present Invention

The present invention is a system, method and computer program product for representing precise or imprecise measurements of similarity/dissimilarity (relationships) between objects as distances between points (or using other display attributes or techniques) in a multi-dimensional space that represents the objects. The algorithm uses self-organizing principles to iteratively refine an initial (random or partially ordered) configuration of points using stochastic relationship/distance errors.
The data can be complete or incomplete (i.e., some relationships between objects may not be known), exact or inexact (i.e., some or all relationships may be given in terms of allowed ranges or limits), symmetric or asymmetric (i.e., the relationship of object A to object B may not be the same as the relationship of B to A) and may contain systematic or stochastic errors.

The relationships between objects may be derived directly from observation, measurement, a priori knowledge, or intuition, or may be determined directly or indirectly using any suitable technique for deriving proximity (relationship) data.

The present invention iteratively analyzes sub-sets of objects in order to represent them in a multi-dimensional space that represents the objects. In an exemplary embodiment, the present invention iteratively analyzes sub-sets of objects using conventional multi-dimensional scaling or non-linear mapping algorithms. In another exemplary embodiment, relationships are defined as pair-wise relationships or pair-wise similarities/dissimilarities between pairs of objects, and the present invention iteratively analyzes a pair of objects at a time. Preferably, sub-sets are evaluated pair-wise, as a double-nested loop.
In an alternative embodiment, relationships are defined as N-way relationships or N-way similarities/dissimilarities between multiple objects, and the present invention iteratively analyzes multiple objects at a time, where N is preferably greater than 1. Implementation of this alternative embodiment will be apparent to persons skilled in the relevant art(s).

The term "object" refers to any entity, data, property, attribute, component, element, ingredient, item, etc., for which it would be useful to represent the similarity/dissimilarity between instances of, or different, such entities, data, properties, attributes, components, elements, ingredients, items, etc. Without limitation, but by way of illustration only, objects include, for example, chemical compounds, processes, machines, compositions of matter, articles of manufacture, electrical devices, mechanical devices, financial data, financial instruments, financial trends, financial related traits and characteristics, software products, human traits and characteristics, scientific properties, traits, and characteristics, etc. In an embodiment, the invention operates with any entity, data, property, attribute, component, element, ingredient, item, etc., except chemical compounds.
II. Sub-Set Selection

The present invention iteratively analyzes sub-sets of objects in order to represent them in a multi-dimensional space that represents the relationships between the objects. In an exemplary embodiment, the present invention iteratively analyzes sub-sets of objects using conventional multi-dimensional scaling or non-linear mapping algorithms. In this embodiment, the objects in a selected sub-set are analyzed as a group using a conventional algorithm, such as, but not limited to, those described above. In particular, the coordinates of the images corresponding to the objects in the sub-set are refined using conventional multi-dimensional scaling, non-linear mapping, any other suitable algorithm, or the pairwise refinement algorithm described below.

In this embodiment, sub-sets of objects can be selected randomly, semi-randomly, systematically, partially systematically, etc. As sub-sets of objects are analyzed and their distances are revised, the set of objects tends to self-organize. In this way, large data sets can be accommodated with conventional multi-dimensional scaling or non-linear mapping algorithms.

In another exemplary embodiment, relationships are defined as pair-wise relationships or pair-wise similarities/dissimilarities between pairs of objects, and the present invention iteratively analyzes a pair of objects at a time. Pairs of objects can be selected randomly, semi-randomly, systematically, partially systematically, etc. Novel algorithms and techniques for pair-wise analysis are provided in the sections below. This embodiment is described for illustrative purposes only and is not limiting.
In an alternative embodiment, relationships are defined as N-way relationships or N-way similarities/dissimilarities between multiple objects, and the present invention iteratively analyzes multiple objects at a time, where N is preferably greater than 1. Implementation of this alternative embodiment will be apparent to persons skilled in the relevant art(s).
III. Complete Pairwise Relationship Matrices without Uncertainties

A preferred approach adopted herein is to use iterative refinement based on stochastic or instantaneous errors. The discussion in this section assumes that all pairwise relationships are known and that they are all exact. As in traditional MDS, the method starts with an initial configuration of points generated at random or by some other procedure (see below). This initial configuration is then continuously refined by repeatedly selecting two points, i and j, at random, and modifying their coordinates on the display map according to Eq. 8:

$$x_i(t+1) = f(t, x_i(t), x_j(t), r_{ij}) \qquad \text{(Eq. 8)}$$

where t is the current iteration, $x_i(t)$ and $x_j(t)$ are the current coordinates of the i-th and j-th points on the display map, $x_i(t+1)$ are the new coordinates of the i-th point on the display map, and $r_{ij}$ is the pairwise relationship between the i-th and j-th objects that the display map attempts to approximate (see above).

The function f(.) in Eq. 8 can assume any functional form. Ideally, this function should try to minimize the difference between the actual and target distance between the i-th and j-th points. For example, f(.) can be given by Eq. 9, where t is the iteration number, $d_{ij} = \|x_i(t) - x_j(t)\|$, and $\lambda(t)$ is an adjustable parameter, referred to hereafter as the "learning rate", a term borrowed from neural network terminology. This process is repeated for a fixed number of cycles, or until some global error criterion is minimized within some prescribed tolerance. Typically, a large number of iterations are required to achieve statistical accuracy.
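The pairwise refinement loop described in this section can be sketched as follows. Eq. 9 is not legible in this text, so the update used here is an assumed form: both points of the selected pair are moved along the line joining them by half of the fraction `lam` of the distance error. The names `update_pair` and `refine`, the symmetric update of both points, and the constant learning rate are illustrative assumptions.

```python
import math
import random


def update_pair(x, i, j, r_ij, lam, eps=1e-12):
    """Move images i and j so their distance shifts toward the target r_ij.

    Assumed update: each point absorbs half of the correction, so the new
    distance is d + lam * (r_ij - d). eps guards against division by zero.
    """
    d = math.dist(x[i], x[j])
    scale = 0.5 * lam * (r_ij - d) / (d + eps)
    xi, xj = x[i], x[j]
    x[i] = [a + scale * (a - b) for a, b in zip(xi, xj)]
    x[j] = [b + scale * (b - a) for a, b in zip(xi, xj)]


def refine(r, n_points, dim=2, steps=10000, lam=0.5, seed=0):
    """Refine a random initial configuration one randomly chosen pair at a time."""
    rng = random.Random(seed)
    x = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    for _ in range(steps):
        i, j = rng.sample(range(n_points), 2)  # two distinct points at random
        update_pair(x, i, j, r[i][j], lam)
    return x
```

Each step touches only one pair, so the cost of a step does not depend on the number of objects. With `lam` equal to 1 a single update sets the pair exactly to its target distance, and smaller values make more cautious corrections.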
The method described above is reminiscent of neural network back-propagation (Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, PhD Thesis, Harvard University, Cambridge, MA (1974), and Rumelhart and McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press, Cambridge, MA (1986)) and of Kohonen's self-organizing principle (Kohonen, Biological Cybernetics, 43:59 (1982)).

The learning rate $\lambda(t)$ in Eq. 9 plays a key role in ensuring convergence. If $\lambda$ is too small, the coordinate updates are small and convergence is slow. If, on the other hand, $\lambda$ is too large, the rate of learning may be accelerated, but the display map may become unstable (i.e., oscillatory). Typically, $\lambda$ ranges in the interval [0, 1] and may be fixed, or it may decrease monotonically during the refinement process. Moreover, $\lambda$ may also be a function of i, j and/or $r_{ij}$, and can be used to apply different weights to certain objects and/or relationships. For example, $\lambda$ can be computed by Eq. 10:

$$\lambda(t) = \left(\lambda_{max} - (\lambda_{max} - \lambda_{min})\,\frac{t}{T}\right)\frac{1}{1 + a\,r_{ij}} \qquad \text{(Eq. 10)}$$

where $\lambda_{max}$ and $\lambda_{min}$ are the (unweighted) starting and ending learning rates such that $\lambda_{max}, \lambda_{min} \in [0, 1]$, T is the total number of refinement steps (iterations), t is the current iteration number, and a is a constant scaling factor. Eqs. 10 and 11 have the effect of decreasing the correction at large separations, thus creating a display map that preserves short-range interactions more faithfully than long-range ones. Weighting is discussed in greater detail below.

One of the main advantages of this approach is that it makes partial refinements possible. It is often sufficient that the pairwise relationships are represented only approximately to reveal the general structure and topology of the data. Unlike traditional MDS, this approach allows very fine control of the refinement process.
Moreover, as the display map self-organizes, the pairwise refinements become cooperative, which partially alleviates the quadratic nature of the problem.
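The learning-rate schedule discussed above can be sketched as a small function. It assumes a linear decay from a starting rate to an ending rate over the total number of steps, optionally divided by a factor that grows with the target relationship so that large separations receive smaller corrections. The function name `learning_rate`, its default values, and the linear form of the decay are assumptions for illustration.

```python
def learning_rate(t, total_steps, lam_max=1.0, lam_min=0.01, a=0.0, r_ij=0.0):
    """Assumed schedule: linear decay from lam_max to lam_min, damped by r_ij.

    t is the current iteration, total_steps the total number of iterations,
    and a a constant scaling factor; a = 0 disables the distance weighting.
    """
    base = lam_max - (lam_max - lam_min) * t / total_steps
    return base / (1.0 + a * r_ij)
```

With `a` set to zero the schedule reduces to a plain linear decay. A positive `a` shrinks the correction for pairs whose target relationship is large, which favors the preservation of short-range structure.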
The embedding procedure described above does not guarantee convergence to the global minimum (i.e., the most faithful embedding in the least-squares sense). If so desired, the refinement process may be repeated a number of times from different starting configurations and/or random number seeds. In general, the absolute coordinates on the display map carry no physical significance. What is important are the relative distances between points, and the general structure and topology of the data (presence, density and separation of clusters, etc.).

The method described above is ideally suited for both metric and non-metric scaling. The latter is particularly useful when the pairwise relationships do not obey the distance postulates and, in particular, the triangle inequality. Although an "exact" projection is only possible when the pairwise relationship matrix is positive definite, meaningful maps can still be obtained even when this criterion is not satisfied. As mentioned above, the overall quality of the projection is determined by a sum-of-squares error function such as those shown in Eqs. 1-3.
The general algorithm described above can also be applied when the pairwise relationship matrix is incomplete, i.e., when some of the pairwise relationships are unknown, when some of the pairwise relationships are uncertain or corrupt, or both. These cases are discussed separately below.
IV. Sparse Pairwise Relationship Matrices without Uncertainties

The general algorithm described above can also be applied when the pairwise relationship matrix is incomplete, i.e., when some of the pairwise relationships are unknown. In this case, an algorithm similar to the one described above can be used, with the exception that the algorithm iterates only over the pairs of points for which the relationships are known. The algorithm then identifies configurations in space that satisfy the known pairwise relationships; the unknown pairwise relationships adapt during the course of the refinement and eventually assume values that yield a satisfactory embedding.
Depending on the amount of missing data, there may be more than one satisfactory embedding (mapping) of the original relationship matrix. In this case, different configurations (maps) may be derived from different starting configurations or random number seeds. In some applications, such as searching the conformational space of molecules, this feature provides a significant advantage over some alternative techniques. All variants of the original algorithm (see the sections below) can be used in this context.
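The restriction to known pairs described in this section can be sketched by storing only the known relationships and sampling from them. The update rule is the same assumed symmetric correction used for illustration in the complete-matrix case; the name `refine_sparse` and the dictionary layout of `known` are hypothetical.

```python
import math
import random


def refine_sparse(known, n_points, dim=2, steps=5000, lam=0.5, seed=0):
    """Refine a random configuration using only the known pairwise relationships.

    known maps a pair (i, j) to its target r_ij; pairs absent from the
    dictionary are never selected, so their distances adapt freely.
    """
    rng = random.Random(seed)
    x = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    pairs = list(known)
    for _ in range(steps):
        i, j = rng.choice(pairs)  # sample only among known pairs
        d = math.dist(x[i], x[j])
        scale = 0.5 * lam * (known[(i, j)] - d) / (d + 1e-12)
        xi, xj = x[i], x[j]
        x[i] = [a + scale * (a - b) for a, b in zip(xi, xj)]
        x[j] = [b + scale * (b - a) for a, b in zip(xi, xj)]
    return x
```

In a three-point example where only the relationships (0, 1) and (1, 2) are known, the distance between points 0 and 2 is left unconstrained, and different seeds can produce different but equally valid maps.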
V. Pairwise Relationship Matrices with Bounded Uncertainties

The general algorithm described above can also be applied when the pairwise relationships contain bounded uncertainties, i.e., when some of the pairwise relationships are known only to within certain fixed tolerances (for example, the relationships are known to lie within a range or set of ranges with prescribed upper and lower bounds). In this case, an algorithm similar to the one described above can be used, with the exception that the distances on the display map are corrected only when they fall outside the prescribed bounds.

For example, assume that the relationship between two objects, i and j, is given in terms of an upper and a lower bound, $r_{max}$ and $r_{min}$, respectively. When this pair of objects is selected during the course of the refinement, the distance of the corresponding images on the display map is computed and denoted $d_{ij}$. If $d_{ij}$ is larger than $r_{max}$, the coordinates of the images are updated using $r_{max}$ as the target distance (Eq. 12):

$$x_i(t+1) = f(t, x_i(t), x_j(t), r_{max}) \qquad \text{(Eq. 12)}$$

Conversely, if $d_{ij}$ is smaller than $r_{min}$, the coordinates of the images are updated using $r_{min}$ as the target distance (Eq. 13):

$$x_i(t+1) = f(t, x_i(t), x_j(t), r_{min}) \qquad \text{(Eq. 13)}$$

If $d_{ij}$ lies between the upper and lower bounds (i.e., if $r_{min} \le d_{ij} \le r_{max}$), no correction is made. In other words, the algorithm attempts to match the upper bound if the current distance between the images is greater than the upper bound, or the lower bound if the current distance between the images is less than the lower bound. If the distance between the images lies within the upper and lower bounds, no correction is made.

This algorithm can be extended to the case where some of the pairwise relationships are given by a finite set of allowed discrete values, or by a set of ranges of values, or some combination thereof.
For the purposes of the discussion below, discrete values are treated as ranges of zero width (e.g., the discrete value 2 can be represented as the range [2,2]).

Various possibilities for a single hypothetical pairwise relationship and the current distance of the corresponding images on the display map are illustrated in Figure 5, where the shaded areas 510, 512 and 514 denote the allowed ranges for a given pairwise relationship. The distances d1-d5 illustrate five different possibilities for the current distance between the corresponding images on the display map. Arrows 516, 518, 520 and 522 indicate the direction of the correction that should be applied to the images on the map. Arrows 518 and 522 point to the left, indicating that the coordinates of the associated images on the display map should be updated so that the images come closer together. Arrows 516 and 520 point to the right, indicating that the coordinates of the associated images should be updated so that the images move farther apart.

As in the case of a single range, if the current distance of a selected pair of images on the display map lies within any of the prescribed ranges, no coordinate update takes place (i.e., case d1 in Figure 5). If not, the correction is applied using the nearest range boundary as the target distance (i.e., cases d2-d5 in Figure 5). For example, if the relationship between a given pair of objects lies in the ranges [1,2], [3,5] and [6,7] and the current distance of the respective images is 2.9 (d5 in Figure 5), the correction takes place using 3 as the target distance ($r_{ij}$) in Eq. 8. If, however, the current distance is 2.1, the coordinates are updated using 2 as the target distance ($r_{ij}$) in Eq. 8.
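The deterministic rule above (no correction inside an allowed range, otherwise the nearest range boundary becomes the target) can be sketched as a small helper. The function name `target_distance` and the representation of ranges as (lower, upper) tuples are assumptions for illustration.

```python
def target_distance(d, ranges):
    """Return the target distance for a current map distance d.

    ranges is a list of (lower, upper) tuples; a discrete value v is written
    as (v, v). Returns None when d already lies inside an allowed range,
    meaning no correction is needed; otherwise returns the nearest boundary.
    """
    for lower, upper in ranges:
        if lower <= d <= upper:
            return None
    boundaries = [b for bounds in ranges for b in bounds]
    return min(boundaries, key=lambda b: abs(b - d))
```

For the ranges [1,2], [3,5] and [6,7] used in the example above, a current distance of 2.9 gives a target of 3 and a current distance of 2.1 gives a target of 2, while a distance of 4 needs no correction.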
This deterministic criterion may be replaced by a stochastic or probabilistic one, in which the target distance is selected either randomly or with a probability that depends on the difference between the current distance and the two nearest range boundaries. In the example described above (d5 in Figure 5), a probabilistic choice between 2 and 3 as the target distance could be made, with probabilities of, for example, 0.1 and 0.9, respectively (i.e., 2 could be selected as the target distance with probability 0.1, and 3 with probability 0.9). Any method for deriving such probabilities can be used. Alternatively, either 2 or 3 could be chosen as the target distance at random.

For example, bounded uncertainties in the pairwise relationships may represent stochastic or systematic errors or noise associated with a physical measurement, and can, in general, differ from one pairwise relationship to another. A typical example is the Nuclear Overhauser Effects (NOEs) in multi-dimensional Nuclear Magnetic Resonance spectrometry.

An alternative algorithm for dealing with uncertainties is to reduce the magnitude of the correction for pairs of objects whose relationship is thought to be uncertain. In this scheme, the magnitude of the correction, as determined by the learning rate in Eq. 9, for example, is reduced for pairwise relationships that are thought to be uncertain. The magnitude of the correction may depend on the degree of uncertainty associated with the corresponding pairwise relationship (for example, the magnitude of the correction may be inversely proportional to the uncertainty associated with the corresponding pairwise relationship). If the existence and/or magnitude of the errors is unknown, then the errors can be determined automatically by the algorithm (see Section VI below).
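The probabilistic choice of a target distance can be sketched as follows. The text leaves the method for deriving the probabilities open, so this sketch assumes probabilities that vary linearly with the position of the current distance between the two nearest boundaries; under that assumption a distance of 2.9 between the boundaries 2 and 3 gives probabilities of 0.1 and 0.9. The function names are hypothetical.

```python
import random


def boundary_probabilities(d, lower, upper):
    """Probabilities of choosing lower and upper as the target (assumed linear).

    d is the current distance, lying in the gap between two allowed ranges
    whose nearest boundaries are lower < d < upper.
    """
    p_upper = (d - lower) / (upper - lower)
    return 1.0 - p_upper, p_upper


def stochastic_target(d, lower, upper, rng):
    """Pick one of the two boundaries at random using the probabilities above."""
    p_lower, _ = boundary_probabilities(d, lower, upper)
    return lower if rng.random() < p_lower else upper
```

Passing a seeded `random.Random` instance as `rng` keeps a refinement run reproducible. Any other probability model could be substituted for the linear one.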
VI. Pairwise Relationship Matrices with Unbounded Uncertainties (Corrupt Data)

The ideas described in the preceding sections can be applied when some of the pairwise relationships are thought to contain corrupt data, that is, when some of the pairwise relationships are incorrect and bear essentially no relationship to the actual values. In this case, the "problematic" relationships can be detected during the course of the algorithm and removed from subsequent processing. In other words, the objective is to identify the corrupt entries and remove them from the relationship matrix. This process results in a sparse relationship matrix, which can be refined using the algorithm in Section IV above.
VII. Modifications of the Basic Algorithm
In many cases, the algorithm described above can be accelerated by pre-ordering the data using an appropriate statistical method. For example, if the proximities are derived from data that is available in vector or binary form, the initial configuration of the points on the display map can be computed using Principal Component Analysis. In a preferred embodiment, the initial configuration can be constructed from the first 3 principal components of the feature matrix (i.e., the 3 latent variables that account for most of the variance in the data). In practice, this technique can have profound effects on the speed of refinement. Indeed, if a random initial configuration is used, a significant portion of the training time is spent establishing the general structure and topology of the display map, which is typically characterized by large rearrangements. If, on the other hand, the input configuration is partially ordered, the error criterion can be reduced relatively quickly to an acceptable level.
If the data is highly clustered, then by virtue of the sampling process the low-density areas can be refined less effectively than the high-density areas. In an exemplary embodiment, this tendency can be partially compensated for by a modification to the original algorithm which increases the sampling probability in low-density areas. In one embodiment, the center of mass of the display map is identified, and concentric shells centered at this point are constructed. A series of regular refinement iterations is then carried out, each time selecting points from within or between these shells. This process is repeated for a prescribed number of cycles. This phase is then followed by a regular refinement phase using global sampling, and the process is repeated.
In general, the basic algorithm does not distinguish short-range distances from long-range distances. Equations 10 and 11 describe a method to ensure that short-range distances are preserved more accurately than long-range distances through the use of weighting.
An alternative (and complementary) approach is to ensure that points at close separation are sampled more extensively than points at long separation. For example, an alternating sequence of global and local refinement cycles, similar to the one described above, can be employed. In this embodiment, a global refinement phase is carried out first, after which the resulting display map is partitioned into a regular grid. The points (objects) in each cell of the grid are then subjected to a local refinement phase (i.e., only points within the same cell are compared and refined). Preferably, the number of sampling steps in each cell should be proportional to the number of points contained in that cell. This process is highly parallelizable. This local refinement phase is then followed by another global refinement phase, and the process is repeated for a prescribed number of cycles, or until the embedding error is minimized within a prescribed tolerance. Alternatively, the grid method can be replaced by another suitable method for identifying nearby points, such as a k-d tree, for example.
The methods described herein can be used for incremental refinement. That is, starting from an organized display map of a set of points, a new set of points can be added without modifying the original map. Strictly speaking, this is statistically acceptable if the new set of points is significantly smaller than the original set. In an exemplary embodiment, the new set of points can be "diffused" into the existing map, using a modification of the basic algorithm described above. In particular, Equations 8 and 9 can be used to update only the incoming points. In addition, the sampling procedure ensures that the selected pairs contain at least one point from the incoming set. That is, two points are selected at random so that at least one of them belongs to the incoming set. Alternatively, each new point can be diffused independently using the approach described above.
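The incremental refinement just described can be sketched in Python as follows. Equations 8 and 9 are not reproduced in this section, so the pairwise correction below is an assumed stand-in (the sampled new point is moved along the line joining the pair so that the map distance approaches the target distance); the function name `diffuse_new_points`, its parameters, and the linearly decreasing learning rate are likewise illustrative assumptions.

```python
import numpy as np


def diffuse_new_points(coords, n_fixed, target, n_steps=5000, rate=0.5, seed=0):
    """Place new points on an existing map without moving the original points.

    coords  : (n, d) array; the first n_fixed rows are the existing map,
              the remaining rows are initial guesses for the new points.
    target  : (n, n) array of target distances (relationships).
    Assumed update rule, standing in for Eqs. 8 and 9 of the text.
    """
    rng = np.random.default_rng(seed)
    coords = np.array(coords, dtype=float)  # work on a copy
    n = len(coords)
    for step in range(n_steps):
        lam = rate * (1.0 - step / n_steps)  # assumed decreasing learning rate
        i = int(rng.integers(n_fixed, n))    # always one of the new points
        j = int(rng.integers(n))             # any point, old or new
        if i == j:
            continue
        delta = coords[i] - coords[j]
        d = np.linalg.norm(delta)
        if d == 0.0:
            continue
        # Move only the new point i; existing points are never updated.
        coords[i] += lam * (target[i, j] - d) / d * delta
    return coords
```

Because every sampled pair contains a new point and only that point is moved, the original map is left untouched, which is the property the text asks for.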
VIII. Evaluation Properties (Features), Relationships and Distance Measures
In an exemplary embodiment, the relationships between objects can be represented as similarities/dissimilarities between the objects on a display map, and can be derived from properties or features associated with the objects. Any similarity measure can be used to construct the display map. The properties or features that are used to evaluate similarity or dissimilarity are sometimes collectively called "evaluation properties." For example, if the objects are chemical compounds, the similarity between the objects can be based on structural similarity, chemical similarity, physical similarity, biological similarity, and/or some other type of similarity measure that can be derived from the structure or identity of the compounds.
A. Evaluation Properties Having Continuous or Discrete Real Values
Similarity measures can be derived from a list of evaluation properties associated with a set of objects. For example, if the objects are chemical compounds, the evaluation properties can be physical, chemical and/or biological properties associated with a set of chemical compounds. Under this formalism, objects can be represented as vectors in a multivariate property space, and their similarity can be computed by some geometrical distance measure. In an exemplary embodiment, the property space is defined using one or more features or descriptors. For the chemical compound example, the property space can be defined using one or more molecular features or descriptors. Such molecular features may include topological indices, physicochemical properties, electrostatic field parameters, volume and surface parameters, etc. These features may include, but are not limited to, molecular volume and surface areas, dipole moments, octanol-water partition coefficients, molar refractivities, heats of formation, total energies, ionization potentials, molecular connectivity indices, 2D and 3D autocorrelation vectors, 3D structural and/or pharmacophoric parameters, electronic fields, etc. It should be understood, however, that the present invention is not limited to this embodiment. For example, molecular features may include the observed biological activities of a set of compounds against an array of biological targets such as enzymes or receptors (also known as affinity fingerprints). Indeed, any vector representation of chemical data can be used in the present invention. It should also be understood that the present invention is not limited to application with chemical compound objects.
Instead, the present invention can be implemented with any data set or objects, including objects that are associated with evaluation properties having continuous or discrete real values.
1. Relationships or Distance Measures Where Values of Evaluation Properties Are Continuous or Discrete Real Numbers
A "distance measure" is an algorithm or technique used to determine a relationship between objects, based on selected evaluation properties. The particular distance measure that is used in any given situation depends, at least in part, on the set of values that the evaluation properties can take. For example, where the evaluation properties can take real numbers as values, a suitable distance measure is the Minkowski metric, shown in Equation 14:
d_{ij} = \left( \sum_{k} \left| x_{ik} - x_{jk} \right|^{r} \right)^{1/r}   (Eq. 14)
where x_{ik} and x_{jk} are the k-th elements of the property vectors of objects i and j, k is used to index the elements of the property vector, and r \in [1, \infty). For r = 1.0, Equation 14 is the city-block or Manhattan metric. For r = 2.0, Equation 14 is the ordinary Euclidean metric. For r = \infty, Equation 14 is the maximum of the absolute coordinate distances, also referred to as the "dominance" metric, the "sup" metric, or the "ultrametric" distance. For any value of r \in [1, \infty), it can be shown that the Minkowski metric is a true metric, that is, it obeys the distance postulates and, in particular, the triangle inequality.
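As an illustration of the Minkowski metric and its three named special cases, the following Python sketch may help. The function name `minkowski` and the use of `float("inf")` to select the maximum-coordinate case are choices made for this example only.

```python
import numpy as np


def minkowski(x, y, r=2.0):
    """Minkowski distance between two real-valued property vectors.

    r = 1 gives the city-block (Manhattan) metric, r = 2 the Euclidean
    metric, and r = inf the maximum absolute coordinate difference.
    """
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(r):
        return float(diff.max())
    if r < 1:
        raise ValueError("r must be at least 1 for a true metric")
    return float((diff ** r).sum() ** (1.0 / r))
```

For the property vectors (0, 0) and (3, 4), the three cases evaluate to 7 (r = 1), 5 (r = 2) and 4 (r = infinity).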
B. Evaluation Properties Having Binary Values
Alternatively, the evaluation properties of the objects can be represented in binary form, where bits are used to indicate the presence or absence, or the potential presence or absence, of features or characteristics. For example, if the objects are chemical compounds, the objects can be encoded using substructure keys, where each bit denotes the presence or absence of a specific structural feature or pattern in the target molecule. Such features may include, but are not limited to, the presence, absence or minimum number of occurrences of a particular element (e.g., the presence of at least 1, 2 or 3 nitrogen atoms), unusual or important electronic configurations and atom types (e.g., doubly bonded nitrogen or aromatic carbon), common functional groups such as alcohols, amines, etc., certain primitive and composite rings, a pair or triplet of pharmacophoric groups at a particular separation in 3-dimensional space, and "disjunctions" of unusual features that are rare enough not to merit an individual bit, yet are extremely important when they do occur. Typically, these unusual features are assigned a common bit that is set if any one of the patterns is present in the target molecule.
Alternatively, the evaluation properties of the compounds can be encoded in the form of binary fingerprints, which do not depend on a predefined fragment or feature dictionary to perform the bit assignment. Instead, every pattern in the molecule up to a predefined limit is systematically enumerated, and serves as input to a hashing algorithm that turns "on" a small number of bits at pseudo-random positions in the bitmap. Although it is conceivable that two different molecules can have exactly the same fingerprint, the probability of this happening is extremely small for all but the simplest cases. Experience suggests that these fingerprints contain sufficient information about the molecular structures to permit meaningful similarity comparisons.
1. Distance Measures Where Values of Evaluation Properties Are Binary
A number of relationship measures can be used with binary descriptors (i.e., where the evaluation properties are binary or binary fingerprints). Some of the most frequently used are the normalized Hamming distance:
H = \frac{|XOR(x, y)|}{N}   (Eq. 15)
which measures the number of bits that are different between x and y, the Tanimoto or Jaccard coefficient:
T = \frac{|AND(x, y)|}{|IOR(x, y)|}   (Eq. 16)
which is a measure of the number of substructures shared by two molecules relative to the ones they could have in common, and the Dice coefficient:
D = \frac{2\,|AND(x, y)|}{|x| + |y|}   (Eq. 17)
In the equations listed above, AND(x, y) is the intersection of the binary sets x and y (the bits that are "on" in both sets), IOR(x, y) is the union or "inclusive or" of x and y (the bits that are "on" in either x or y), XOR(x, y) is the "exclusive or" of x and y (the bits that are "on" in either x or y, but not in both), |x| is the number of bits that are "on" in x, and N is the length of the binary sets measured in bits (a constant). Another popular metric is the Euclidean distance which, in the case of binary sets, can be recast in the form:
E = \sqrt{N - |XOR(x, NOT(y))|}   (Eq. 18)
where NOT(y) denotes the binary complement of y. The expression |XOR(x, NOT(y))| represents the number of bits that are identical in x and y (either ones or zeros). The Euclidean distance is a good measure of similarity when the binary sets are relatively rich, and is mostly used in situations in which similarity is measured in a relative sense.
In the compound example, the distance between objects can be determined using a binary or multivariate representation. However, the present invention is not limited to this embodiment. For example, the similarity between two compounds can be determined by comparing the shapes of the molecules using a suitable 3-dimensional alignment method, or can be inferred by a similarity model defined according to a prescribed procedure. For example, one such similarity model can be a neural network trained to predict a similarity coefficient given a suitably encoded pair of compounds. Such a neural network can be trained using a training set of structure pairs and a known similarity coefficient for each such pair, as determined by user input, for example.
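The binary measures defined above (normalized Hamming distance, Tanimoto/Jaccard coefficient, Dice coefficient, and the binary form of the Euclidean distance) can be computed as in the Python sketch below. Representing fingerprints as NumPy boolean arrays, the function name `binary_measures`, and the convention of returning 1.0 for the Tanimoto and Dice coefficients when both fingerprints are empty are assumptions of this example.

```python
import math

import numpy as np


def binary_measures(x, y):
    """Return common relationship measures for two equal-length bit vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    if x.shape != y.shape:
        raise ValueError("fingerprints must have the same length")
    n = x.size
    both = np.count_nonzero(x & y)      # |AND(x, y)|
    either = np.count_nonzero(x | y)    # |IOR(x, y)|
    differ = np.count_nonzero(x ^ y)    # |XOR(x, y)|
    same = np.count_nonzero(x ^ ~y)     # |XOR(x, NOT(y))|: identical bits
    on_total = np.count_nonzero(x) + np.count_nonzero(y)  # |x| + |y|
    return {
        "hamming": differ / n,
        # Assumed convention: two empty fingerprints count as identical.
        "tanimoto": both / either if either else 1.0,
        "dice": 2 * both / on_total if on_total else 1.0,
        "euclidean": math.sqrt(n - same),
    }
```

Note that `n - same` equals the number of differing bits, so the Euclidean form is simply the square root of the unnormalized Hamming count.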
C. Scaling of Evaluation Properties
Referring back to Equation 14, the features (i.e., the evaluation properties) can be scaled differently to reflect their relative importance in assessing the relationship between compounds. For example, property A can be assigned a weight of 2, and property B can be assigned a weight of 10. Property B will therefore have five times more impact on the relationship calculation than property A. Accordingly, Equation 14 can be replaced by Equation 19:
d_{ij} = \left( \sum_{k} \left( w_{k} \left| x_{ik} - x_{jk} \right| \right)^{r} \right)^{1/r}   (Eq. 19)
where w_k is the weight of the k-th property. An example of such a weighting factor is a normalization coefficient. However, other weighting schemes can also be used. The scaling (weights) need not be uniform throughout the entire map, i.e., the resulting map need not be isomorphic. Hereafter, maps derived from uniform weights will be referred to as globally weighted (isomorphic), whereas maps derived from non-uniform weights will be referred to as locally weighted (non-isomorphic). On locally weighted maps, the relationships (or distances) on the display map may reflect a local measure of similarity. That is, what determines similarity in one domain of the display map is not necessarily the same as what determines similarity in another domain of the display map. For example, locally weighted maps can be used to reflect similarities derived from a locally weighted case-based learning algorithm. Locally weighted learning uses locally weighted training to average, interpolate between, extrapolate from, or otherwise combine training data. Most learning methods (also referred to as modeling or prediction methods) construct a single model to fit all of the training data. Local models, on the other hand, attempt to fit the training data in a local region around the location of the query.
Examples of local models include nearest neighbors, weighted average, and locally weighted regression. Locally weighted learning is reviewed in Vapnik, in Advances in Neural Information Processing Systems, 4: 831, Morgan-Kaufman, San Mateo, CA (1982); Bottou and Vapnik, Neural Computation, 4 (6): 888 (1992), and Vapnik and Bottou, Neural Computation, 5 (6): 893 (1993), all of which are incorporated herein by reference in their entirety.
Display maps can also be constructed from a relationship matrix that is not strictly symmetric, that is, a relationship matrix where r_{ij} \neq r_{ji}. A potential use of this approach is in situations where a relationship (that is, the relationship function) is defined locally, for example, in a locally weighted model using a point-based local distance function. In this embodiment, each training case is associated with a distance function and the values of the corresponding parameters. Preferably, to construct a display map that reflects these local distance relationships, the distance between two points is evaluated twice, using the local distance functions of the respective points. The resulting distances are averaged, and are used as input to the display mapping algorithm described above. If the point-based local distance functions vary in some continuous or semi-continuous manner throughout the feature space, this approach could potentially lead to a meaningful projection.
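The averaging of two point-based local distances described above can be sketched in Python as follows. The sketch assumes that each point carries its own weight vector and that the weights scale the coordinate differences inside a Minkowski-type distance; both the weighting form and the names `weighted_minkowski` and `symmetrized_local_distance` are assumptions made for illustration.

```python
import numpy as np


def weighted_minkowski(x, y, w, r=2.0):
    """Minkowski-type distance with per-property weights (assumed form)."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    scaled = np.asarray(w, dtype=float) * diff
    return float((scaled ** r).sum() ** (1.0 / r))


def symmetrized_local_distance(x_i, x_j, w_i, w_j, r=2.0):
    """Evaluate the distance twice, once per point's local weights, and average."""
    d_ij = weighted_minkowski(x_i, x_j, w_i, r)  # distance as seen from point i
    d_ji = weighted_minkowski(x_j, x_i, w_j, r)  # distance as seen from point j
    return 0.5 * (d_ij + d_ji)
```

The averaged value is symmetric in i and j by construction, so it can be fed to a mapping algorithm that expects a single target distance per pair.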
IX. Implementation of the Invention
A. Generally
The invention can be implemented in a variety of ways, using a variety of algorithms, and can be implemented using hardware, software, firmware or any combination thereof. Referring to Figure 6, an exemplary block diagram illustrates modules and data flow that can be included in a system 610 implementing the present invention. The block diagram of Figure 6 is intended to aid in the understanding of the present invention. The present invention is not limited to the exemplary embodiment illustrated in the block diagram of Figure 6. The system 610 includes a relationship database 612, which stores relationship data 630 associated with the objects. The types of data and associated relationships that can be accommodated by the relationship database 612 are without bound, because the present invention can be implemented with any type of data for which relationships can be defined. The relationship data 630 can be provided from one or more of a variety of sources. For example, relationship data 630a can be provided by an external source 632, relationship data 630b can be provided from other sources 640, and relationship data 630n can be generated by an optional relationship generator module 634, based on evaluation properties 636. The optional relationship generator module 634 can include hardware, software, firmware or any combination thereof for executing one or more algorithms such as, for example, one or more of Equations 14-19. The relationship data 630 is provided to a coordinate module 616. In an exemplary embodiment, the relationship data 630 is provided to the coordinate module 616 as a relationship matrix 614, which is preferably a matrix that stores any amount of the relationship data 630 from the relationship database 612.
The coordinate module 616 assigns initial coordinates to the data points or objects that are related by the relationship data 630. The initial coordinates can be assigned at random or by means of any other technique. For example, the data can be pre-ordered or partially ordered. The coordinates constitute a display map. The display map can be a linear display or screen map. The display map is an n-dimensional display map. Coordinate/relationship subsets 618 and associated relationships 620 are provided to a coordinate revision module 622. In an exemplary embodiment, one coordinate/relationship subset 618 is provided to the coordinate revision module 622 at a time.
A subset selector module 638 can be provided to select the coordinate/relationship subsets 618 to be provided to the coordinate revision module 622. The subset selector module 638 can select the coordinate/relationship subsets 618 at random or by any other suitable method, including one or more of the methods described above. The coordinate revision module 622 revises the positions of the objects on the display map (i.e., revises the coordinates 618) based on the precise or imprecise measurements of similarity/dissimilarity (relationships 620). More specifically, the coordinate revision module 622 measures the distances between the objects on the display map and compares them with the associated relationships 620. The coordinate revision module 622 then revises the coordinates 618 based on the comparisons. Such distances can be used directly, or to modify other attributes of the display. The coordinate revision module 622 can include hardware, software, firmware or any combination thereof for executing one or more conventional multidimensional scaling or non-linear mapping algorithms, as described above. Additionally, or alternatively, the coordinate revision module 622 can include hardware, software, firmware or any combination thereof for executing one or more algorithms for pairwise analysis such as, for example, one or more of Equations 8 through 13, or variations thereof. When the coordinate revision module 622 performs pairwise analysis as described above, a learning rate can be applied to ensure convergence of the distance between the coordinates in the coordinate/relationship subsets 618 and the associated relationship(s) 620. The coordinate revision module 622 can be designed to represent precise or imprecise measurements of similarity/dissimilarity (relationships 620).
For example, the coordinate revision module 622 can be programmed to handle complete pairwise matrices that do not have uncertainties, sparse pairwise matrices that do not have uncertainties, pairwise matrices that include bounded uncertainties, and pairwise matrices that include unbounded uncertainties (i.e., corrupt data), or any combination thereof. The coordinate revision module 622 can also be programmed to diffuse additional objects or data points into a set of objects, as described above. The coordinate revision module 622 generates revised coordinates 624, which are returned to the coordinate module 616. This process is repeated for additional subsets of coordinates 618 and associated relationships 620, and is preferably repeated on the same coordinate/relationship subsets 618 and associated relationships 620, until a prescribed tolerance or some other criterion is met. In an exemplary embodiment, where display of the relationships between objects is desired, the coordinates 626 can be provided to an optional display module 628 for display. As the iterative process of the invention continues, the revised coordinates 626 are provided to the optional display module 628.
B. Implementation of the Invention in a Computer Program Product
The present invention can be implemented using one or more computers. Referring to Figure 2, an exemplary computer 202 includes one or more processors, such as the processor 204. The processor 204 is connected to a communications bus 206. Various software embodiments are described in terms of this exemplary computer system. After reading this description, it will become apparent to a person skilled in the relevant art(s) how to implement the invention using other computer systems and/or computer architectures. The computer 202 also includes a main memory 208, preferably a random access memory (RAM), and can also include one or more secondary storage devices 210. The secondary storage devices 210 can include, for example, a hard disk drive 212 and/or a removable storage drive 214, which represents a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 214 reads from and/or writes to a removable storage unit 216 in a well-known manner. The removable storage unit 216 represents a floppy disk, a magnetic tape, an optical disk, etc., which is read by and written to by the removable storage drive 214. The removable storage unit 216 includes a computer-usable storage medium having stored therein computer software and/or data. In alternative embodiments, the computer 202 can include other similar means for allowing computer programs or other instructions to be loaded into the computer 202. Such means can include, for example, a removable storage unit 220 and an interface 218.
Examples of such means can include a program cartridge and cartridge interface (such as those found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units 220 and interfaces 218 which allow software and data to be transferred from the removable storage unit 220 to the computer 202. The computer 202 can also include a communications interface 222. The communications interface 222 allows software and data to be transferred between the computer 202 and external devices. Examples of the communications interface 222 include, but are not limited to, a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via the communications interface 222 are in the form of signals (typically data on a carrier) which can be electronic, electromagnetic, optical or other signals capable of being received by the communications interface 222. In this document, the term "computer program product" is used to refer generally to media such as the removable storage units 216, 220, a hard disk drive 212 that can be removed from the computer 202, and signals carrying software received by the communications interface 222. These computer program products are means for providing software to the computer 202. Computer programs (also called computer control logic) are stored in the main memory and/or the secondary storage devices 210. Computer programs can also be received via the communications interface 222. Such computer programs, when executed, enable the computer 202 to perform the features of the present invention as described herein. In particular, the computer programs, when executed, enable the processor 204 to perform the features of the present invention.
Accordingly, such computer programs represent controllers of the computer 202. In an embodiment where the invention is implemented using software, the software can be stored in a computer program product and loaded into the computer 202 using the removable storage drive 214, the hard disk drive 212, and/or the communications interface 222. The control logic (software), when executed by the processor 204, performs the functions of the invention as described herein. In another embodiment, the automated portion of the invention is implemented primarily or entirely in hardware using, for example, hardware components such as application-specific integrated circuits (ASICs). Implementation of a hardware state machine to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
In yet another embodiment, the invention is implemented using a combination of both hardware and software. The computer 202 can be any suitable computer, such as a computer system running an operating system that supports a graphical user interface and a windowing environment. A suitable computer system is a Silicon Graphics, Inc. (SGI) workstation/server, a Sun workstation/server, a DEC workstation/server, an IBM workstation/server, an IBM-compatible PC, an Apple Macintosh, or any other suitable computer system, such as one using one or more processors from the Intel Pentium family, such as Pentium Pro or Pentium II. Suitable operating systems include, but are not limited to, IRIX, OS/Solaris, Digital Unix, Microsoft Windows 95/NT, Apple Mac OS, or any other operating system. For example, in an exemplary embodiment the program can be implemented and run on a Silicon Graphics Octane workstation running the IRIX 6.4 operating system, and using the Motif graphical user interface based on the X Window System.
C. Operation of the Present Invention
Referring to Figure 7, the operation of the present invention is illustrated in a process flowchart 700. The operation is illustrated for a general case where the relationship matrix 614 is a complete pairwise relationship matrix without uncertainties. Based on the descriptions above and the process flowchart 700, a person skilled in the relevant art(s) will be able to modify the process flowchart 700 to accommodate other situations such as, for example: where the relationship matrix 614 is a sparse pairwise or n-wise relationship matrix without uncertainties; where the relationship matrix 614 is a pairwise or n-wise relationship matrix with bounded uncertainties; where the relationship matrix 614 is a pairwise relationship matrix with unbounded uncertainties (i.e., corrupt data); etc. The process for the general case where the relationship matrix 614 is a complete pairwise relationship matrix without uncertainties begins at step 702, where the coordinate module 616 receives the relationship matrix 614 from the relationship database 612. In step 704, the coordinate module 616 assigns initial coordinates to the objects associated with the relationships in the relationship matrix 614. The initial coordinates can be assigned at random. Alternatively, the initial coordinates can be pre-ordered or partially pre-ordered. In step 706, a coordinate/relationship subset 618 is selected from the relationship matrix 614 for revision. The subset 618 can be selected randomly, semi-randomly, systematically, partially systematically, etc., by the subset selector 638. In step 708, the selected subset 618 and an associated relationship 620 are provided to the coordinate revision module 622. The coordinate revision module 622 revises the coordinates in the coordinate/relationship subset 618, based on the associated relationships 620.
In step 710, a determination is made whether to select another subset for coordinate revision. If another coordinate/relationship subset 618 is to be revised, processing returns to step 706 for selection of another coordinate/relationship subset 618. Otherwise, processing stops at step 712. In an optional exemplary embodiment, the coordinates 626 are provided in step 714 to the optional display module 628 for display. Step 714 can be performed at any time during one or more of steps 706-712. In another optional exemplary embodiment, the relationship data 630 is generated prior to step 702. In this optional exemplary embodiment, the evaluation properties 636 are received in step 716. In step 718, the relationship generator 634 generates the relationship data 630 from the evaluation properties. In step 720, the relationship data 630 is provided to the relationship database 612. Processing then proceeds to step 702, where the relationship data 630 is provided to the coordinate module 616 in the form of the relationship matrix 614.
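The process flow of steps 702 through 712 can be sketched in Python for the general case of a complete pairwise relationship matrix without uncertainties. This is only an illustrative sketch: the pairwise correction stands in for Equations 8 and 9, which are not reproduced in this section, and the function name `refine_map`, the linearly decreasing learning rate, and the fixed step count used as the stopping criterion are all assumptions.

```python
import numpy as np


def refine_map(target, n_dims=2, n_steps=20000, rate=1.0, seed=0):
    """Iteratively refine random coordinates so map distances match targets.

    target : (n, n) symmetric array of pairwise relationships (step 702).
    """
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    n = target.shape[0]
    coords = rng.random((n, n_dims))             # step 704: random initial coordinates
    for step in range(n_steps):                  # step 710: loop until the budget is spent
        lam = rate * (1.0 - step / n_steps)      # assumed decreasing learning rate
        i, j = (int(v) for v in rng.integers(n, size=2))  # step 706: pick a pair
        if i == j:
            continue
        delta = coords[i] - coords[j]
        d = np.linalg.norm(delta)
        if d == 0.0:
            continue
        # Step 708: revise both points so their distance moves toward the target.
        shift = 0.5 * lam * (target[i, j] - d) / d * delta
        coords[i] += shift
        coords[j] -= shift
    return coords                                # step 712: stop


def map_distances(coords):
    """Pairwise Euclidean distances between the rows of coords."""
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

A prescribed tolerance on the difference between `map_distances(coords)` and `target` could replace the fixed step count as the stopping test in step 710; the fixed budget is used here only to keep the sketch short.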
X. Examples of the Invention
The present invention can be implemented in a variety of applications and with a variety of data types. In an exemplary embodiment, the present invention can be implemented as a system, method, and/or computer program product for interactively visualizing and analyzing data relating to chemical compounds, where the distances between objects in a multidimensional space represent the similarities and/or dissimilarities of the corresponding compounds (relative to selected properties or features of the compounds) computed by some prescribed method. The resulting maps can be displayed on a suitable graphics device (such as a graphics terminal, for example), and analyzed interactively to reveal relationships between the data, and to initiate an array of tasks related to these compounds. A user can select a plurality of compounds to map, and a method for evaluating the similarity/dissimilarity among the selected compounds. A display map can be generated in accordance with the selected compounds and the selected method. The display map has a point for each of the selected compounds, where the distance between any two points is representative of the similarity/dissimilarity between the corresponding compounds. A portion of the display map is then displayed. Users can interactively analyze the compounds represented on the display map. Alternatively, each point may correspond to multiple objects or compounds. Figure 1 is a block diagram of a computing environment 102 according to an exemplary embodiment of the present invention. A chemical data visualization and interactive analysis module 104 includes a map generation module 106 and one or more auxiliary user interface components 108.
The map generation module 106 determines the similarities between chemical compounds relative to one or more selected properties or features (herein sometimes called evaluation properties or features) of the compounds. The map generation module 106 performs this function by retrieving and analyzing data on chemical compounds and reagents from one or more databases 120. The chemical data visualization and interactive analysis module 104 communicates with the one or more databases 120 through the communication medium 118. The communication medium 118 is preferably any type of data communication means, such as a data bus, a computer network, etc.

The user interface modules 108 display a preferably 2D or 3D display map on a suitable graphics device. The user interface modules 108 enable human operators to interactively analyze and process the information on the display map to reveal relationships in the data, and to initiate an array of tasks related to the corresponding compounds. The user interface modules 108 also enable users to organize compounds as collections (representing, for example, a combinatorial library). Information pertaining to compound collections is preferably stored in one or more databases 120.

The input device(s) 114 receive input (such as data, commands, queries, etc.) from human operators and forward such input to, for example, the chemical data visualization and interactive analysis module 104 through the communication medium 118. Any suitable, well-known input device can be used in the present invention, such as a keyboard, pointing device (mouse, roller ball, track ball, light pen, etc.), touch screen, voice recognition, etc. User input can also be stored and then retrieved, as appropriate, from command/data files.
The output device(s) 116 output information to human operators. Any suitable, well-known output device can be used in the present invention, such as a monitor, a printer, a floppy disk drive, a text-to-speech synthesizer, etc. The chemical data visualization and interactive analysis module 104 may interact with one or more computing modules 122 through the communication medium 118. The components shown in the computing environment 102 of Figure 1 (such as the chemical data visualization and interactive analysis module 104) can be implemented using one or more computers, such as the exemplary computer 202 shown in Figure 2.
A. Operation of the Exemplary Embodiment

The operation of the present invention, as implemented to interactively visualize and process chemical compounds in a display map, will now be described with reference to a flow chart 302 shown in Figure 3. Unless otherwise specified, the interaction with users described below is achieved by operation of the user interface modules 108 (Figure 1).

In step 304, the user selects one or more compounds to map in a new display map. The user can select compounds for mapping by retrieving a list of compounds from a file, by manually typing in a list of compounds, and/or by using a graphical user interface (GUI). The invention contemplates other means for enabling the user to specify compounds to display in a display map.

In step 306, the user selects a method to be used for evaluating the molecular similarity or dissimilarity between the compounds selected in step 304. In one embodiment, the similarity/dissimilarity between the compounds selected in step 304 is determined (in step 308) based on a prescribed set of evaluation properties. As described above, the evaluation properties can be any properties related to the structure, function, or identity of the compounds selected in step 304. The evaluation properties include, but are not limited to, structural properties, functional properties, chemical properties, physical properties, biological properties, etc., of the compounds selected in step 304.

In one embodiment of the present invention, the selected evaluation properties can be scaled differently to reflect their relative importance in assessing the proximity (i.e., similarity or dissimilarity) between two compounds. Accordingly, also in step 306, the user selects a scaling factor for each of the selected evaluation properties. Note that such selection of scaling factors is optional.
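The property scaling described here can be illustrated with a minimal sketch. It assumes, purely for illustration, that each compound is described by a numeric property vector and that dissimilarity is a weighted Euclidean distance; the text leaves the actual measure open. The function names `scaled_dissimilarity` and `relationship_matrix`, and the idea of keying scaling factors by property index, are hypothetical. Properties without a user-selected factor fall back to unity.

```python
import math


def scaled_dissimilarity(props_a, props_b, scale_factors=None):
    """Weighted Euclidean distance between two property vectors.

    scale_factors maps a property index to its scaling factor
    (hypothetical representation). Properties with no entry use 1.0.
    """
    scale_factors = scale_factors or {}
    total = 0.0
    for k, (a, b) in enumerate(zip(props_a, props_b)):
        weight = scale_factors.get(k, 1.0)  # default scaling factor: unity
        total += (weight * (a - b)) ** 2
    return math.sqrt(total)


def relationship_matrix(properties, scale_factors=None):
    """Symmetric matrix of pairwise dissimilarities for a list of compounds."""
    n = len(properties)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = scaled_dissimilarity(properties[i], properties[j], scale_factors)
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

For example, `relationship_matrix([[0.0, 0.0], [3.0, 4.0]], {0: 2.0})` doubles the contribution of the first property before the distance is taken, so that property weighs more heavily in the resulting map.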
The user need not select a scaling factor for each selected evaluation property. If the user does not select a scaling factor for a given evaluation property, then that evaluation property is given a default scaling factor, such as unity.

Alternatively, in step 306, the user may choose to retrieve similarity/dissimilarity values pertaining to the compounds selected in step 304 from a source, such as a database. These similarity/dissimilarity values in the database were previously generated. In another embodiment, the user in step 306 may choose to determine the similarity/dissimilarity values using any well-known technique or procedure.

In step 308, the map generation module 106 generates a new display map. This new display map includes a point for each of the compounds selected in step 304. Also, in this new display map, the distance between any two points is representative of the similarity/dissimilarity of the corresponding compounds. The manner in which the map generation module 106 generates the new display map will now be further described with reference to the flow diagram 402 in Figure 4.

In step 404, the coordinates on the new display map of the points corresponding to the compounds selected in step 304 are initialized. In step 406, two of the compounds i, j selected in step 304 are selected for processing. In step 408, the similarity/dissimilarity r_ij between the compounds i, j is determined based on the method selected by the user in step 306. In step 410, based on the similarity/dissimilarity r_ij determined in step 408, the coordinates of the points corresponding to the compounds i, j on the display map are revised. In step 412, the training/learning parameters are updated. In step 414, a decision is made whether or not to terminate. If the decision is not to terminate at this point, then control returns to step 406. Otherwise, step 416 is performed.
In step 416, the display map is output (i.e., generation of the display map is complete). The details of the steps of the flow diagram 402 are described above.

Referring again to Figure 3, in step 312 the map viewer 112 displays the new display map on an output device 116 (such as a computer graphics monitor). In step 314, the user interface modules 108 enable operators to interactively analyze and process the compounds represented in the displayed display map.

The present invention enables users to modify existing compound display maps (as used herein, the term "compound display map" refers to a rendered display map). For example, users can add compounds to the map, remove compounds from the map, highlight compounds on the map, etc. In such cases, the pertinent functional steps of the flow diagram 302 are repeated. For example, steps 304 (selecting compounds to map), 310 (generating the display map), and 312 (displaying the map) are repeated when the user chooses to add new compounds to an existing map. However, according to one embodiment of the invention, the map is incrementally refined and displayed in steps 310 and 312 when compounds are added to an existing compound display map (this incremental refinement is described above).

The chemical compound example provided above is useful for interactively visualizing and processing any chemical entities including, but not limited to, small molecules, polymers, peptides, proteins, etc. It can also be used to display different similarity relationships between these compounds.
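The map-generation loop of flow diagram 402 (steps 404 to 416) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the update rule (moving both points along the line joining them by half of the scaled distance error), the linearly decaying learning rate, the fixed iteration count used as the termination test, and the name `generate_map` with its parameters are all assumptions introduced here. Relationships are assumed to be supplied as a dictionary mapping a pair (i, j) to a target distance r_ij.

```python
import math
import random


def generate_map(relationships, dims=2, n_steps=20000,
                 rate_start=1.0, rate_end=0.01, seed=0):
    """Place points so that pairwise distances approximate the targets.

    relationships: dict {(i, j): r_ij} of target distances (assumed format).
    Returns a list of coordinate lists, one per object.
    """
    rng = random.Random(seed)
    n = 1 + max(max(i, j) for i, j in relationships)
    # Step 404: initialise the coordinates (here: at random).
    coords = [[rng.random() for _ in range(dims)] for _ in range(n)]
    pairs = list(relationships)
    rate = rate_start
    for step in range(n_steps):
        # Step 406: select a pair of objects i, j.
        i, j = rng.choice(pairs)
        # Step 408: determine the similarity/dissimilarity r_ij.
        r_ij = relationships[(i, j)]
        # Step 410: revise the coordinates of points i and j so that their
        # distance d_ij moves toward r_ij (assumed update rule).
        d_ij = math.dist(coords[i], coords[j])
        shift = 0.5 * rate * (r_ij - d_ij) / (d_ij + 1e-9)
        for k in range(dims):
            delta = shift * (coords[i][k] - coords[j][k])
            coords[i][k] += delta
            coords[j][k] -= delta
        # Step 412: update the learning parameter (assumed linear decay).
        rate = rate_start + (rate_end - rate_start) * (step + 1) / n_steps
    # Steps 414 and 416: stop after a fixed number of steps, output the map.
    return coords
```

Calling `generate_map({(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0})` places three points whose mutual distances approximate a 3-4-5 triangle. Because only one pair is touched per iteration, the cost of each step does not depend on the total number of objects.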
XI. Conclusions

The present invention has been described above with the aid of functional building blocks illustrating the performance of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Any such alternate boundaries are thus within the scope and spirit of the claimed invention and would be apparent to persons skilled in the relevant arts.

These functional building blocks can be implemented by discrete components, application-specific integrated circuits, processors executing appropriate software, and the like, or any combination thereof. It is well within the scope of a person skilled in the relevant arts to develop the appropriate circuitry and/or software to implement these functional building blocks.

Based on the foregoing descriptions and examples, a person skilled in the relevant arts will be able to implement the present invention in a wide variety of applications, all of which are considered within the scope of the invention.

Although various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not as a limitation. Accordingly, the breadth and scope of the present invention should not be limited by any of the exemplary embodiments described above, but should be defined only in accordance with the following claims and their equivalents.
It is noted that, as of this date, the best method known to the applicant for carrying out the aforementioned invention is that which is clear from the present description of the invention.
Having described the invention as above, the content of the following claims is claimed as property.

Claims (26)

CLAIMS
1. A method for representing relationships between objects as distances from one another on a display map, the method characterized in that it comprises the steps of: (1) placing the objects on the display map; (2) selecting a subset of the objects, wherein the selected subset of objects includes associated relationships between the objects in the selected subset; (3) revising the distance(s) between the objects on the display map based on the relationships between the objects and the distance(s); and (4) repeating steps (2) and (3) for additional subsets of objects from the set of objects.
2. The method according to claim 1, characterized in that step (2) comprises the step of: (a) selecting a pair of objects having an associated pairwise relationship.
3. The method according to claim 2, wherein relationships between one or more pairs of objects are unknown, the method characterized in that it further comprises the steps of: (4) performing steps (2) to (4) only for pairs of objects for which an associated relationship is known; and (5) allowing the distances between objects on the display map for which relationships are not known to adapt during performance of steps (2) to (4).
4. The method according to claim 2, wherein one or more pairs of objects are related by bounded uncertainties, the method characterized in that it further comprises the step of: (5) revising the distance on the display map between a pair of objects that are related by a relationship with a bounded uncertainty specified as a set of allowable ranges of relationship values, only when the distance falls outside the specified ranges.
5. The method according to claim 2, wherein one or more pairs of objects are related by bounded uncertainties, the method characterized in that it further comprises the step of: (5) revising the distance on the display map between a pair of objects that are related by a relationship with a bounded uncertainty specified as an upper limit of allowable relationship values, only when the distance falls above the specified upper limit.
6. The method according to claim 2, wherein one or more pairs of objects are related by bounded uncertainties, the method characterized in that it further comprises the step of: (5) revising the distance on the display map between a pair of objects that are related by a relationship with a bounded uncertainty specified as a lower limit of allowable relationship values, only when the distance falls below the specified lower limit.
7. The method according to claim 2, wherein one or more pairs of objects are related by unbounded uncertainties, the method characterized in that it further comprises the steps of: (5) identifying a pair of objects for which the corresponding relationship contains an unbounded uncertainty; (6) removing the relationship that contains the unbounded uncertainty; and (7) allowing the distance between the objects for which the corresponding relationship has been removed to adapt during performance of steps (2) to (4).
8. The method according to claim 2, characterized in that step (3) comprises the step of: (a) revising the distance(s) based on a learning rate.
9. The method according to claim 2, characterized in that step (3) comprises the step of: (a) revising the distance(s) based on a fixed learning rate.
10. The method according to claim 2, characterized in that step (3) comprises the step of: (a) revising the distance(s) based on an adaptive learning rate.
11. The method according to claim 2, characterized in that step (3) comprises the step of: (a) revising the distance(s) based on a dynamic learning rate.
12. The method according to claim 2, characterized in that step (3) comprises the step of: (a) revising the distance(s) based on a learning rate that is a function of the relationship between the selected pair of objects.
13. The method according to claim 2, characterized in that step (3) comprises the step of: (a) revising the distance(s) based on a learning rate that is a function of one or more of the selected objects.
14. The method according to claim 2, characterized in that step (3) comprises the step of: (a) revising the distance(s) based on a learning rate that is a function of the selected pair.
15. The method according to claim 1, characterized in that step (3) comprises the step of: (a) revising the distance(s) using a conventional multidimensional scaling technique.
16. The method according to claim 1, characterized in that step (3) comprises the step of: (a) revising the distance(s) using a conventional non-linear mapping technique.
17. The method according to claim 1, characterized in that step (3) comprises the steps of: (a) computing an error function value using a conventional technique; and (b) revising the distance(s) using a gradient descent procedure.
18. The method according to claim 1, characterized in that the objects are not chemical objects.
19. A method for representing relationships between objects as distances from one another on a display map, the method characterized in that it comprises the steps of: (1) placing the objects on the display map; (2) selecting a subset of the objects, wherein the selected subset of objects includes associated relationships between the objects in the selected subset; (3) selecting a pair of objects from the selected subset, the pair of objects having an associated pairwise relationship; (4) revising the distance(s) between the objects on the display map based on the relationship between the pair of objects and the distance(s); and (5) repeating steps (3) and (4) for additional pairs of objects from the selected subset of objects.
20. The method according to claim 19, characterized in that it further comprises the steps of: (5) selecting a second subset of the objects; and (6) iteratively repeating steps (3) and (4) for pairs of objects in the second selected subset of objects.
21. A system for representing relationships between objects in a set of objects as distances from one another on a display map, characterized in that it comprises: a coordinate module that places the objects on a display map; a subset selector that selects subsets of the objects for revision of the distance(s) between them; and a coordinate revision module that revises the distance(s) between the objects in the selected subset, based on a difference between the distance(s) and the corresponding relationship.
22. The system according to claim 21, characterized in that it further comprises: a subset selector that selects pairs of objects for revision of the distance between them.
23. The system according to claim 21, characterized in that it further comprises: a subset selector that selects more than two objects for revision of the distances between them; and a coordinate revision module that revises the distances between the objects in the selected subsets using conventional techniques.
24. The system according to claim 23, characterized in that it further comprises: a coordinate revision module that computes an error function value using a conventional technique and that revises the distances using a gradient descent procedure.
25. The system according to claim 23, characterized in that it further comprises: a coordinate revision module that computes an error function value using a conventional multidimensional scaling technique.
26. The system according to claim 23, characterized in that it further comprises: a coordinate revision module that computes an error function value using a conventional non-linear mapping technique.

SUMMARY OF THE INVENTION

The present invention relates to a system, method, and computer program product for representing precise or imprecise measurements of similarity/dissimilarity (relationships) between objects as distances between points in a multidimensional space that represents the objects. Self-organizing principles are used to iteratively refine an initial (random or partially ordered) configuration of points using stochastic relationship/distance errors. The data may be complete or incomplete (i.e., some relationships between objects may be unknown), exact or inexact (i.e., some or all of the relationships may be given in terms of allowed ranges or limits), symmetric or asymmetric (i.e., the relationship of object A to object B may not be the same as the relationship of B to A), and may contain systematic or stochastic errors. The relationships between objects can be derived directly from observation, measurement, or a priori knowledge, or can be determined indirectly using any suitable technique for deriving proximity (relationship) data. The present invention iteratively analyzes subsets of the objects in order to represent them in a multidimensional space that represents the objects. In an exemplary embodiment, the present invention iteratively analyzes subsets of objects using conventional multidimensional scaling or non-linear mapping algorithms. In another exemplary embodiment, relationships are defined as pairwise relationships or pairwise similarities/dissimilarities between pairs of objects, and the present invention iteratively analyzes one pair of objects at a time. Preferably, the subsets are evaluated pairwise, as a double-nested loop.
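The treatment of incomplete and inexact data summarized above (relationships that are unknown, or given only as allowed ranges or limits) can be sketched as a small decision helper. The sketch assumes a hypothetical encoding that does not come from the source: `None` for an unknown relationship, a number for an exact one, and a `(lower, upper)` tuple for a single allowed range in which either bound may be `None`. Pulling an out-of-range distance toward the nearest violated bound is likewise an assumption; the text only states that the distance is revised when it falls outside the allowed range.

```python
def target_distance(d_ij, relationship):
    """Decide whether a pair needs revising and toward which distance.

    d_ij:         current distance between the two points on the map.
    relationship: None (unknown), a number (exact), or a (lower, upper)
                  tuple with optional None bounds (hypothetical encoding).
    Returns the target distance, or None if the pair should be left alone.
    """
    if relationship is None:
        # Unknown relationship: skip the pair and let its distance adapt
        # as a side effect of revising other pairs.
        return None
    if isinstance(relationship, tuple):
        lower, upper = relationship
        if lower is not None and d_ij < lower:
            return lower   # below the lower limit: revise toward it
        if upper is not None and d_ij > upper:
            return upper   # above the upper limit: revise toward it
        return None        # within the allowed range: no revision
    return float(relationship)  # exact relationship: always revise
```

A pairwise refinement loop could call this helper before each update and skip the pair whenever it returns `None`. Supporting a set of several disjoint ranges would need an extra step that picks the nearest allowed range, which is omitted here.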
MXPA/A/2000/010727A 1998-05-07 2000-10-31 System, method,and computer program product for representing proximity data in a multi-dimensional space MXPA00010727A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US09073845 1998-05-07

Publications (1)

Publication Number Publication Date
MXPA00010727A (en) 2001-09-07
