WO2001016880A2

WO2001016880A2 - Topographic map and methods and systems for data processing therewith

Info

Publication number: WO2001016880A2
Application number: PCT/BE2000/000099
Authority: WO
Inventors: Marc Van Hulle
Original assignee: Synes Nv
Priority date: 1999-08-30
Filing date: 2000-08-30
Publication date: 2001-03-08
Also published as: EP1222626A2; WO2001016880A3; AU6812200A

Abstract

An unsupervised competitive learning rule for equiprobabilistic topographic map formation, called the kernel-based Maximum Entropy learning Rule (kMER) is described for execution by a neural network as well as systems, especially distributed processing systems for carrying out the rule. Since kMER adapts not only the neuron weights but also the radii of the kernels centered at these weights, and since these radii are updated so that they model the local input density at convergence, these radii can be used directly, in variable kernel density estimation. The data density function at any neuron is assumed to be convex and a cluster of related data comprises one or more neurons. The data density function may have a single radius, e.g. a hypersphere. A processing engine and a method for developing a kernel-based topographic map which is then used in data model-based applications are also described. The receptive field of each kernel is disjunct from the others, i.e. overlapping. The engine may include a tool for self-organising and unsupervised learning, a monitoring tool for maximising the degree of topology achieved, for example, by using the overlapping kernels, and a tool for automatically adjusting the kernel widths to achieve equiprobabilism. Applications include variable kernel density estimation with the equiprobabilistic topographic maps, with density based cluster maps and with equiprobabilistic variable kernel-based regression. The receptive fields of the kernels may be convex, e.g. hyperspheroids or hyperspheres.

Description

TOPOGRAPHIC MAP AND METHODS AND SYSTEMS FOR DATA

PROCESSING THEREWITH

The present invention relates to systems and methods suitable for the analysis of multi- dimensional data sets from which it is desired to obtain relational information indicating attribute relationships in groups of data. Such information may be used for prediction, decision making, market analysis, fraud detection etc. The present invention relates to methods and systems, especially distributed processing systems.

Technical Background

Clustering is aimed at partitioning a given data set into subsets of "similar" data, ideally without using a priori knowledge about the properties or even the existence of these subsets. Many approaches have been suggested in the past and various optimality criteria developed in the statistics literature (Duda and Hart, 1973; for a recent overview, see Theodoridis and Koutroubas, 1998). Some rely on an estimation ofthe density function, others rely on a similarity criterion that needs to be optimized. Among the similarity-based clustering approaches are the popular k-means- (Krishnaiah and Kanal, 1982), the ISODATA-(Ball and Hall, 1967), and the fuzzy C- means clustering procedures (Dunn, 1974), which all assume that the shape of a cluster is hyperspherical (when the Euclidean distance metric is used), or hyperelliptical

(when the Mahalanobis distance metric is used).Evidently, real-world clusters may not comply with the particular shape assumed and, hence, clustering algorithms have been developed which do not make any (explicit or implicit) assumption about the shape of the clusters, such as the valley-seeking approach (Koontz and Fukunaga, 1972), the density gradient approach (Devijver and Kittler, 1982), the maximum entropy approach (Rose et al., 1990), and so on (for an overview, see Herbin et al., 1996). In the simplest case, first tht density function underlying the data is estimated, the high density regions located, their extent demarcated and identified with the clusters sought. This may be called "density-based" clustering. The clustering result can then be used for unsupervised classification purposes.

In the neural networks domain, similarity-based clustering is mostly pursued with the standard unsupervised competitive learning (UCL) rule, and the many rules derived from it: the converged neuron weights correspond to the cluster centers and the range spanned by each cluster corresponds to the neuron's receptive field (RF), in this case a Voronoi region (cf. nearest-neighbor rule). Rules like these are aimed at minimizing the mean squared error (MSE) between the neuron weights and the data points falling in the corresponding Voronoi regions. In fact, the UCL rule, when used in batch mode, has an intimate connection with the classic k-means clustering algorithm. This is also the case with the Self-Organizing Map (SOM) algorithm (Kohonen, 1982,1984,1995), when the neighborhood function has vanished (zero- order topology). In an attempt not to assume prior knowledge ofthe number of clusters when applying the SOM algorithm, graphical techniques have been devised (Kohonen, 1995, page 117), or distance-based metrics used for developing a series of SOM's, one for each cluster (Cheng, 1992). Another similarity-based clustering algorithm is the popular Adaptive Resonance Theory (ART) (Carpenter and Grossberg, 1992; Frank et al. 1998). The allocation of new neurons is performed when needed, depending on a "vigilance" parameter: when the current input is not sufficiently similar to any ofthe already allocated neuron weights, then a new neuron is recruited and its weight set equal to the current input. The meaning of '"sufficiently similar" depends on the user- defined vigilance parameter, the choice of which is often quite sensitive to input noise. Clustering has also been formalized in a more probabilistic framework, in an attempt not to make assumptions about the shape ofthe input distribution: the main principle is that each datum is assigned in probability to each cluster, a feature that is usually called 'fuzzy membership in clusters." This membership function definition has been adopted by Rose and co-workers (1990), in their optimal vector quantizer, and more recently by Graepel etal. (1997), in their soft topographic vector quantizer.

Density-based clustering is much less considered for unsupervised competitive learning algorithms, albeit that the SOM has been used for non-parametric density estimation purposes, but not for capturing the fine-structure ofthe input density (see Kohonen, 1995, p. 152) : the weight density yields an estimate ofthe input density, provided that there is a linear relationship between the two. Density function estimation should not be confused with probability distribution estimation or with (cumulative) distribution function estimation (also called repartition function estimation). However, contrary to what was originally assumed (Kohonen, 1984), the weight density is not a linear function ofthe input density (Ritter, 1991; Dersch and Tavan, 1995) and hence, the Voronoi regions will not be active with equal probability (no equiprobabilistic maps). Hence, even if a density estimate were developed for each class separately, based on their respective weight distributions, then the cluster boundaries one obtains by comparing the posterior class probabilities, are not going to correspond to an optimal "Bayesian" classification, unless equal class densities are assumed (Kohonen, 1995, p. 111). Since MSE minimization is not compatible with equiprobabilistic map formation in general (Van Hulle and Martinez, 1993; Ueda and

Nakano,1993), several heuristics have been devised in order to compensate for this discrepancy in competitive learning, e.g. by adding a "conscience" to the neuron's activation behavior. The competitive learning strategies that have been explored in conjunction with a "conscience"-based metric, for equiprobabilistic map formation, fall in two broad categories: 1) the activation probabilities ofthe neurons are monitored and used for adjusting the distance metric in the nearest-neighbor rule, and 2) the activation probabilities are monitored together with the local distortions, and the two are combined for adjusting the neurons' learning rates. Examples ofthe first category are Conscience Learning (CL; DeSieno, 1988), The Integer Markovian Artificial Neural Network (TInMANN, Van den Bout and Miller, 1989), and Frequency-Sensitive Competitive Learning (FSCL; Ahalt et al. 1990; Galanopoulos and Ahalt, 1996); an example ofthe second category is the learning scheme Bauer, Der and Herrmann (1996) (BDH algorithm) introduced for topographic map formation under the continuum limit. A still different approach to density estimation is provided by the popular (Gaussian) mixture model: the parameters ofthe model can be trained, in an unsupervised manner, so that they become the most likely ones for the given data set (maximum likelihood learning, Redner and Walker, 1984). For most cases, this achieved with the expectation-maximization (EM) algorithm (Dempster et al., 1977), or similar. Since a specific model is assumed for the component densities, which make up the mixture model, this approach belongs to the category of semi-parametric density estimation techniques (Bishop, 1995, p. 60). However, contrary to "conscience"-based competitive learning, it is not clear how this technique could be made compatible with topographic map formation. It is an object ofthe present invention to overcome some ofthe problems ofthe known systems, in particular to provide a more faithful representation ofthe input data which is suitable for non-parametric clustering. Summary of the present invention

An aspect ofthe present invention is an unsupervised competitive learning rule for equiprobabilistic topographic map formation, called the kernel-based Maximum

Entropy learning Rule (kMER) since it maximizes the information-theoretic entropy of the map's output as well as systems, especially distributed processing systems for carrying out the rule. Since kMER adapts not only the neuron weights but also the radii ofthe kernels centered at these weights, and since these radii are updated so that they model the local input density at convergence, these radii can be used directly, in variable kernel density estimation. The data density function at any neuron is assumed to be convex and a cluster of related data comprises one or more neurons. The data density function may have a single radius, e.g. a hypersphere.

Another aspect ofthe present invention is a processing engine and a method for developing a kernel-based topographic map which is then used in data model-based applications. The receptive field of each kernel is disjunct from the others, i.e. overlapping. The engine may include a tool for self-organising and unsupervised learning, a monitoring tool for maximising the degree of topology achieved, for example, by using the overlapping kernels, and a tool for automatically adjusting the kernel widths to achieve equiprobabilism. Applications include variable kernel density estimation with the equiprobabalistic topographic maps, with density based cluster maps and with equiprobabalistic variable kernel-based regression. The receptive fields ofthe kernels may be convex, e.g. hyperspheroids or hyperspheres but the present invention is not limited thereto.

Another aspect of the present invention is a processing engine and a method for two-step data mining of numerical data using a kernel-based topographic map. The map may be equiprobabalistic. The engine proceeds in two steps: in the first step the engine develops in an unsupervised manner a kernel-based and topographic map-based data model using numerical input data. In a second step the data model generated may be used for a variety of applications, such as clustering analysis, statistical feature extraction and model-based regression applications. A monitoring tool maximises the degree of topology preservation achieved during kernel-based topographic map formation.

The engines and methods described above may be local or distributed. The data model and the processing on the data model may be local or distributed. Another aspect ofthe present invention is density-based clustering and unsupervised classification for topographic maps as well as systems, especially distributed processing systems for carrying out the classification. By virtue ofthe topographic map learning rule, called the kernel-based Maximum Entropy learning Rule (kMER) in accordance with the present invention, all neurons of a neural network have an equal probability to be active (equiprobabilistic map) and, in addition, pilot density estimates are obtained that are compatible with the variable kernel density estimation method. The neurons receive their cluster labels by performing a suitable method, such as hill-climbing, on the density estimates which are located at the neuron weights only. The present invention also includes several methods for determining the cluster boundaries and the clustering may be used for the case where the cluster regions are used for unsupervised classification purposes.

Another aspect ofthe present invention is a modular data analysis system and method for the detection of clusters of related data, comprising processing engine running an unsupervised competitive learning algorithm to generate a kernel-based topographic map, wherein each kernel represents a data density value, the receptive field of any kernel being assumed to be convex and a cluster comprising one or more neurons. Following clustering, unsupervised classification of the clusters may be carried out. Topographic maps have been widely used for (supervised) classification purposes: the neurons ofthe converged maps are given class labels depending on their activation in response to samples drawn from each class. On the other hand, albeit that the presence of clusters in the input space are signaled by the presence of clusters in the neuron weight distribution, and classes can be identified with clusters (regions), topographic maps have been much less considered for unsupervised classification purposes, unless a priori information about the expected number of clusters is supplied. The present invention includes in one embodiment a method in which the neuron weights are labeled by performing hill-climbing on the density estimates located at the neuron weights. By virtue ofthe learning rule, called the kernel-based Maximum Entropy learning Rule (kMER), pilot density estimates can be readily obtained since kMER is aimed at producing an equiprobabilistic topographic map. Furthermore, several techniques with which class boundaries can be estimated from the neuron labels are included in the present invention. With kMER, a pilot density estimate can be readily obtained but, more importantly, it is expressed in terms ofthe kernel radii used in the VK method: as a result of this, the RF radii σ in kMER can be simply exchanged for the kernel radii in the VK method.

The present invention will now be described with reference to the following drawings.

Brief Description of the Drawings

Figure 1 shows a distributed network with which the present invention can be used. Figure 2 shows a further distributed network with which the present invention can be used.

Figure 3 shows: (A) Scatter plot of three times 300 samples drawn from three equally- probable Gaussians distributions (plusses, circles, and diamonds). (B) RF regions obtained after training a 24 x 24 lattice with kMER/?_r = 2 (lattice not shown).

(C) Grey scale representation ofthe non-parametric density model which corresponds to (B) (p_r = 2, p_s = 1).

(D) Idem but now for the optimal degree of kernel smoothness (p_r - 17.5, p_s = 1.5).

(E) Gray scale representation ofthe cluster "influence" regions found with the SKIZ technique (1 = 16) for (D) (see text).

(F) Cluster boundaries which correspond to (E), shown on top ofthe scatter plot (A). (G) Gray scale representation ofthe cluster "influence" regions obtained when the input space is labeled according to the label ofthe closest neuron (minimum Euclidean distance) after the neurons are labeled with a hill-climbing technique (see text).

(H) Cluster boundaries which correspond to (G), shown on top ofthe original sample distribution.

Figure 4 shows: Receptive field (RF) definition used in kMER. The arrow indicates the update ofthe RF center w,., given the present input v (not to scale); the dashed circles indicate the updated RF regions S¹,. and S_j. For clarity's sake, the range of the neighborhood function is assumed to have vanished so that w. is not updated. The shaded area indicates the overlap between S,- and S_j before the update.

Figure 5 shows an optimized kMER algorithm, batch versionin accordance with an embodiment of the present invention. . Figure 6 shows: Figs. 6A and B show Three Gaussians example. Dependency ofthe number of clusters found on the free parameters L ofthe SKIZ method (A) and k of the hill-climbing method (B). The presence of large plateaus are indicative ofthe actual number of clusters in the input distribution.

Fig. 6 C shows Iris Plants database example. Number of clusters as a function of the free parameter k.

Figure 7 shows a computationally-efficient tree-based hill-climbing algorithm in accordance with an embodiment ofthe present invention. Conventions: ID = the neuron's lattice number; Top = lattice number ofthe neuron which has the highest density estimate of all k + 1 neurons in the current hypersphere; Ultimate Top = lattice number ofthe neuron which represents a local maximum in the input density.

Figure 8 shows a scatter plot of a two-dimensional "curved" distribution.

Figure 9 shows monitoring the overlap variability (OV) during kMER learning on the sample distribution shown in Fig. 1 in accordance with an embodiment ofthe present invention. A. First simulation run. Shown are OF (thick continuous line), the neighborhood range <x_Λ (A-range; thin continuous line), and the topographic product

(TP) (thick dashed line). The thin dashed line indicates the optimal TP- value (zero).

For the sake of exposition, the neighborhood range is divided by 12.5 and the topographic product is multiplied by 10. B. Second simulation run; t_m ² _ax = 24 epochs

C. Fifth and "best" run: t_m ⁵ _ax = 124 epochs

Figure 10 shows: (A, B) Lattices obtained with kMER, without and with monitoring, using the sample distribution shown in Fig. 1 (M= 500 training samples). The dashed line indicates the theoretical principal curve. It should be noted that the neurons' RF circles are also indicated. (C) Lattice obtained with monitoring, using M- 4000 training samples.

Fig. 11 shows system components at Level 1 during training in accordance with an embodiment ofthe present invention.

Figure 12 shows system components at Level 1 during application and at Level 2 during training. Shown is the th Level 2 subsystem: it is trained on the spectra that receive the zth cluster label at Level 1, F¹ (1, i, s). The Level 2 subsystem's components during application are similar to those shown for Level 1.

Figure 13 :shows a Plot of M S E[p_Λ , p_p ^* j as a function of p_s at Level 1, for all

621 music spectra (A), and Level 2 for the 69 "Ab" labeled spectra (B), both for k_PC = 16 PCs.

Figure 14 shows a plot of the number of clusters found as a function ofthe number of nearest-neighbor neurons used in the hill-climbing procedure at Level 1 (A) and Level 2 for the "Ab"-labeled spectra (both for k_PC = 16).

Figure 15 shows a cluster map obtained at Level 1 (k_PC = 16). The neurons of the 20 x 20 lattice that belong to the same cluster are labeled with the same gray scale.

Figure 16 shows a thresholded and normalized spectrogram of eight notes played on three musical instruments consecutively with uniform white noise added

(SNR = 12.13 dB). In total 621 music spectra are computed (see text). The labels these spectra receive at Level 1 are shown on top ofthe spectrogram using gray scales defined at the right (k_PC = 16). The labels "F" to "Eb" represent the notes played by the musical instruments; the label "Tr" represents the transient and transitional parts of the music signal.

Figure 17 shows a cluster map obtained at Level 2 for the spectra labeled "Ab" at Level 1 (for k_PC = 16 PCs). The lattice is sized 7 x 7 neurons.

Figure 18 shows a spectrogram of Fig. 6, but for which the "Ab" labeled music spectra are now further processed and labeled at Level 2 (k_PC = 16). It should be noted that the Level 2 labels correspond to the instruments that play the note "Ab"; the remaining spectra are labeled as "none". Other conventions are as in Fig. 6.

Figure 19 shows a clustering tree. The instruments are indicated with subscripts; the "F" note with the higher pitch is labeled as F. The leaf nodes are highlighted with an ellipse drawn around the labels. Figure 20 shows embodiments ofthe present invention demonstrating Vertical modularity. (A) The data model and the regression model are developed separately and operate in sequence. (B) Several regression models can be grafted onto a common data model.

Figure 21 shows Kernel-based maximum entropy learning and the data model in accordance with embodiments ofthe present invention. (A) Definition of the type of formal neurons used. Shown are the activation regions 5, and 5, of neurons i and (large circles) of a lattice A (not shown). The shaded area indicates the intersection between regions 5, and S_j (overlap). The receptive fields centers w and w,, are indicated with small open dots. (B) Neuron i has a localized receptive field kernel K(x - w„ σ;), centered at w, in input space Vςz ^<iR^d. The kernel radius corresponds with the radius of activation region S_{. The present input x e V is indicated by the black dot and falls outside S,.

Figure 22 shows an optimized kMER algorithm, batch version, adapted for the case where the values of some input vector components are missing (marked as "don't cares" or x).

Figure 23 shows an algorithm for determining missing input vector components in accordance with an embodiment ofthe present invention.

Figure 24 shows a kernel-based regression of a function

x = [x ..., x_d] using a two-step procedure: the data model and the regression model in accordance with an embodiment ofthe present invention. The data model comprises a topographic map of N neurons of which the activation regions are determined with kMER (location w, and radii σj). The regression model consists of N Gaussian kernels, with the same location and (scaled) radii as the activation regions in the data model (cf. the icons next to g, and g_N). The outputs of these kernels, g_ls ..., g_N, are weighted by the parameters W , ..., W_N so as to produce the regression model's output O. The W_x, ..., W_N are trained with the delta rule in such a manner that the output O represents an estimate y ofthe function to be regressed, y =βx) (cf. the icon on the top).

Figure 25 shows embodiments of the present invention having horizontal modularity. Partitioning ofthe data set into subsets ofthe input space data set (A), or into subspace data sets (B), and a mixture of the two (C).

Figure 26 shows a further embodiment ofthe present invention having horizontal modularity. Case of several subsets and one regression module.

Figure 27 shows an embodiment ofthe present invention having multiple subsets and multiple regression modules. The latter are arranged at two levels.

Figure 28 shows an embodiment ofthe present invention having multiple subsets and regression modules and a single decision module.

Figure 29 shows an embodiment ofthe present invention having multiple subspaces and multiple regression modules at two levels.

Figure 30 shows an embodiment ofthe present invention involving a mixed case, heuristic strategy: multiple subsets of complete and incomplete input vectors. The presence of incomplete data vectors is ignored.

Figure 31 shows a further embodiment ofthe present invention involving a mixed case, correct strategy: the presence of complete data vectors is ignored.

Description ofthe illustrative embodiments

The present invention will be described with reference to certain embodiments and with reference to certain examples and to certain drawings but the present invention is not limited thereto but only by the claims.

1. Distributed processing systems

Fig. 1 shows a basic network 10 in accordance with an embodiment ofthe present invention. It comprises two major subsystems which may be called data processing nodes 2 and data intelligence nodes 3, 4. A data processing node 2 comprises one or more microprocessors that have access to various amounts of data which may include very large amounts of data. The microprocessor(s) may be comprised in any suitable processing engine which has access to peripheral devices such as memory disks or tapes, printers, modems, visual display units or other processors. Such a processing engine may be a workstation, a personal computer, a main frame computer, for example or a program running on such devices. The data may be available locally in a data store 1 or may be accessible over a network connection such as via the Internet, a company Intranet, a microwave link, a LAN or WAN, etc. The data may be structured such as is usually available in a data warehouse or may be "as is", that is unstructured provided it is stored in an electronically accessible form, e.g. stored on hard discs in a digital or analogue format. The processing engine is provided with software programs for running an algorithm in accordance with the present invention to develop a topographic representation ofthe input data. Particularly preferred is a competitive learning algorithm for producing a topographic equiprobabalistic density representation (or map) of the input data having a linear mapping of real input data densities to the density represented by the topographic representation (the algorithm is described in more detail below). This topographic map may be described as a graph 7 of data models that represent the data processed. Each topographic map is a data structure in accordance with the present invention. The node 2 can use a persistent medium to store the graph of data models 7, or it can send it directly to other nodes in the network, e.g. nodes 3 and 4. Optionally, the data processing node 2 can be distributed over a network, for example a LAN or WAN. A data processing node 2 can run on most general purpose computer platforms, containing one or more processors, memory and (optional) physical storage. The supported computer platforms include (but are not limited to) PC's, UNIX servers and mainframes. Data processing nodes 2 can be interconnected with most existing and future communication links, including LAN and WAN systems. A data processing node 2 can also have a persistent storage system capable of storing all the information (data) that is been used to generate the graph 7 of data models that are stored or maintained by this node or a sub-graph of the graph 7. The graph 7 of data models can be saved regularly to the persistent medium for the analysis and monitoring of changes and the evolutions in the data (e.g. trend detection).

A data processing node 2 can also run other or additional algorithms that can be used to process data or pre-process or prepare data ready for analysis (e.g. to capture time dynamics in data sets). The data sets that are used can be both structured data (e.g. databases) or unstructured data (e.g. music samples, samples of pictures, ...). The only limitations is that the data should be offered in a format that can be processed by the chosen computer platform as described above.

A data processing node 2 can provide a data intelligence node 3, 4 with an individual data model, a sub-graph or the complete graph 7 of data models. Normally, only the data models are returned not the data itself. However, it is (optionally) possible to return also the data that is used to generate the data models.

The graph of data models that is build up and maintained by a data processing node 2 contains: 1) A number of datanodes that contain the data models that describe at least a part of the data set. 2) A number of directed links between the datanodes. Note: datanodes should not be confused with nodes of a distributed system such as 2,

3, 4 of Fig. 1. The word "Datanode" refers to a neural network which models at least a portion ofthe input data to be processed. A datanode may be a software node. A datanode is a neural network which models a part of the topographic equiprobabilistic density map, for example, a part generated by clustering after application of the novel competitive learning algorithm in accordance with the present invention.

The datanodes and directed links are preferably organized in a hierarchical system, that is in a tree, in which topographic maps from two or more levels overlap.

A tree of datanodes is shown schematically in Fig. 19 and its generation described with reference thereto.

There are a few special types of datanodes:

1) The top level datanode contains the data model from the complete data set and from this node, all the other datanodes in the tree can be reached from this top level via the directed links. It has only originating directed links - it has no terminating single directed link from another datanode.

2) A leaf datanode is a datanode that has no originating, single directed links to other datanodes. In a system that has gone through the complete initial training phase in accordance with the competitive learning algorithm in accordance with the present invention, this means that the data in this datanode has a substantially homogeneous distribution and that it is not relevant to split the data further. It can be said that a leaf node cannot be resolved into clusters which can be separated, that is it is "homogeneous", or any discernible clusters are of such minor data density difference that this difference lies below a threshold level.

All the other (intermediate) datanodes in the tree between the top datanode and a leaf datanode describe: a) A subset ofthe complete dataset with common characteristics, but that can be refined further (i.e. without a uniform distribution). b) The data model ofthe data described by this datanode.

During an initial training phase the following steps are taken which results in the formation ofthe datanode tree from top-down: a) The top-level model is generated (i.e. generation ofthe top datanode). b) This top-level model is divided in several parts in accordance with rules described in detail below. The division is not a simple tiling of the top level topographic map. c) Each of these parts describe a subset ofthe complete dataset and a data model is generated for this subset, i.e. either intermediate or leaf datanodes or a mixture of the two is generated. d) The data described by an intermediate datanode can be divided further into other intermediate datanodes and/or leaf datanodes. If it is not possible to divide an intermediate datanode further, this intermediate datanode is a leaf datanode.

After the initial run, the graph produced is a tree, from top-datanode to leaf datanode. One ofthe additional advantages ofthe data processing nodes 2 in accordance with the present invention is the capability to distribute functions over the network 10.

There are several advantages in distributing the data processing nodes across the network 10:

a) Higher performance. Although the algorithms in accordance with the present invention are capable of running on parallel computer systems as they are almost linearly scaleable, additional computing power may be advantageously provided by another computer. b) Privacy: for example, the European privacy rules for personal information limit the transfer of personal information that is collected between companies. For data mining purposes, the characteristics of individuals are not usually important but the higher-level information or attributes of these individuals. c) Integration or cost reasons. It is sometimes not feasible to install a complete data store, e.g. a data warehouse in one location.

The system in accordance with the present invention can fulfill all these requirements through a number of different distribution schemes. Several combinations of different distributed network schemes are possible of which two are described below with reference to Fig. 2 which shows a more complex network 20 than that of Fig. 1. In a system 20 designed for master/slave processing of sub-graphs with common data source in accordance with an embodiment ofthe present invention, new data or data that is used to retrain the data models is provided to a master data processing node such as 15. Optionally, this node 15 retrains the main data model (top datanode in the graph of data models). In addition, it may determine to which intermediate datanode(s) this data point belongs and send it to the processor(s) responsible for this (these) intermediate datanode(s). Such an intermediate datanode can belong to another data processing node, somewhere else in the network 20, called a slave data processing node 12 - 14.

As the processing requirements increase, it is possible to add in additional slave data processing nodes such as 11 to the network 20. A slave processing node can also act as a master processing node for another slave processing node depending on the network configuration, e.g. in Fig. 2, the nodes 11, 12 are slaves ofthe master data processing node 13 on top ofthe data, but this node 13 is also in a slave relationship to the master node 15.

The data modeling methods ofthe present invention may be used with parallel processing. For example, a distributing master processing node 16 collects all the initial data and carries out top level clustering. It determines a plurality of independent clusters, e.g. two. It decides to process the first ofthe clusters and send the second cluster with its associated data to the processing node 15 for parallel processing of the data. Data updates will be received by the distributing master node 16 and the master node 16 determines if the data is to be processed by itself or by the alternative processing engine in node 15. To do this, the master node 16 keeps a mapping between the records and the relevant processing engine.

In an alternative embodiment, the tree structure ofthe graph 7 of data models may be at least partly mapped to a hierarchical master/slave network 20 as described above. A specific embodiment of the use of a master/slave network will now be described. Initially, the complete graph 7 of data models is set up on every processing node. After this, the individual processing nodes assume their allotted task of processing their specific part ofthe tree. For this purpose they each receive the relevant data subset from the master processing node. From the other nodes in the system, each processing node receives the weighting factor updates for the neurons other than the ones the node processes. With the updated factors introduced into the graph 7, each node can process new data while accurately with the influence of other intermediate datanodes and leaf nodes being modeled correctly. As only weighting factors are shipped around the network and not data, the amount of signal traffic is much reduced. This aspect of the present invention is a direct consequence ofthe linear density mapping of the topographic map generated by the competitive learning algorithm in accordance with the present invention. Only if the data which is associated with a cluster or clusters can be separated cleanly and accurately from the total data is it possible to part process safely the graph 7 in a distributed way. If the data density estimation is only approximate, the labeling of data to be associated with a specific cluster/neurons is inaccurate. This means that a significant proportion of wrong data is shipped to each slave node (that is data which should not be contributing in the processing ofthe datanode processed on this processing node but should be contributing on another node). This falsifies the results at each node. As only the weighting factors are shipped around the system and these may be sensitive to the incorrect data, the whole graph 7 may become an incorrect decision making tool within a short period of time. In accordance with a further embodiment ofthe present invention slave/slave processing of a complete graph without a common data source but with a centralization point (Master node) is provided. The total graph 7 of data models can be generated by several data processing nodes in a slave/slave configuration that have individual access to different (separated) data sets. In one embodiment the different slave data processing units process their own data and send the updates (revised neural weighting factors) regularly to a central data processing node, that unifies the different updates into a single consistent data model. There are two types of distribution of data:

1) Horizontally distributed: the different datasets describe the same characteristics (i.e. columns in database table), but different subjects (i.e. different records in this table).

An example: the customer database of a multinational company will be distributed over several databases in the different countries, each database describing the same attributes but only ofthe local customers. 2) Vertically distributed: the different datasets describe different characteristics ofthe same (group of) subjects

An example: the salary database, the human resources database, the restaurant database all for the same employees of one company. A combination of horizontal and vertical distributed data is also possible. The way that a distributed network 20 as shown in Fig. 2 may be operated with vertically or horizontally distributed data form separate embodiments ofthe present invention. With horizontal distribution, each local processing node processes its own data. If it is necessary to query the whole data model, it is possible to devise queries which are sent around the network and collect the answers from each processing node and bring it to the questioning node, e.g. the master node. Schemes in accordance with the present invention which involve local processing of distributed data and local generation ofthe data model may be applied for cost or time-to-install reasons, i.e. to eliminate the need for collecting together a centralized data warehouse or for privacy reasons, if, for instance, the different data sets belong to different companies. This exemplifies an aspect ofthe present invention which is to leave data where it is and to only ship queries, answers or at most abstracted versions ofthe data (data models) around the network. For instance, an alternative embodiment involves generating a data model locally from the local data at one node and retrieving only the data models from other processing nodes. This would mean, in the example above that the data models from various countries would be brought together at one processing node. This would allow querying of all these models on a single computing platform. This may seem wasteful in data transfer in comparison to only sending the query around the network and collecting the answers. However, there may be advantages in doing so - e.g. the answers and the queries may be kept more private or the query is ofthe type that all the data must be present at one location. Anyway, it is still less data transfer than collecting and updating a data warehouse. It is also possible to separate some other data processing functions that are not really related to the generation and maintenance ofthe graph of data models. For example, a data processing node 16 can serve only as a device that collects the graph of data models from the other data processing nodes, and updates the data analysis nodes if required. Data intelligence nodes provide the additional logic (called application components) needed to run the applications such as the following. The data analysis system in accordance with embodiments ofthe present invention can be used in the following non-limiting list of applications: direct marketing, quality assurance, predictions, data analysis, fraud detection, optimizations, behavior analysis, decision support systems, trend detection and analysis, intelligent data filtering, intelligent splitting of data sets, data mining, outlier detection. Application components relate to programs run to analyze the graph 7 of data models or part of it to obtain a useful, concrete and tangible result. It is an aspect ofthe present invention that the data model generation is kept separate from the applications. When the application is partly mixed into the data model generation, the data model becomes restricted in its use. One aspect of the present invention is that the data model is as neutral as possible, i.e. it is not determined by the specifics ofthe data nor by the specifics ofthe application. This allows the data models in accordance with the present invention to have a longer useful life. New queries and new problems can be investigated on the data model. Due to the linear mapping ofthe real data densities into the topographic map, the data model is an accurate representation ofthe data despite being highly compressed in size. There are basically two different data intelligence nodes (Fig. 2):

1) The Data intelligence node 17, 18 for machine-machine interaction is designed to offer data mining intelligence to other applications.

2) The Data intelligence node 4 for exploratory data mining applications, that allows a human analyst 5 to analyze data interactively.

A data Intelligence node 17, 18 for machine-machine interactions is a node containing one or more processors that has access to at least a sub-graph ofthe graph 7 of data models as generated by a data processing node. The node 17, 18 contains the predefined logic to execute a specific application as needed by another computer system. For this purpose, this node 17, 18 offers the possibility to answer queries relating to these specific applications through a number of standardized machine-to- machine interfaces, such as a database interface (ODBC/JDBC), middle-ware interfaces (CORBA, COM) and common exchange formats (text files, XML streams, HTML, SGML).

Some ofthe applications that can be offered through this system include:

• Prediction: Predict how long this customer will be loyal • Decision support systems: Should this customer be accepted?

• Decision support systems: What is the maximal loan that this customer should be allowed

• Intelligent data filtering: Is this picture significant enough to be shown to a human analyst?

• Direct marketing: which advertisement should be shown/sent to this user.

• Quality assurance: does the answer/product/... that is offered to the customer, contain errors based on experience from the past?

• Fraud detection: Should this sales order entered through the internet be accepted, or is additional verification necessary?

• Intelligent splitting of data sets: save a (huge) amount of data in such a way that data with the same characteristics is saved together, and that these groups of data can be distributed over the network.

• Optimization: what is the chance that this data entry will be requested again, so that it can be stored in a faster accessible way.

• Data analysis: return the most similar data entries.

This node 17, 18 can be connected to a data processing server 16 that serves at least a sub-graph ofthe graph 7 of data models using any LAN or WAN connection. A constant connection is not required, a data intelligence node with a persistent storage system can store the (graph of) data models locally and an application component can be made available that can detect if the data model has to be resynchronized. A node 17, 18 can run on the same platform as a data processing node, it can run as a part of another application or it can run on small handheld devices (small computers, mobile phones). It is also possible to combine a data processing node and a data intelligence node on a single (physical) machine.

A data Intelligence node 4 for exploratory data mining (machine-man interface) is a node containing one or more processors that has access to a at least a sub-graph of the graph 7 of data models, analogue to a data intelligence node 17 18 for machine- machine interface, but this node 4 also requires a visualization device that allows the user to browse in the graph 7 of data models and the results provided by the data models. The user can analyze the data and run an application component for selected data (e.g. any ofthe application described above). Specific application components can help the analyst to detect specific behavior patterns.

Some typical applications that are offered though this interface are:

• Data mining: what are the typical customer profiles, how many types of customers are there in the database, etc .... ?

• Outlier detection: what are the exceptional customers that do not fit in a profile?

• Fraud detection: detect exceptional behavior that can be an indication of fraud.

• Data mining: detect typical behavior patterns.

• Trend detection and analysis: analyze how the evolution in the graphs of data models that are stored regularly to detect and analyze trends and evolutions.

In addition to these typical applications, it is possible to run most ofthe applications described in the previous type of data intelligence node in an interactive mode. In accordance with the present invention the data to be processed is first prepared (if this necessary) and then subjected to the novel competitive unsupervised competitive learning rules (kMER) in accordance as part of data density-based clustering. Then a non-parametric density model is built up from the neuron weights and the radii obtained by kMER. The identification ofthe clusters in the density estimate and the regions spanned by them, i.e. their "influence zones" may be performed by any of a number of clustering techniques all of which represent individual embodiments ofthe present invention. Below two are described: the SKIZ technique (Herbin et al., 1996; Skeleton by Influence Zones, see Serra, 1982) and the preferred embodiment (best mode) ofthe present invention using hill- climbing. In this novel clustering procedure for kMER, first the RF regions are clustered by applying a hill-climbing technique on the density estimates located at the neuron weights, and then the cluster regions are demarcated.

The clustering is done to reveal occult relationships within the data, i.e. the clustering is done without assuming knowledge about the number of clusters or their shapes beforehand. The cluster regions are then used for unsupervised classification purposes.

In the following aspects ofthe present invention will be explained in detail, namely:

• the basic unsupervised competitive learning algorithm. • non-parametric density estimation to form a non-parametric model ofthe input density including smoothing, and in particular an automatic smoothing method.

• cluster identification by two separate clustering methods of which the second - hill- climbing - is considered the best mode of carrying out the invention. • unsupervised classification to determine which data belongs to which cluster

• monitoring to achieve topology preservation

• integration of all these into a system which is suitable for distributed processing,

• and examples of regression using the system.

An example is given of the complete procedure of obtaining the datanode tree (Fig. 19) from an unstructured data signal (music). The skilled person will appreciate that the systems and methods can also be applied to the analysis of structured data such as databases.

2. Density-based clustering with kMER and SKIZ Clustering in accordance with an embodiment ofthe present invention may be performed starting from either structured or unstructured data. Firstly, a topographic map is developed with kMER for a given set of M data samples. Secondly, a non- parametric density model will be built from the neuron weights and their radii obtained with kMER at convergence. Third, the SKIZ method will be used for identifying the number of clusters in the density estimate and their "influence zones".

For the sake of explanation, consider the clustering example used by Herbin and co-workers (Herbin et al, 1996): M= 900 samples are drawn from three equally- probable Gaussians centered at (-0.4, -0.3), (0.4, -0.3), and (0., 0.3) in the unit square [-1, l]², with the standard deviations all equal to 0.2 (Fig. 3 A).

2.1. Stage 1: kMER

Consider a lattice A with a regular and fixed topology and with arbitrary dimensionality d_A in -dimensional space Vςz 5R^d. To each neuron i a hyperspherical RF region S, is associated, with center w, and radius σ_{ (Fig. 4). Hence, the receptive fields have been truncated to receptive field regions. If the input v falls in 5„ then σ_{ will be decreased and w, shifted towards v, else σ_t will be increased. In order to formalize these events, with each S, is associated a code membership function:

Since the RF regions usually overlap (Fig. 4), a fuzzy code membership function is defined (Zadeh, 1965):

^Ξ'^{(v) = !}' Λ ^» *^{i e A> (2)}

so that 0 < Ξ,(v) < 1 and Σ, Ξ,(v) = 1. Depending on the neural activation state, the weights w, and radii σ, are adapted by using two learning rules which together form kMER. The mathematical details, including a proof of convergence are given elsewhere (Van Hulle, 1998). The RF centers w, are updated proportional to Ξ, and in the general direction of v (Fig. 4):

Δw, = r7∑A(i,J,σ_A)Ξ_j (v) Sgn(\ - , ), V/ e A, (3)

with Sgn(.) the sign function taken componentwise, σ_A the neighborhood range (in lattice space coordinates) ofthe neighborhood function A (cf. position ofthe dashed circle S_t in Fig. 4). By virtue of Ξ„ there is also a competitive element in the learning process: inputs that are shared by different neurons (shaded region in Fig. 4) will lead to smaller weight updates. As a result, the RF centers will tend to be "pulled" apart by the unbalanced weight update strengths.

The kernel radii σ, are updated in such a ways that, at convergence, the

probability for neuron to be active _P(l, (v) ≠ o) = — , V , with p a scale factor:

N

&σ, Vi e , (4)

pN in "on line" mode, with p_r = — — (Fig. 2). Hence, the purpose of kMER is to

N - p generate, at convergence, an equiprobabilistic topographic map or, in other words, a map for which the neurons maximize the (unconditional) information-theoretic

entropy (whence the rule's name). This implies that, since _P(I, (v) = l) = — at

convergence, also: (w, ) = — ^-— (5) N Vol(σ_t) can be obtained, where p(.) is a density estimate, which can be located at the RF center, and Vol(.) is the volume ofthe hypersphere, with radius σ_t, of neuron z's RF region. This capacity ofthe σj's to generate an initial ("pilot") estimate ofthe probability density p(\) will be explored in connection to variable kernel density estimation. It should be noted that eq. (5) is reminiscent ofthe Ath-nearest-neighbor method (Silverman, 1992), with & the number of samples, out of a fixed set of Min total, that activate each one of M regions used for obtaining the density estimate.

In "batch" mode, given a training set of M of M input samples, the RF centers are updated as follows:

Δw, = 7 ∑ ∑Λ( _>7, σ_Λ) Ξ ) 5gτι(v^ - _I ^.), Vi, (6)

and the radii a:

The optimized algorithm is given in Fig. 5. The time-complexity is 0(NMd). Returning to the 3 Gaussians example and a N= 24 x 24 planar lattice is trained until 10,000 epochs have elapsed; for the learning rate, η- 0.001 is taken. The N weights are initialized by sampling a uniform grid in the unit square [-1, l]², and the radii are initialized by sampling the uniform distribution [0, 0.1]. A Gaussian neigborhood function A is used and its range is decreased as follows:

with t the present time step, t_max the maximum number of time steps, and σ_A0 the range spanned by the neighborhood function at t = 0; with t_max - 10,000 and σ_κo = 12. The RF regions obtained for p = 2 are shown in Fig. 3 B.

2.2. Stage 2: Νon-parametric density estimation

Once kMER learning has terminated, the converged lattice can be used for building a non-parametric model ofthe input density p( ). Two approaches can be adopted in practice. First, one can determine the winning frequencies of all neurons for a given data set and assume fixed, equally sized volumes for the corresponding RF regions (Voronoi's or hyperspheres as in kMER's case). This will lead to the Parzen window technique which allocates fixed-radii kernels to obtain the density estimate p

(fixed kernel density estimation):

with K a fixed volume, fixed radius kernel (here, a Gaussian) with center w, and Z a normalizing factor. The parameter/^ controls the degree of smoothness ofthe resulting density estimate and has to be selected (see below).

Second, the RF region volumes can be determined in such a manner that they yield equal winning frequencies (i.e. equiprobabilism). This is basically what kMER does and it leads to variable kernel density estimation. Since the lattice generated by kMER not only consists ofthe neuron weights but also the radii ofthe neurons' RF regions, with the radii adapted to the local input density, the method can go beyond the Parzen window technique and cast density estimation into a format which is similar to the one used for variable kernel density estimation (VK) (Silverman, 1992) :

when Gaussian kernels are used, with p_s a factor with which the radii σ_t can be scaled,

[K(.,.)dv _! and Z, a proper normalizing factor so that — = — . In the case ofthe 3

Z, N

Gaussians example, the variable kernel density estimate, which corresponds to Fig. 3B, is shown in Fig. 3C. It can be easily verified that both p_s and p_r control the degree of smoothness ofthe resulting non-parametric density estimate and, in the next subsection, a way to choose them in an optimal manner is disclosed.

2.2.1. Automatic choice of smoothing parameter Fixed kernel estimate

In Silverman (1992) a range of techniques are reviewed with which the "optimal" degree of smoothing can be determined. The most widely-used technique, and which allows for a completely automatic procedure, is the least-squares cross- validation method. It can be applied to both fixed and variable kernel density estimation, at least in principle. The idea is to choose p_s in such a way that the integrated squared error:

is minimized. This is equivalent to minimizing the score function:

Mo( _s) = lp²( ) dv -^∑p._l (yv₁), (12)

with jσ_, the density estimate constructed from all data points except w, (leave-one-out

method). The integral can be approximated as — — ∑ Λ^"(2) (w, - ) , with h the

kernel bandwith and K the convolution ofthe kernel with itself. When the kernel is Gaussian, and when fixed kernel density estimation is performed, it can be shown that, asymptotically, whenN— » oo, the theoretically best choice of p_s is obtained (Stone, 1984). However, despite of its strong attractiveness, cross-validation may yield poor results. When a nearest-neighbor classification ofthe input samples is performed, the input samples become discretized and, even though the lattice may be fine-grained, cross-validation may choose the degenerate smoothness value p_s = 0.

Variable kernel estimate

In the case of variable kernel density estimation, as done for kMER, the score function eq. (11) cannot be simplified. In addition, the use of a Gaussian becomes rather inappropriate here because ofthe computationally expensive evaluation ofthe exponential in the Gaussian kernel definition. For these reasons, a different strategy is derived.

For the sake of exposition, assume first that p_r is kept fixed and that we only p_s is optimized. A fixed kernel estimate is introduced which is obtained by allocating kernels at the w,'s, with their radii equal to the average kernel radius obtained with kMER:

with σ* = < σ, > _A and Z the usual normalizing factor. The following MSE metric is then defined:

so that the "optimal" /?_s-value (with respect to the fixed kernel case) then becomes:

p_sopt = aτg minM S E (i₅)

This metric is derived from the following heuristic. In the vicinity of the optimal degree of smoothing ofthe variable kernel estimate, the MSE performance ofthe fixed kernel estimate can be considered as an upper bound for the variable kernel performance since in the former case, it is assumed that all radii are equal and, thus, that the distribution is locally uniform. Hence, by minimizing the MSE for the fixed kernel estimate, the maximal MSE error made for the variable kernel estimate will be lower. Hence, in other words, a least-squares cross-validation is performed by constructing fixed and variable kernel density estimates, and by minimizing the discrepancy between the two, i.e.

p * , p_p ), thus with respect to _s, for obtaining the optimal value, p_{s opt}. The whole procedure is then repeated along the p- axis to obtain the optimal mix (p_ropt , p_{s op} - The above method is an embodiment ofthe present invention.

Finally, on turning to the 3 Gaussians example, the variable kernel density obtained, when optimizing for p_s and _r in steps of 0.1 and 2.5, respectively, is shown in Fig. 3D.

2.3. Stage 3: SKIZ

Once the density estimate is determined, the Skeleton by Influence Zones (SKIZ) technique (Serra, 1982) is applied in order to identify the number of clusters and the regions spanned by them. The technique proceeds as follows. First, each input dimension is discretized into b bins and L levels /,, ..., l_L are defined between 0 and max^ p ) for thresholding p . These levels are used for identifying the clusters in the following manner. Let i?, be the set of points in -space with a density level higher than /,. At the lowest threshold level, /, = 0, only one connected region of points appears. If at the next level, l₂, the previous connected region appears to be split, the initial cluster is split into a new set of clusters, and so on. For each connected region a distance function is computed. The "influence zone" of a connected region contains all points for which the associated distance function is smaller than that ofthe other regions (e.g. using the city-block distance metric). The border between the "influence zones" is marked. If none of the existing connected regions is split at the next threshold level, then these regions and their borders are kept. Finally, when the highest threshold level has been processed, the connected regions are identified with the clusters sought. This whole procedure is then repeated for different values of : the intrinsic dimensionality of the clustering problem is then identified with the longest plateau in the dependency ofthe number of clusters on L. When applying the foregoing, using a 50 x 50 bins SKIZ map, on the density estimate ofthe 3 Gaussians example (Fig. 3D), the dependency ofthe number of clusters on L is obtained and shown in Fig. 6A. It can be seen that there are three clusters, since the plot shows a quite distinct plateau. L = 16 is chosen. The corresponding "influence" regions are shown in Fig. 3E (gray shaded areas), and the cluster boundaries in Fig. 3F. The clustering performance can now be tested in a quantitative manner by regarding the three Gaussians as (unknown) class densities, and by performing unsupervised classification ofthe samples drawn from them.

3. Density-based clustering with kMER and hill-climbing A serious disadvantage ofthe SKIZ technique is its computational complexity.

In order to determine the influence regions, the input space needs to be discretized. Hence, if each input dimension is partitioned into, e.g., b bins, then b^d will need to be scanned and processed when the input space V is (/-dimensional. Furthermore, the range spanned by the input distribution p( ) must be known in order for the input discretization to be effective. For these reasons, a more computationally-efficient procedure will now be disclosed, which is an embodiment ofthe present invention. Assume that the input density estimates are disposed at the neuron weights, Ψ_P ( ₍) I ^{z =} v,N) • For each neuron i, a hypersphere at w, is allocated, the radius of which is chosen in such a way that it contains the k nearest neuron weights. For this set of k + 1 neurons in the hypersphere, the neuron is sought with the highest density estimate and its lattice number is noted, e.g. neuron . It can be said that neuron i "points" to neurony, which will be called the "top" neuron. This procedure is repeated for each neuron and those top neurons are sought which point towards themselves. The density estimates of these neurons then represent local maxima in the input density, and hence, these neurons belong to separate clusters. These will be called "ultimate top" neurons. Cluster labels are assigned to these neurons and the "points to" relations of the other neurons are scanned in order to determine to which ultimate top neurons they belong, and hence, which cluster labels they should receive. Hence, in this way, the number of clusters present in the density estimate has been determined, for the given choice of the hypersphere parameter k, and the neurons have been labeled accordingly. A computationally-efficient version ofthe hill-climbing algorithm is given in Fig. 7. The present invention is not limited to hyperspheres. It is necessary to decide how to classify the input samples v. Indeed, if it is required to use the clusters for (unsupervised) classification purposes, it has to be decided how a given input sample v should receive its class label. Various sample labeling methods can be conceived among which the following possibilities are considered. For example, the neuron can be considered which is the closest, in Euclidean terms, to a given input sample, and that sample can be classified into the class to which the closest neuron belongs. This may be called the minimum Euclidean distance procedure (minEuC). Another possibility, but which needs an additional parameter, is to apply nearest-neighbor classification (NNC): a hypersphere is defined centered at the input sample and with a radius chosen in such a way that it contains k_m nearest-neighbor neurons. The class labels which make up the majority of all class labels in the hypersphere then defines the input sample's class label. A third possibility is aimed at performing Bayesian classification (BayesC). The idea is to determine the class-conditional density estimates separately, combine them with the estimated class probabilities (i.e. the number of neurons in each class), and select the class for which the posterior probability is the largest. A fourth possibility is to determine the cluster means, determine for each input sample the closest cluster mean, and label the input sample accordingly (meanC). All of these methods are embodiments ofthe present invention.

For the sake of clarity, the stages involved in density-based clustering with kMER and hill-climbing are listed, which represents a further embodiment ofthe present invention.

3.1. Stage 1: kMER This stage is completely identical to that ofthe SKIZ-based approach.

3.2. Stage 2: Non-parametric density estimation

This stage is also identical except that only a spatially-discrete estimate, located at the neuron weights is retained from the density estimate. Contrary to the SKIZ method, one can choose to minimize the effort in determining the optimal smoothness parameters since, basically, the density estimate is only further used for defining the class labels ofthe neurons, and not for defining the class regions or boundaries.

3.3. Stage 3: Hill-climbing

Returning to the 3 Gaussians example. Since there still exists a free parameter k, i.e. the number of nearest-neighbor neurons around each neuron, in order to motivate its choice, the number of clusters found as a function of A: is plotted (Fig. 6B). It is clear that there are 3 clusters in the density estimate (longest plateau in the plot) and take k = 50. Finally, the cluster regions and boundaries are determined using the minimum Euclidean distance procedure (minEuC) (or using one ofthe other sample labeling methods suggested earlier).

4. Unsupervised classification performance 4.1. Gaussians example

Since the three Gaussians overlap, classification errors will be made: for the input samples shown in Fig. 3A, the error rate is 5.11% for the SKIZ case (Fig. 3F), and 5.78% for the hill-climbing case (Fig. 3H). The theoretically-expected Bayesian error rate is 5.43%. In order to be able to evaluate this performance in a statistical sense, the foregoing is repeated 11 times, starting from 11 randomly generated sample sets in total that are used for training the lattices. Clustering is then performed both with the SKIZ and the hill-climbing (HC) methods. For the latter, in addition to the minEuC labeling method, the meanC, the NNC and the BayesC methods discussed earlier are also used; for NNC, k_m is taken equal to 10 (5 or 20 did not yield significantly different results). Then the misclassification rates are determined on a ten times larger test set, the misclassification rates obtained are ranked, and the 6th one, i.e. the median is ranked. The median is taken, and not the mean, since otherwise the result would be strongly affected by the occurrence of even a single case where the number of clusters found is incorrect. The results are listed in Table 1 (first row). For the sake of reference, the expected Bayesian misclassification rate is also indicated

(last column).

In addition, the proportion of samples drawn from the three Gaussians, the total sample set size Mused for training, and the degree of overlap between the Gaussians are also modified. The number of samples drawn from the Gaussians located at (-0.4,

-0.3), (0.4, -0.3), and (0., 0.3) are indicated by the triple , / m₂ l m₃, with M= w, + m₂

+ m₃; the expected Bayesian error rate for each sample set configuration considered is listed (last column in Table 1). A hyphen signifies an incorrect number of clusters. It is observed that hill-climbing with meanC labeling outperforms the other labeling methods and the SKIZ method (except for the 300/150/50 configuration). Hence, the method continues with meanC labeling.

Table 1: Misclassification rates (in %) for various sample set configurations using kMER learning, the SKIZ and the hill-climbing (HC) methods. Since HC operates in conjunction with a sample labeling method, the performance of various alternatives are listed: minimum Euclidean distance labeling (minEuC), cluster mean labeling (meanC), nearest-neighbor- (NNC), and Bayesian classification labeling (BayesC). The expected Bayesian misclassification rates are listed in the last column.

For the sake of comparison, instead of kMER, the following competitive learning schemes are also considered of which some directly perform, or can be made suitable for topographic map formation: the SOM algorithm, Conscience Learning (CL; De Sieno, 1988), The Inter Markovian Artificial Neural Network (TInMANN, Van den Bout and Miller, 1989), Frequency-Sensitive Competitive Learning (FSCL: Ahalt et al, 1990; Galanopoulos and Ahalt, 1996), and the BDH algorithm (Bauer et al, 1996). In order to obtain density estimates from the lattice weights produced by these learning schemes at convergence, the fixed kernel density estimation procedure outlined above is applied, including the automatic choice ofthe smoothness parameter, eq. (12). Hill- climbing and meanC labeling are then applied. The misclassification rates are given in Table 2. Note that the kMER results are those of Table 1 listed in the meanC column.

It is observed that kMER outperforms the other learning rules when the overlap between the three Gaussians increases (cf. the 300/300/0 cases), and only breaks down when the Bayesian error rate > 16, and when the number of samples drawn from the last two Gaussians decreases. The SOM algorithm performs particular badly in this case: it tends to oversample the low probability regions hence, a relatively poor definition ofthe "influence zones" is obtained and this, in turn, adversely affects the classification performance.

Table 2: Misclassification rates (in %) obtained with the hill-climbing method and meanC labeling listed for different competitive learning schemes.

4.2. Bayesian classification

One ofthe applications of density estimation is the construction of classifiers using the Bayes' theorem (Bayesian classification). This assumes that one can model the class-conditional densities separately, and then combine them with prior class probabilities to obtain estimates for the posterior probabilities (C, | v), namely the probability that v belongs to the class C„ which in turn is used to make the classification decision. The probability of classification is then minimized by selecting class C,. for which the posterior probability is the largest: P(C_i. \ v) > P(C_j \ v), Vj ≠ i *. (16) Evidently, kMER can be applied in combination with VK, to model the class- conditional densities p(\ | C,) separately but then it is necessary to dispose of a priori class information during training. However, since each class is in fact underpinned by a discrete number of kernels (with Nv M), if it were possible to sort out, without using class labels, which set of kernels belong to which class, then the class-conditional densities could be determined and unsupervised classification performed.

In order to do so, hill-climbing can be usedto label the neurons, the density estimates ofthe neurons with the same label can be groupd and these estimates used for modeling the class-conditional densities; the prior class probabilities then simply follow from the fact that each neuron is active with equal probability (equiprobabilistic map). This can be done in an efficient ways as follows, the method being an embodiment ofthe present invention.

4.3. Classification

Once the neurons are labeled, the class-conditional densities can be determind and a given sample classified into class C,. when: p(v \ C_i.)P(C_i.) > p(v \ C_j)P(C_j), Vj ≠ i*, (17) or when its posterior probability satisfies:

P(C_i. \ v) > P(C_J\ v), V/ ≠ /*, (18) by virtue of Bayes' theorem, and since the unconditional density p(\) is class- independent. It should be noted that, since ∑ P(C_j | v) = 1 , an indication can be provided ofthe "certainty" of class C,. by listing the most important posterior probabilities: if none of them are important compared to P(C,, | v), then the classification is quite certain, whereas in the opposite case reservations should be made about the classification result and, depending on the application, a decision may be made not to classify, or to consider working with the (few) most important class labels, weighted by the corresponding posterior probabilities.

This classification process can be simplified, thereby keeping the absolute value ofthe posterior probabilities intact, as follows. According to the Bayes theorem, for each class:

_^ . p(\ \ )P( ) P(C, ) = ^F ' ^{J , J} (19) p(\) which can be rewritten as:

Piy ONc,

PiC, \ v) = - (20) z with N the number of RF regions of class C„ and Z = ∑ P(C_j | v) (normalization

N_c constant). Note that P(C ) = = — '^■ — , since with kMER each RF region is

equiprobabilistic. Finally:

4.4. Risk minimization

Instead of minimizing the probability of misclassification, in some applications this might not be appropriate, e.g. when the consequence of a false negative classification is much higher than a false positive classification. In order to take this into account, one can define a loss matrix L = [L_υ] of which each element Z_y is the loss incurred by classifying a given input to class C, when it actually belongs to C,. The risk is then minimized when:

∑c ,./?(v | C,.)P(C_t.) < Y_jcL_kjp(y \ C_J )P(C_J), V ≠ i* . (22)

A=l k=ϊ The L 's can be based on experience or expressed in a more quantitative manner, e.g. in monetary terms. 4.5. Class boundary

For low-dimensional cases, since the class regions may not be contiguous, or even not convex, the class boundaries can be approximated by discretizing the input space into bins and by looking for adjacent bins for which the classes differ. Evidently, this process is too time-consuming when the dimensionality increases. In addition, it is not a good idea to determine class labels for the bins and use them for classification; it is computationally more sound to determine for each sample the class-conditional density and apply eq. (17) to determine the class to which the sample belongs or a restricted number of classes, when the most important posterior probabilities are also recorded.

4.6. Outlier detection

A qualitative way to determine if a given sample v* is an outlier of class C„ is to compare the class-additional density ofthe sample, p(\* | ), with the class maximum max (p (v e V\ C,)). However, finding the maximum of a (multi-kernel) density function is no easy task since there are likely to be several local maxima. In addition, it can only provide a rough indication ofthe outlier probability since it is not expressed in terms ofthe probability to find samples with lower class-conditional densities. Indeed, even for the case where the class-conditional density is defined by a single Gaussian kernel, there exists no analytic expression for the outlier probability as a function ofthe class-conditional density or the distance to the class mean.

An obvious way to obtain a measure for the outlier probability is to determine the proportion of input samples (from a give set) for which the class-conditional density is lower than/?(v* | C,). However, it is clear that the computations required to evaluate the (Gaussian) kernels can quickly grow out of hand.

A computationally more efficient procedure is to consider the proportion of input samples which fall outside the subspace defined by the union ofthe hyperspheres with radii ||v* - w, ||, Vi e . For single, radially-summetric kernel classes, this value immediately yields an estimate ofthe true outlier probability; for multi-kernel classes it is a heuristic the quality of which improves when the class-conditional density becomes more peaked. When the input samples would not be available, or not sufficient in number, an efficient way to proceed, when the kernels are radially- symmetric Gaussians e.g., is to generate a set of input samples for a single d- dimensional Gaussian, say with unit radius and zero mean. For each Gaussian kernel, these samples can then be shifted and stretched according to the kernel's actual mean and radius, and the proportion of samples determined that fall outside the union of hyperspheres mentioned earlier. This procedure is then repeated for each kernel of the class. The outlier probability is then estimated by the average of this sample proportion

(since the kernels are equiprobabilistic).

5. Topology-preservation Achieved by Monitoring kMER

For any topographic map formation algorithm that operates on fixed-topology lattices, topological defects may occur when the neighborhood range is too rapidly decreased. Hence, metrics have been devised in order to quantify the degree of topology-preservation of a given map, such as the topographic product (Bauer and Pawelzik, 1992) and the topographic function (Villman et al, 1997). (The topographic product is explained below.) Furthermore, such a metric can be used for monitoring the degree of topology-preservation achieved during learning, at least in principle for the topographic function, since it is computationally much more intensive than the topographic product.

Also for kMER, there is a need to monitor the degree of topology-preservation achieved during learning. Evidently, this can also be done with the topographic product or alike. However, since kMER uses overlapping, spherical quantization regions (RF regions), instead of non-overlapping, Voronoi-based ones, we can take advantage ofthe overlap to assess the degree of topology-preservation achieved in the map, albeit in a heuristic manner. Assume that the equiprobabilistic scale factor p is chosen in such a way that the neurons ofthe untangled map have overlapping RF regions. In that case, provided that the neurons are equiprobabilistic, a given map is more likely to be untangled if the number of neurons that are activated by a given input is constant over the training set. Indeed, the number of neurons that will be activated at a topological defect will be higher than in an already untangled part ofthe lattice. Furthermore, if that number is constant, it also implies that the map is locally smooth. Hence, it is desired to adjust the neighborhood "cooling" scheme in such a manner that the variability in the number of active neurons is minimized over the training set. This "variability score", divided by the mean number of activate neurons, is then a metric, which will be called the Overlap Variability (OV) metric. The reasons why it is advisable to develop such a metric are three-fold. First, the usual topology-preservation metrics are not very sensitive in detecting small topological defects, or in distinguishing them from locally non-smooth portions of the map. Secondly, assuming that the lattice is already untangled, the topographic product yields lower scores for larger neighborhood ranges. Thirdly, the usual metrics are quite heavy to run as a monitoring tool, during the learning phase.

6. Algorithm

Consider batch learning with t the current epoch and t_max the maximum number of epochs in the current simulation. For each input sample v", of a training set M of M input samples, determine the number of neurons that are active at epoch t: JV"(/) = ∑1,( . (23)

For the whole training set, the mean and standard deviation of this quantity are calculated:

M _μ _ \ sdπ (t =< (W(t) - mean (t)f >_M ² (24)

1

M

The Overlap Variability metric is then the ratio of these two quantities:

OV = ^$d"^{t) , (25) mean^ (t)

It should be noted that in the kMER algorithm, the number of neurons that are active at each epoch t are already determined, hence the overhead is minimal since only the mean and the standard deviation need to be calculated.

The following is defined as the neighborhood cooling scheme:

^σΛ ( = cr_Λo ^exP ^■ 2σ Λ0 ^■ϊov ,(26) max with σχ₀ the initial range and γ_os a parameter that controls the slope ofthe cooling scheme ("gain").

The monitoring algorithm proceeds as follows. (Note that the algorithm is exemplified in the simulation example discussed next.)

1. Since a random initialization ofthe weights are used at the start, the map is first trained with a constant neighborhood range, namely <τ_Λ0, during a fixed number of epochs, in order to obtain a more or less untangled lattice: this lattice then serves as a common starting point for all simulation runs. Let this lattice be called^⁰.

2. One complete simulation run is then performed, namely, until the neighborhood range vanishes. This is run number 1. γ_ov = 1 in eq. (4) - the superscript 1 refers to the first run.

3. Next, the number of epochs, and the corresponding neighborhood range are determined, for which O is minimal. These are labeled as follows: t¹ epochs,

, and OF¹.

4. A new run is performed, runy, 1 <j ≤ max_^^, starting from lattice A⁰, with the same cr_Λ0, but using t_m ^j _ax <— 2t ^-1 epochs and by ensuring that it ends the simulation with a neighborhood range identical to the previously optimal one, σ^J~ , plus a margin: ^σ _{A a} j⁼ 0-9<?_Λ ^-1 • ^{m m}i^s way, the method cools at a slower rate, but the simulation is only run as long as necessary. In order to achieve the proper neighborhood cooling scheme, the gain is adjusted as follows:

' Λ ΛΟO

Yov (27)

^2σΛ0

5. Next t^J and σ{ are determined. (It should be noted that, for they^'th run, it could happen that the minimum coincides with the last epoch: in that case, this minimum is taken and the method continues.) 6. As long as OV^J < OV^J' in a significant manner, go to step 4, else stop.

Several types of stopping criteria are possible. The following have been used: when two successive runs lead to higher OFs, then stop and retain the map which corresponds to the best OF- value. If a run yields a higher OF- value than the previous run, then start the next run with 1.5 times longer t_max.

7. Example

Consider the two-dimensional "curved" distribution shown in Fig. 8 (M= 500 samples). The shape ofthe distribution is too nonlinear to be described by a straight line, namely, the principal eigenvector (dashed line). The best description is a "principal curve" which is defined in such a manner that the center of gravity between two infinitesimally separated curve normals coincides with the curve itself. The distribution was generated by randomly sampling the circle segment (thick line), defined between 0.3 and π - 0.3 rad, and by adding Gaussian white noise (σ = 0.1) to both coordinates. Hence, ideally, the principal curve estimate should coincide with the circle segment.

A one-dimensional lattice (i.e. a chain) sized N= 25 neurons is developed. The same initial weights are taken as for the SΟM algorithm. In addition, the RF radii of kMER are initialized by taking samples from the uniform distribution [0.3; 1.3]. kMER is applied in batch mode and with p= 2 and the learning rate η= 0.001. The monitoring results are summarized in Fig. 9.

Step 1 of our monitoring algorithm is first performed and kMER run over 25 epochs with σ_A0 = 12.5. The OF and the neighborhood cooling plots are shown in Fig. 2A (thick and thin continuous lines). It is observed that, after a transitional phase, OF stabilizes. Then step 2 ofthe monitoring algorithm is performed over t_m ^l _ax = 100 epochs. A clear minimum is observed, which is located at t¹ = 12 epochs (step 3) (OF = 0.394). Step 4 of the algorithm is then performed and kMER run over t_max = 2 12

= 24 epochs and with γ_ov = 0.124 (thick line in Fig. 9B). The minimum OFis now at t_m ² _ax = 21 epochs (step 5) (OF= 0.378). Since OF² < OF ' (step 6), the method continues (step 4). Finally, since runs 6 and 7 do not lead to better OF-values than that obtained at epoch t⁵ = 118 in run 5 (thick line in Fig. 9C), the monitoring algorithm is stopped (OV= 0.373).

For the sake of comparison, the topographic product (TP) is plotted (thick dashed line in Fig. 9A). Note that the desired P-value here is zero (thin dashed line). However, unlike the overlap variability, there is no clear optimum in the topographic product plot.

Finally, the lattices are shown obtained without and with monitoring in Fig. 10A, B, respectively: the former is the result of continuing the first run until 100 epochs have elapsed, while the latter is the result ofthe fourth run. The effect of monitoring is clearly seen: the lattice is untangled and closely matches the theoretical principal curve ofthe distribution (dashed line). The results for M= 4000 samples are also shown (Fig. 10C). It is observed that the lattice is smoother and even closer to the theoretical curve than in the M= 500 case, as expected, since the RF radii will be better defined and thus also the RF centers (neuron weights).

8. Topographic product

The topographic product metric was originally introduced in the context of nonlinear dynamics and time series analysis ("wavering product") (see Bauer and Pawelzik, 1992). The following definitions are required. Consider a lattice A ofN neurons. Let r, be the lattice coordinate of neuron i and let i(k, A) be the k-th nearest neigbor neuron of neuron , with the nearest neighbor defined in terms of the Euclidean distance d in lattice coordinates:

(28) Similarly, let i(k, V) be the k-t nearest neighbor of neuron i with the nearest neighbor defined in terms ofthe Euclidean distance in the input space F. Define the following two ratios:

Following these definitions Qd , k) = Q₂(j, k) = \ only when the nearest neighbors of order k in the input and output (lattice) spaces coincide. However, this measure is still sensitive to local variations in the input density. In order to overcome this, the Q values are multiplied for all orders k. After some algebraic manipulations, the following equation is obtained:

which in turn needs to be averaged in a suitable manner. Bauer and Pawelzik suggest the following averaging procedure, which also results in the definition of topographic

product s: (31)

What is desired for TP is that it should be as close as possible to zero for the lattice to be untangled as much as possible. (Note that TP can be larger or smaller than zero.)

9. Running kMER in high-dimensional spaces The typical real- world application of data analysis, e.g. data mining, involves analyzing multi-dimensional data to extract occult relationships. In accordance with an embodiment ofthe present invention this involves the integration of several components: kMER learning, density estimation and optimal smoothness determination, clustering with hill-climbing technique or similar, input sample labeling, and the effect ofthe input dimensionality on the system's performance. In accordance with the present invention hierarchical clustering may be performed. This involves determining major clusters in a first application of kMER (level 1) followed by a further application of kMER on individual clusters in level 1 to resolve more detail (clusters, level 2) and so on. The result of repeated clustering and application of kMER is a tree of interlinked datanodes. An example tree of datanodes is shown in Fig.19. The present invention also includes use of kMER to generate the first level clusters followed by application of other methods to analyze these clusters further.

The clusters are extracted in each level after smoothing, density based clustering and labeling. Smoothing is a form of regression analysis which renders the topographic map in a more suitable form for clustering by eliminating local discontinuities and irregularities. Density based clustering after smoothing is preferably performed by hill-climbing but the present invention is not limited thereto.

Labeling is the association of data with the relevant cluster. This operation must be accurate, correct data should be associated with the appropriate cluster if the total data model is to remain consistent despite distributed processing (Figs. 1 and 2). With each datanode in the tree, the related subset of data is extracted from the complete data. A datanode is a kernel-based processing engine (e.g. a neural network) which represents a part ofthe complete data model generated from the complete data as well as the associated data. A datanode in accordance with the present invention is also a data model. This extraction of data is safe because ofthe linear mapping between actual and represented data density in the topographic map produced by kMER.

The most generalized version ofthe data model forms the top datanode - i.e. level 1, Fig. 19. Repeated application of kMER on the clusters generated in one level resolves more detail, i.e. generates more clusters. At some point there is little or no detail left for kMER to identify in clusters. Then, further development ofthe tree may be stopped. Clusters which cannot be resolved any further are leaf datanodes. The leaf datanodes in Fig. 19 are surrounded by ellipses. Leaf datanodes may be generated in any level, in Fig. 19 they appear at level 2.

In the following aspects ofthe present invention mentioned will be described with reference to an application which is particularly difficult due to its high dimensionality (512 spectral components), and the different levels of similarity that can be identified in the music signal's spectrogram: the similarity between the spectra of different musical instruments playing the same note, and the similarity between the spectra of different notes played by the same musical instrument. Although the application is provided as an example, its output is tangible and useful, namely a separation ofthe notes and instruments. In this example principal components analysis is used. This is however, purely optional. In normal use the data to be processed is fed directly into the kMER processing of Fig. 11. Optionally, data preparation may be performed on the data prior to application ofthe kMER algorithm. Hence, in the normal case, the method step "FFT" of Fig. 11 may be a data preparation step.

In the following reference will be made to a two-dimensional topographic map but the present invention is not limited thereto. Preferably, the topographic map has less dimensions than the input space as this results in a data model which is accurate but compressed in size, but the present invention is not limited thereto.

The music signal was generated on a Crystal 4232 audio controller (Yamaha

OPL3 FM synthesizer) at a sampling rate of 11,025 Hz, using the simulated sound of an oboe, a piano and a clarinet. Eight notes (one minor scale) were played on all three instruments consecutively, namely "F", "G", "Ab", "Bb", "C", "Db", "Eb", and the eight note is an "F" again but one octave higher. This results in a signal track of 14 seconds. By observing the fundamental frequencies only, one can clearly see the repetition of eight notes of increasing pitch, played by three instruments consecutively

(Fig. 19). Since the music signal is computer generated, there is hardly any variation on the steady state parts. Hence, the distribution which underlies these parts is discrete (probability distribution). In order to obtain a continuous distribution (probability density or briefly, density), uniform white noise has been added to the signal track (SNR = 12.13 dB). (In fact, Fig. 19 shows the noisy spectrogram.)

The music signal's spectrogram was obtained by computing the Short-Time Fourier Transform (STFT) every 256 samples, using time windows (frames) of 1024 samples (no windowing has been applied). This results in a training set of M= 621 spectra of which only the amplitude components are retained (the DC component is omitted). Furthermore, since features are sought that are invariant to a scaling ofthe music signal amplitude, all amplitude spectra are normalized.

10. System overview

Since the spectral data are high-dimensional, and since more than one interpretation can be given to describe their similarity, a hierarchical clustering analysis is performed. Several levels in the analysis are distinguished, of which two are developed in Figs. 11 and 12.

10.1. Level 1

A Principal Component Analysis (PCA) ofthe STFT ofthe music samples

s(t) = [s(t), ..., s(t - 1023)] is performed first and their amplitude spectra F(s) are projected onto the subspace spanned by the first k_PC Principal Components (PCs) ofthe training set (boxes labeled "FFT" and "PCA" in Fig. 11). The projected result is indicated by the vector v = [v, ..., v„ ..., v_k ] e Vςz 9?⁵¹², with v, the length ofthe projection onto the ith PC vector. A two-dimensional rectangular lattice in this subspace ("kMER") is then developed and the converged lattice used for density-based clustering ("Clustering"). The following steps are distinguished. Firstly, the converged lattice is used for estimating the density distribution underlying the projected amplitude spectra. The scale factor ofthe kernel-based density estimate is optimized in order to obtain the optimal degree of smoothness. Finally, density-based clustering is performed, using the hill-climbing technique, and labels assigned to the neurons ofthe same clusters ("Clustering"). Following the "Clustering" stage, the individual amplitude spectra are labeled ("Labeling") using the minimum Euclidean distance labeling technique (minEuC). The labeled spectra are represented as F¹ (1, i, s), where the superscript indicates Level 1, the parameter 1 the inherited cluster label (which is set to 1 by default, since Level 1 is the "root" level ofthe clustering tree), and the cluster label that music signal tract s receives at Level 1.

As a result ofthe combined clustering/labeling procedure, a tessellation is performed on the spectral space, V- , C„ with C, the non-overlapping subspace of cluster i. However, contrary to a Voronoi tessellation, the subspaces are irregularly- shaped and thus not necessarily convex. Furthermore, since the present subspaces cannot be expressed in terms of projections onto sets of orthogonal basis vectors, the present subspace definition is different from that used in the conventional transform coding methods such as Oja's subspace classifier (1983) and the mixture of principal components (MPC) method (Dony and Haykin, 1995; Haykin, 1996). In addition, since the kernel centers and radii, which are underpinning the clusters, are not spectral bandpass filters, the approach in accordance with the present invention differs from Kohonen's ASSOM (Kohonen, 1995; Kohonen et al, 1997), albeit that the neuron weights could be inteφreted as "features" and, thus, the map as a topographic feature map.

10.2. Level 2

Since the density estimate is expected to be very different in each cluster, the analysis is continued and a "clustering within the clusters" is performed: the Level 1 procedure is repeatedfor each cluster separately (ith Level 2 analysis in Fig. 12), starting from the amplitude spectra which received the same Level 1 label, F¹ (1, , s). However, the refinement of a given Level 1 cluster is only done when there is more than one cluster at Level 1, and the Level 2 result is accepted when the clustering analysis is valid. In order to test the latter, a simple heuristic - called the "Continued

Clustering Analysis" (CCA) test - is developed: it should be passed before the clustering results at Level 2 are accepted; in the opposite case, no "zooming in" is performed and the Level 1 cluster label is a leaf node ofthe clustering tree.

As a result ofthe Level 2 clustering analysis, the amplitude spectra receive an additional Level 2 label ("Labeling" in Level 2). Hence, in this way, a hierarchical clustering analysis ofthe amplitude spectra is performed. The labeled spectra are represented as ¥²(i,j, s), where the superscript indicates Level 2, i the cluster label inherited from Level 1, and/ the cluster label received at Level 2.

10.3. Additional levels

It is clear that this "clustering within the clusters" can be continued, and a third level added to the clustering analysis, and so on, as long as the CCA test is passed.

11. Detailed description and simulations

11.1. Level 1

PCA In order to assess the effect ofthe dimensionality ofthe input to the kMER algorithm, the PCs ofthe training set of spectral data are computed, after subtracting the set average. The dimensionality ofthe spectral data is then reduced by projecting them onto the first k_PC PCs (k_PC = 16, 32, 64, 128, 256, 512). In this way, the system's performance can be verified for different dimensionalities and thus for different degrees of sparseness ofthe space Fin which the lattice is trained with kMER. When

plotting the normalized cumulative distribution ofthe eigenvalues, — Σ —» —^ΛJ , the 90%

level is reached for k_PC = 18, the 95% for k_PC = 32, and the 99% for k_PC = 144. Only the k_PC = 16 case is visualized, but the other cases as well are discussed.

kMER A lattice A, sized N= 20 x 20 neurons, is trained with kMER in batch mode, for p_r = 2, using the monitoring algorithm. In accordance with the latter, kMER is run for 50 epochs using the usual

Gaussian neighborhood function but with a constant range (σ_A0 = 10, η= 0.01). The monitoring procedure reaches its lowest Overlap Variability (OF) when decreasing the neighborhood range over 250 epochs with gain parameter y_oy = 0.3 in eq. (4) ( η =

0.001). Finally, an additional run over 50 epochs is performed, thereby keeping constant the "optimal" neighborhood range determined with the monitoring algorithm (η= 0.001): this is done in order to average out the weight updates for the whole training set. As a result, the "converged" lattice is obtained.

Density estimation The method proceeds by determining the variable kernel density estimate p , which corresponds to the converged lattice, by exchanging the neuron weights w, and RF radii σ_t by Gaussian kernels K with centers w, and kernel radii p_sσ_t. The next step is to determine the optimal degree of smoothness. However, in order to reduce the computational effort, since optimizing for p_r requires several runs ofthe monitoring algorithm (and thus of kMER), only the scaling factor/^ is optimized, but accordingly the definition ofthe fixed kernel estimate is slightly modified:

with σ* - < σf>_A and Z the usual normalizing factor. Note that p_s ■ σ* appears in eq. (32) instead of only σ* as in eq. (13). This is done in order to take into account the fact that optimization for p_r is not carried out. Indeed, as can be shown by batch mode simulations, when in addition to p_s, p_r is also optimized, then p_s » l and p_r varies on a large scale. By consequence, it is expected that, when p_r is kept fixed, as in the current application, p_s is now going to vary on a much larger scale than before, so that a similar degree of smoothness is needed in the fixed kernel estimate and thence, the term a* in eq. (13) is replaced by p_s • σ* in eq. (1).

Furthermore, in the original density-estimation procedure (see eq. (10)), equivolume Gaussian kernels defined in kMER's input space Fhave been used.

However, since Gaussians scale inversely proportional to σ , numerical problems are bound to occur when d becomes large, as in the present case. In order to solve this problem, an embodiment ofthe present invention makes use of an approximate representation ofthe input data density. In particular, the representation preferably has less dimensions than the input data space, for example, use is made of a principle manifold. In a particularly preferred embodiment ofthe present invention advantage is taken ofthe fact that the two-dimensional lattice forms a discrete approximation ofthe possibly nonlinear manifold in kMER's input space, and a two-dimensional density estimate with respect to this lattice is developed. The variable kernel density estimate then becomes:

The "optimal" /? value for which M SE\p , p_p ^* j is minimal is then developed:

P_s (34)

thus, with respect to the training set { v"} (or a randomly selected subset of it), rather than with respect to the neuron weights, as done in eq. (15), which in practice becomes impossible for high-dimensional spaces. Figure 13A shows M S E p_p , p^* j as a function of p_s (Δ/?_s = 0.625). The optimal value for p_s was 14.375 (note that p_r was fixed at 2).

Finally, the density estimates are determined at the neuron positions, since these are needed by the clustering procedure discussed next.

Clustering The hill-climbing clustering procedure described above is applied to the density estimate p . Figure 14A shows the number of clusters found as a function of

N k, where 1 < k ≤ — . It is clear from the longest plateau that there are 8 clusters. A

cluster map is developed for the smallest value of & that yields 8 clusters (Fig. 15). Eight contiguous labeled areas are observed, indicating that the neighborhood relations in the input space are preserved, i.e., that there is indeed a topographic map.

Labeling Using the cluster map, the spectra in the training set are labeled by adopting the minEuC labeling technique. Figure 16 shows the (thresholded) normalized spectrogram ofthe noisy music signal and the labeling results for £_pc = 16. With respect to the 621 training spectra, 69 of them receive the same label, namely, the note "Ab" played by the three instruments. 11.2. Level 2

The spectra with the same Level 1 label are now further considered for analysis in the Level 2 part of the system, and the results accepted when the CCA test passes:

1. when there is a valid minimum in the optimal smoothness curve, namely one that is different from the degenerate case p_s o, and

2. when in the plot ofthe number of Level 2 clusters n as a function of the number of nearest-neighbor neurons k, the trivial case n - 1 is not reached for k less than one- third of N, the total number of neurons in the lattice.

Since the labeled spectra constitute a new training set, the PCs of this set are computed, and the dimensionality ofthe labeled spectra is reduced by projecting the labeled spectra onto the first k_PC PCs (k_PC = 4, 8, 16, 32, 64). Note that the number of PCs should at least be lower than the number of labeled spectra in the training set. As an example, the spectra labeled as "Ab" at Level 1 are considered. When plotting the normalized cumulative distribution ofthe eigenvalues, the 95% level is reached for k_PC = 16 and the method continued with this value. Similar to Level 1, a lattice sized N= 7 x 7 neurons is developed using kMER and the monitoring algorithm. Note that the lattice is now about 10 times smaller than the one used at Level 1 since there is the order often times less labeled spectra to work with. Again in accordance with the monitoring algorithm, kMER is initially run with a constant neighborhood range, σ^ = 3.5, during 50 epochs (η- 0.01; p_r - 2), and the method continues to train the lattice but now by decreasing the neighborhood range (η - 0.001 ; p_r = 2). The monitoring procedure generates the lowest OF- value when the neighborhood range is decreased over 670 epochs (γ_ov= 0.685; η- 0.001). Finally, an additional run of 50 epochs is performed with the "optimal" neighborhood range. Gaussian kernels are then allocated at the neuron weights, the optimal smoothing factor of the resulting density estimate determined (p_{s opt} = 0.475, Ap_s =

0.0375) (Fig. 13B), hill-climbing clustering performed on it, and the number of Level 2 clusters determined (Fig. 14B). Based on the results shown in Figs. 13B and 14B, it is concluded that the CCA test has been passed. Finally, the cluster map is determined

(Fig. 17). The effect of k_PC is: for k_PC ≤ 32, 3 clusters are found, but for k_PC = 64 a tie for 2 and 3 clusters is found.

Finally, for k_PC = 16, the 69 spectra that received the label "Ab" at Level 1, now receive the labels "oboe" (22 spectra), "piano" (26), or "clarinet" (21) at Level 2 (Fig.

18).

11.3. Additional levels

The spectra of all notes and instruments can be identified after two levels of clustering are performed, except for the two "F" notes; although they differ by one octave, they have a similar harmonic structure. The corresponding spectra are grouped into one cluster at Level 1 (144 in total). At Level 2, two clusters are found: one of them combines the clarinet and the piano that play the "F" note with the lower pitch (48 spectra), and the other combines the "F" note with the higher pitch (for all instruments) and the "F" note with the lower pitch for the oboe (96 spectra). At Level 3 the former cluster decomposes into two clusters, one for the clarinet and the other for the piano (24 spectra each), and the latter cluster decomposes into 2 cluster of which one represents the "F" note with the higher pitch for the clarinet (25 spectra) and the other represents a combination (71 spectra) which is, finally, at Level 4, decomposed into the "F" note with the lower pitch for the oboe (24), and the "F" note with the higher pitch for the piano (25) and the oboe (23). Hence, the decomposition into the "F" note with the lower and the higher pitch, and the musical instruments that play them is complete.

In summary, two levels in the clustering tree suffice for labeling all notes and instruments except for the two "F" notes where four levels are needed (Fig. 19).

12. Alternative density estimation

Since a kernel has been allocated at each neuron weight, instead of at each input sample, an alternative interpretation of density modeling with kMER can be formulated. The connection with the models could be considered in which the density p(\) is represented by a linear superposition of component densities:

N p(v) = ∑p(v \ i)P(i) . (35) ι=l Such a representation is called a mixture distribution. The term P(i) is the prior probability of input sample v being generated from component ofthe mixture and p(\ I i) the zth component density. The P(i)'s are often regarded as mixing parameters, in addition to the parameters which specify the component density function. For the mixing parameters:

∑P( = 1 , (36) ι^'=l with 0 < P(i) ≤ 1, and for the component density functions:

Since the link with prior probabilities and conditional densities has been made, the corresponding posterior probabilities can be defined, using Bayes' theorem:

. p(v \ i)P(i)

P<! \ v = 'Λ (38) P(v) with ∑. . P(i I v) = 1. For the component densities, usually (d-dimensional) Gaussians are assumed with diagonal covariance matrices (i.e. circular-symmetric Gaussian kernels):

with w, and σj the kernel center and radius.

It is clear that kMER's density estimation format can be re-written in terms of a mixture distribution, with the component densities represented by the kernel functions

K, but now with unit volumes, and with the prior probabilities all being equal to — .

N According to this format, an input sample v can be generated by first selecting one of the components i at random, with probability P(i), and then by generating the input sample from the corresponding component density p(\ | )■ As far as (pattern) classification is concerned, when viewed as functions of their parameters, the component densities are often referred to as likelihood functions for the observed values of v.

Mixture models belong to the category of _se/ -parametric density estimation methods, since a specific function is assumed for the kernels and since the number of them, N, can be treated as a model parameter. This offers the advantage that the size of the density model can be adjusted to the complexity ofthe input sample set, rather than to the sample set size M, as is done in the classic VK method (where a kernel is allocated at every input sample). The disadvantage is that the model has to be trained and this can be computationally expensive.

Density estimation with kMER, using the VK format, can be considered as a semi-parametric method when the fact is emphasized that the number of kernels is much lower than the number of samples, N « M , but, since a fixed lattice is used, N does not vary during training. Furthermore, no assumption is necessary of a specific type of kernel function during training, since a pilot density estimate is generated with which the parameters ofthe variable kernels can be chosen. There is also a more fundamental distinction between the two training procedures which is discussed below.

Finally, it should be noted that the importance ofthe mixture models is not only restricted to density estimation. The technique also finds its applications in other neural network areas, i.e. in configuring the basis functions in radial basis function (RBF) networks, in conditional density estimation, soft weight sharing, and the mixture-of-experts model (for an overview, see Bishop, 1995). In the same spirit, kMER has also been considered recently for configuring RBF's (Ridella et al, 1999).

12.1. Connection with pattern classification

There is a strong similarity between the mixture model eq. (35) and the mixture of classes representation in pattern classification:

p(y) = ∑_P(y \ C, )P(C_l) , (40) ι=l for c different classes C,, ..., C_c, and for a continuous pattern vector v. Indeed, the component densities can be regarded as the equivalent of class-conditional densities, and the mixture parameters as prior class probabilities (but in practice, the class- conditional densities will be supported by several component densities). The key difference between the two lies in the nature ofthe training data since, in the mixture model case, no "class labels" (unlabeled data) are provided.

12.2. Distinction to maximum likelihood learning The most widely used techniques for determining the parameters of a Gaussian mixture model are based on maximizing the likelihood ofthe model parameters for the given input sample set (for a review, see Redner and Walker, 1984) which is, for most cases, expressed in terms of the EM algorithm (Dempster et al, 1977).

For the Gaussian mixture case, the following parameters have to be estimated: the mixing parameters P(i), the kernel centers w„ and the kernel radii σj, / = 1, ..., N.

The negative log-likelihood for the input sample set is given by:

M M f N

E = -ln£ = -∑ln/>(v") = -∑ln ∑/>(v" | i)P( ) (41) /=! μ=\ . 1=1 with L the likelihood and E a cost function. Maximizing the likelihood is then the equivalent to minimizing E. When taking for the mixture parameters the softmax function (normalized exponential): 0 -# £)_, (42) ∑_j exp( ,) and when optimizing the gradients with respect to E, one obtains the following interpretation for the model parameters w, and σ_t. For the kernel centers one obtains:

∑^M _u P(i v")v" w, = ^• (43)

ΣlA> ^V) which says that the kernel center is the weighted mean ofthe input samples, weighted by the posterior probabilities, for the kernel radii the following is obtained: y^μ - W;

-Ά=I

<τ.² = i Σl/o^' (44)

which corresponds to the weighted variance ofthe input samples, and finally, for the mixing parameters: ι M

(45)

which signifies that the prior probability for the ith component is obtained from the average ofthe posterior probabilities for that component.

It is clear that kMER starts from different assumptions and determines the kernel centers and radii in a completely different manner, but there is also a more subtle difference when density estimation is concerned. In the maximum likelihood method, first the kernel centers and radii of the component densities p(\ \ i) are determined, and then the prior probabilities P(i). In kMER, the prior probabilities are determined in such a way that they become all equal, by adjusting the kernel radii, and at the same time the component densities are determined since the kernel centers are also adjusting during learning. Or, in other words, the prior probabilities are not considered to be additional model parameters.

13.0.0 Regression Embodiments In the following a specific example ofthe application ofthe above embodiments is described with respect to distributed regression modeling as a further embodiment ofthe present invention. In accordance with this embodiment the data set is partitioned into one or more subsets for which data and regression modules are developed separately. The subsets consist of either complete data vectors or incomplete data vectors, or of a mixture ofthe two. The data vectors are incomplete since they contain missing vector components. This could be due to a partitioning into subspaces. The ensemble ofthe data modules form the data model and, similarly, the regression modules form the regression model. This embodiment ofthe present invention minimizes the need for communication between the data and regression modules during their development as well as their subsequent use: in this way, delays due to communication are minimized and, since only model parameters need to be communicated, issues concerning data ownership and confidentiality are maximally respected and data transfer is reduced to a minimum.

In non-parametric regression analysis, an unknown mathematical function/is modeled given a finite number of possibly inaccurate covariates or data points, without requiring any a priori assumptions other than perhaps a certain degree of smoothness. Hence, contrary to standard curve fitting procedures, no regression function parameters are optimized. The unknown function can be a scalar or a vector function. In the scalar (univariate) case, a function^ -J\x), with x e Fc 5R^d, needs to be estimated from a given set of M possibly noisy input samples:

y"

+ "oise, μ = 1 , ..., M, (46) where the noise contribution has zero mean and is independent from the input samples x^μ. In the vector (multivariate) case, the function y = x), with y e 9^ is estimated as follows:

y^μ = (x^μ) + noise, μ = 1 , ..., M, (47)

The vector case is often treated as the extension ofthe scalar case by developing d_y scalar regression models independently, one for each vector component.

The regression performance improves if the number of "knots" in the model are not fixed but allowed to depend dynamically on the data points: indeed, "dynamic" knot allocation is known to yield a better regression performance than its static counteφart (Friedman and Silverman, 1989). It should be noted that the "knots" are the points in F space that join piecewise smooth functions, such as splines, which act as inteφolating functions for generating values at intermediate positions. Alternatively, kernels are centered at these points (kernel-based regression).

Traditionally, kernel-based regression proceeds as follows. One can localize at each training sample x^μ a kernel, such as a circular-symmetrical Gaussian, with a height equal to the corresponding desired output value y"; all Gaussians have the same radius (standard deviation) which, in fact, acts as a smoothing parameter. Formally, the output ofthe regression model, for a given input x, can be written as:

in the univariate (scalar) case, and

O_i(x) i = \, ... ,d_y , (49)

in the multivariate (vectorial) case. In principle, cris directly related to the noise distribution ofthe input signal however, such a measure is not always available. On the other hand, one can always determine the radius by cross-validation (see further). The approach just described is adopted in the general regression neural network (GRNN) (Specht, 1991).

Alternatively, one can choose to localize only a restricted number of kernels, for example, N circular-symmetrical Gaussians. The kernels are normalized so that an inteφolation between the kernels centers can be carried out, e.g., using the normalized

Gaussian:

The output ofthe regression model for a given input x is then defined as:

N

0(x) = ∑W_jgj (x,p_s) (51)

7=1 in the univariate (scalar) case, and

N

O_l (x) = ∑W_lJg_J (x,p_s), / = 1, . (52)

in the multivariate (vectorial) case. The W_} and W_y terms are additional parameters of the respective regression models. The output ofthe univariate regression model is intended to approximate the desired input/output mapping:

0(x^μ) ≡ y^μ _* y^μ = f(x^μ), V/Λ (53)

Similarly, the output vector ofthe multivariate regression model can be written as:

O(x") = y" * y" = /(x , Λ (54)

with O = [O,]. The positions (centers) w, ofthe Gaussians are chosen in such a manner that the regression error, i.e., the discrepancy between the output ofthe model and the desired output, for a given set of input/output pairs is minimal; the radii are chosen in an ad hoc manner. This is basically the Radial Basis Function (RBF) network approach introduced by Moody and Darken (1988). One could also extend this minimization of the regression error by including a common kernel radii σ, or even by including variable radii in this process (Poggio and Girosi, 1990).

In summary, the parameters ofthe kernel-based regression model, i.e., the regression weights W_] and W_φ kernel centers w, and possibly also the common kernel radius cror the variable radii σ„ are optimized by minimizing the regression error. This is usually done by a learning algorithm which iteratively adjusts these parameters until the regression error becomes minimal (e.g., for a separate test set of input/output pairs).

The foregoing may be described as a monolithic approach: all model parameters are optimized or chosen with a given regression task in mind. The present embodiment provides a modular one: in essence, the kernel centers w, and kernel radii σ, are first determined separately from the regression task. This is done with an optimization procedure that operates on samples x^μ drawn from the input distribution (in F-space). This results in what is called a data model. Then, for a given regression task and, thus, for a given set of input/output pairs, only the regression weights W_j and W_y are optimized so as to minimize the regression error; the data model parameters are kept constant. This second, regression model is specified by the regression application at hand. The data and regression models are, thus, developed separately, and they operate in sequence (vertically modularity, see Fig. 20A).

The data model is developed in accordance with the kernel-based topographic map formation procedure described above. As a result, the kernel centers w, and kernel radii σ, are obtained in such a manner that an equiprobabilistic map is obtained: each kernel will be activated with equal probability and the map is a faithful representation ofthe input distribution. The present embodiment provides the following advantages:

• the regression surface will be locally more detailed when locally more samples are available and vice versa • several regression applications can be grafted onto the same data model (Fig. 20 B)

• the set of input samples x^μ can be split into subsets for which separate regression modules can be developed. The ensemble of these modules then forms the regression application (horizontal modularity, see Fig. 25A)

• the input space can be split into subspaces for which regression modules can be developed that together form the regression application (horizontal modularity, see

Fig. 25B)

• the input samples x^μ may be incomplete, i.e., some ofthe vector components may be missing (incomplete data handling)

Hence the present embodiment is modular in two ways: vertically modular (data model and regression model) and horizontally modular (data modules and regression modules).

13.0.1 Vertical modularity embodiment 13.1. Data model

The data model consists of a lattice A, with a regular and fixed topology, of arbitrary dimensionality d_A, in ^-dimensional input space Vςz 9t To each ofthe N nodes ofthe lattice corresponds a formal neuron which possesses, in addition to the traditional weight vector w„ a circular- (or hyperspherical-, in general) activation region S„ with radius σ„ in F-space (Fig. 21 A). The neural activation state is represented by the code membership function:

with x € F. Depending on 1, , the weights w, are adapted so as to produce a topology- preserving mapping: neighboring neurons in the lattice code for neighboring positions in F-space. The radii σ, are adapted so as to produce a lattice of which the neurons have an equal probability to be active (equiprobabilistic map), i.e.,

(l, (x) = l) = — , Vi , with p a scale factor.

13.1.1. Training the data model

The data model is trained as follows. A training set M = {x^μ} of M input samples is considered. In batch mode, the kernel-based Maximum Entropy learning Rule (kMER) updates the neuron weights w, and radii σ, as follows (Van Hulle, 1998, 1999b):

Δ , = 7 ∑ ∑Λ(ι, , σ_Λ ) Ξ_y (^) _Sgn(x^ -w, ), and, (56) x"eM jeA

^Δ pN with p_r =— — , and Λ(.) the usual neighborhood function, e.g., a Gaussian, of which N - p the range σ_A is gradually decreased during learning ("cooling"), and Ξ, the fuzzy code membership function of neuron i:

so that 0 < Ξ,(x) < 1 and ∑.Ξ_i(x) = \ . p_r > 1 is used so that the activation regions overlap. Finally, the previous batch learning scheme is also available for incremental learning.

The characteristics ofthe trained data model can be summarized as follows: 1. the lattice forms a discrete estimate to a possibly non-linear data manifold in the input space F 2. the lattice defines a topology-preserving mapping from input to lattice space: neighboring neurons in the lattice code for neighboring positions in F-space 3. more neurons will be allocated at high density regions in F-space ("faithful" map, Van Hulle, 1999b).

13.1.2. Incomplete data handling

The input data x = [ j used for training could be incomplete: for some vector components there is no value available. These vector components will be treated as "don't cares", X_j = x. In particular, the active neurons are determined and the weights and radii are updated only by using the input vector components for which a value is available. The adapted kMER algorithm is given in Fig. 22, in batch mode, but it can equally well be shown for incremental learning.

Filling-in missing entries

Once the data model is developed, the missing entries can be filled in by applying the incomplete input vector to the map, the neuron with the closest weight vector can be determined, by ignoring the missing vector components, and the latter can be replaced by those ofthe closest weight vector. The algorithm is shown in Fig.

23. Instead of using only the closest weight vector, all active neurons could be selected, and their weight vectors could be used for deriving estimates for the missing input vector components. One way to do this is by determining the average ofthe weight vectors, and then perform the substitution using this averaged vector. The neuron radii could even be invoked to determine the average. For example, if vector component v is missing, then the following can be done:

with S_a the set of the active neurons, and d the dimensionality ofthe input space. Still another way is to do the following:

with/?(w,) the density (estimate) at position w, in F-space (for the use of density estimates, see below).

13.2 Regression application

When the topographic map A is developed, thus, without using information about the desired outputs, the circular (spherical, hyperspherical) activation region __?, of each neuron can be supplemented with a kernel, e.g., a normalized Gaussian:

with p_s a smoothing factor. These kernels are then used as the neurons' output functions. The output ofthe regression model for a given input x can be written as follows:

N

0(x) = ∑W_{j gj}(x,p_s) (62)

7=1 in the univariate (scalar) case, and

N

O_i(x) = ∑W_iJ g_j (x,p_s), ι = l. (63)

7=1 in the multivariate (vectorial) case.

Hence, there are two types of unknown variables in the regression model: the weights W_j, V , or W_ip Vi^' , in the univariate or the multivariate case, respectively, and the smoothing factor p_s.

13.2.1. Training the regression application

Several types of algorithms are included within the scope ofthe present invention. In the folowing the simplest one is described, based on gradient descent, but the present invention is not limited thereto. A set of input/output pairs is considered, i.e., the training set: M = j(y ,x'") | μ - 1, ..., M j. Training the weights W- is performed with the delta rule:

AW_j = τp[γ^μ - W_jg_jix'.p^g_jix", _s), (64) in the case of univariate regression. The rule is readily extendible to the multivariate case by developing one series of weights for each output vector component. The following training set is considered: M. = ^y^μ,x^μ) | μ = 1, ..., Mj, with y^μ = [y^μ].

The delta rule now becomes: Wy = η(y^μ - W_ijgj(x^μ,p_s))_gj(x ,p_s), (65)

Projection pursuit regression

Another embodiment ofthe present invention applies projection pursuit regression (PPR) (Friedman and Stuetzle, 1981) to kernel-based regression with topographic maps. In PPR, the <i-dimensional data points are inteφreted through optimally-chosen lower dimensional projections; the "pursuit" part refers to an optimization with respect to these projection directions. For simplicity, the case will be considered where the function/is approximated by a sum of scalar regression models

O_k:

0(x^μ) ≡ y^μ = ∑O_k (*_kx^μT), (66) yt=l with a._k the projection directions (unit vectors) and where T stands for transpose. Each output O_k refers a regression model ofthe type given in eq. (63). The parameters of O_k and projections a_t are estimated sequentially in order to minimize the mean squared error (MSE) ofthe residuals:

and where the term between the curly brackets denotes the i residual ofthe μth data point, r_k , with r = y^μ . In order to further optimize the projection directions, backfitting can be applied: C(Λ_K) is cyclically minimized for the residuals of projection direction k until there is little or no change.

Determining the smoothing factor Training for the smoothing factor p_s can also be done with the delta rule, however, better results (i.e., more reliable than would be achieved by learning) can be obtained in the following manner. The following quantity is defined as the training error:

1 ^M II II f" " with O = [O,], with O, as in eq. (63). In a modification of this method a test set rather than a training set is used, and the test error is determined rather than the training error.

The decision depends on how many training samples are being used, and how many are available for testing.

TE is plotted as a function of p_s< and the p_s-vahιe that minimizes TE is sought. A clear parabolic (TE, p_s) plot is expected. Only three points are then necessary to locate the minimum, theoretically, but the effect of numerical errors must be considered.

For example, the following strategy can be adopted for determining three TEs

(assuming that p_r > 1 in kMER so that the activation regions overlap): 1. starting with p_s = 1 , the regression model is trained and the first training error, TE¹, is determined

2. p_s is increased, e.g., to 1.1, the regression model is trained again and the second training error, TE², is determined

3. p_s is decreased, e.g., to 0.9, the regression model is trained again and the third training error, TE³, is determined.

From the relative magnitudes of TE¹, TE², TE³, the location can be estimated, or the direction where the minimum should be sought determined.

13.2.2. Summary The characteristics ofthe present regression embodiment can be summarized as follows: 1. the data model can be developed separately from the regression model: it is simply an add-on of the data model;

2. training the regression model will be fast since only one stage ("layer") of connection weights need to be trained; 3. the regression procedure adapts itself to the data (self-organization): since with kMER the weight density will be proportional to the input density, the regression model will be optimal in the sense that it will be locally more detailed when there are locally more input data available (higher density in x-space), and vice-versa; 4. by virtue ofthe delta rule and the smoothness procedure, regression modeling is easily implementable.

14. Horizontal modularity embodiment

An important feature ofthe kernel-based approach is that the regression model does not extrapolate beyond the range of its outermost kernel (which can be easily detected): a zero output will be obtained. This enables to develop a horizontally- modular approach to regression and, thus, to exploit parallelism. Basically, there are two ways to look at this type of modularity: the data can be partitioned into subsets of the input space data set or into subspace data sets (Fig. 25 A, B). Evidently, also mixtures ofthe two can be considered (Fig. 25C).

14.1. Subsets of data - single regression module

Assume m subsets of input/output pairs are available (see "Training subsets input vectors" and "Training subsets desired output scalars" in Fig. 26). Each subset of input vectors is used for developing a corresponding data model (data model 1, ..., m). The weights w, and radii σj available in the m data models are then used for introducing kernels g, in the regression module. The denominator in the normalized Gaussians definition refers to all kernels of all data models. Finally, the W and p_s are determined, across the m data models, according to the scalar regression task.

14.1.1. Vectorial regression task

The former can be extended to a vectorial regression task, and desired output vectors can be considered instead of scalars. d_y regression models are developed, one for each output vector component. This can be done in parallel. 14.2. Subsets of data - multiple regression modules

The previous regression modeling strategy can be extended by breaking up the single regression module into m + 1 regression modules at two levels (Fig. 27): at the first level, there are m regression modules, one for each subset of input/output pairs, and at the second level, there is one module that integrates the outputs ofthe first level modules in order to produce the final regression output O. The m level 1 regression modules are trained as before. For the level 2 module, the following weighted sum ofthe outputs is taken as the level 1 regression modules: m 0(x) = ∑WO_j O_j(x,p_s). (69)

7=1

Again, the WO_j are parameters that can be trained in the same way as explained before, or in a similar way. If the subsets are perfectly disjunct, then one can simply take the sum of all O

Bayesian approach Another embodiment that does not require additional training will now be introduced: it will take advantage ofthe fact that the regression surface developed in each module will be more detailed when more samples are available. Hence, if there is disposed of a density estimate in each module, then there will be more certainty ofthe regression output produced by a given module if the corresponding input also corresponds to a high local density.

Rather than having a common second level regression module as in Fig. 8, which requires global learning, a common decision module is introduced, which does not require additional training. It goes as follows. Let p_j(x \j) be the (conditional or component) density estimate of they^'th module for input x. The regression output produced by the module, i*, for which the product ofthe conditional density and the prior probability is the largest is taken: i* = arg min (p (x| /) ( )), (70)

7 with P(j) the (prior) probability of they^'th module. Unless disposing of prior knowledge, all P(J)s can be assumed to be equal to Urn, hence: i* = arg min p _} (x | j) . (71) For both cases, the regression output O becomes:

0(x) = O_t,(x). (72)

In other words, the regression output is taken for which the largest number of local input/output pairs were available for training. This selection criterion is reminiscent of a Bayesian pattern classification procedure (whence the procedure's name). The selection itself occurs in the decision module. An estimate ofthe probability density of each module is obtained from the data modules as explained above with respect to the topographic map embodiments.

An alternative is to inteφolate between the regression outputs produced by all modules according to their density estimates. The regression application's output O can then be written as follows:

Bayesian approach - Risk minimization

Maximizing the product ofthe conditional density and the prior probability might not be appropriate for some applications. For example, when the consequence of erroneously taking the output of module i is much higher than that of module j. In order to take such effects into account, one can define a loss matrix L = [ ,₇] of which each element £,₇ is the loss incurred by erroneously taking module j when actually module i should have been taken. The risk is then minimized when:

∑L_ki.p(x \ i*)P(i*) < ∑L_kjp(x \ j)P(j), V/ ≠ i *. (74)

Ar=l 4=1

The L_yS can be based on experience or expressed in a more quantitative manner, for example, in monetary terms.

Advantages

The advantages ofthe present embodiment can be summarized as follows: 1. No additional training is required, a feature which is particularly important when dealing with regression modules that are developed on separate computers (data servers): in this way, no global communication between the data servers is needed in order to build a distributed regression application. In the more traditional procedure, a global communication of the training samples would be required. 2. A new set of data that one wishes to consider in the regression application can be included without retraining the whole application: the new data set can be regarded as a subset for which a separate data module and regression module are developed. This can be done in parallel, thus, while the original regression application is in use. 3. Each regression module can be supplemented with a risk factor so that risk minimization can be performed when considering the outputs ofthe regression modules.

14.2.1. Vectorial regression task The previous embodiment can be extended to a vectorial regression application, thus with desired c^-dimensional output vectors, rather than scalars. d_y regression models are developed in parallel, one for each output vector component. All training processes can run in parallel.

14.3. Subspaces - multiple regression modules 14.3.1. Scalar regression task

The complementary problem is now considered: the data set is partitioned along m disjunct subspaces. There is first focused on the input vectors. The input space ofthe data set is partitioned along m subspaces of which the dimensionalities are d_t, 0 < d_t ≤ d, ∑™ di = d . For example, if input sample x = (x_{, ..., x_d) is considered, then data module 1 is developed using the subspace vector x_λ, ..., x_d ) , data module 2 using (x_{d +1} , ..., x_d ) , and finally, data module m using (x_d _ ₊₁ , ..., x_d ) .

In the event that the input data for these subspaces are used for developing the data modules on separate data servers, then this modeling paradigm cannot be addressed without relying on a global communication. The present embodiment addresses how much communication is needed, and what is the nature ofthe information that is communicated.

Since the data modules are developed separately from the regression application, and since kMER is used for it, an elegant solution can be found that minimizes the need for global communication. This results in a new procedure, compatible with the one introduced in the previous subsection. For the sake of exposition, it is assumed that the subspace data sets are stored onto separate data servers. However, it is clear that they can be grouped onto one or more servers. The overall scheme is shown in Fig. 29.

For each data server, a data module is developed with kMER using the subspace vectors that are locally available. It is assumed that the desired output values (scalars) are available on each data server, m (level 1) regression modules are then locally developed, as explained in section 13.2.1, thus, using the subspace vectors as input vectors. The outputs of the m regression modules still have to be integrated into one global regression result. This is done by applying backfitting (see section 13.2.1).

In order to generate an initial regression estimate, from which backfitting can start, the outputs produced by each ofthe m regression modules are averaged:

0(x^μ) y^μ = ∑ -O_k (a_kx^μT) , (75)

with a_k a vector, the components of which are 1 when the corresponding input dimension is coded for by the kth regression model, else it is 0. Note the resemblance with projection pursuit regression (PPR), except that here the a_t are not subject to optimization.

Next, the following algorithm is considered that is applied to a training set that is allowed to be different from the one used for training the m regression modules.

Backfitting intialization For each module j, the module's regression output O,(a_y x^) is determined for each input vector ofthe training set. These outputs (i.e., scalars) are then communicated to all other data servers. Hence, in this way, each data server disposes ofthe regression outputs produced by all regression modules for all the input vectors used for training. A module index, i, is intruduced and is put i <— 1. Let she a preset level of training error. The total regression error (TRE) is defined as follows:

1 . M_ ____ ( I ___ m____ ι 1 _rmT (76)

M μ=\ k=\ m which is preferably determined for a separate test set, rather than the training set. We communicate the TRE to regression module 1.

do until TRE < ε or until maximum number of iterations is reached for (i <= 1; i < m; i <= i + 1) packfitting run i

Module 'S regression parameters are adapted so as to reduce the total regression error (gradient descent step, in batch mode); the parameters of all regression modules are kept constant. The regression outputs of module i are determined for each subspace input vector ofthe training set and communicated to the next module on which a learning step is going to be performed, module i + 1. We increment The module index is incremented, i - i + 1, and another backfitting run is performed.

In summary, a continued cyclical update ofthe modules is carried out until the total regression error is small enough or a more elaborate stop criterion has been satisfied

(e.g., either TRE < £"or the maximum number of iterations has been reached). It should be noted that only a restricted amount of information is exchanged between the data servers, and only when needed during learning. Furthermore, only model outputs are exchanged and not input data, which would require more communication effort and could, e.g., raise concerns about ownership and confidentiality ofthe data, etc.

Finally, when backfitting has stopped, the regression modules can be used as follows. First, the subspace input vectors are applied to each regression module, then the corresponding regression module outputs O_k are determined and communicated to the level 2 regression module which, in turn, calculates the final regression result: m I

0(x^μ) y^μ = ∑ O_k (z_kx^μT) . (77) k=\ ^m

14.3.2. Vectorial regression task

The previous embodiment can be easily extended to vectorial regression, thus with ^-dimensional desired output vectors, rather than scalars: d_y regression models are developed, one for each output vector component. All training processes can run in parallel.

14.4. Subsets of subspaces - mixed case The case where the data modules are trained on subsets that consist of mixtures of input space vectors as well as subspace vectors is now considered, thus with missing input vector components (see Fig. 25C). There are two possibilities. If there are plenty of input space vectors available for developing the data modules and/or the subspace vectors only have a limited number of missing vector components, then the presence ofthe missing vector components can be ignored and the strategy mentioned under section 14.2 can be applied: m regression models are developed using the input space vectors as well as the subspace input vectors (Fig. 30). For the subspace input vectors the data completion procedure explained in section 13.1.2 is used.

In the opposite case, if there are plenty of subspace vectors available, then the strategy mentioned under section 14.3.1 should be considered, and the data modules should be developed for each subspace separately, thereby ignoring the additional vector components that some ofthe data vectors might have (Fig. 31).

The first strategy is clearly an heuristic one: its success will critically depend on the ratio between the number of complete and incomplete input vectors, and the number of missing input vector components. The second strategy is in principle correct but it requires much more data communication than in the previous case.

Finally, there is also the general case where the data vector dimensionalities vary in the training subsets. The subspace dimensionality that is used for each data module might correspond to the set of vector components that are in common to the training subset or, if this is too extreme, a common set of vector components can be chosen or determined, and data completion can be performed on the missing vector components and the ones that do not belong to this common set can be simply ignored. Evidently, such an approach is also an heuristic one.

14.4.1. Vectorial regression task Again, vectorial regression application can be performed by developing d_y regression models in parallel.

15. Utility

The systems and methods described above may be used to model a physical system using a topographic map. A physical parameter ofthe modeled system may be changed, optimized, or controlled in accordance with an estimated data density determined in accordance with the present invention. 16. Definitions

Equiprobabilistic: All units (neurons) in the lattice (neural network) have an equal probability to be active. All representational units should participate with the same probability in the representation. All kernels contribute equally in the density estimate provided by the topographic map (prior class probabilities are equal, if kernels would represent a different class).

Entropy: entropy of a variable is the average amount of information obtained by observing the values adopted by that variable. It may also be called the uncertainty of the variable.

Kernel-based: a particular type of receptive field - it has the shape of a local function, e.g. a function that adopts its maximal value at a certain point in the space in which it is developed, and gradually decreases with distance away from the maximum point.

Topographic map: a mapping between one space and another in which neighboring positions in the former space are mapped onto neighboring positions in the latter space. Also called topology-preserving or neighborhood-preserving mapping. If there is a mismatch in the dimensionality between the two spaces there is no exact definition of topology preservation and the definition is then restricted to the case where neighboring positions in the latter space code for neighboring positions in the former space (but not necessarily vice versa).

Receptive field: the area or region or domain in the input space within which a neuron (generally synonymous with synaptic element of a neural network) can be stimulated.

Self-organizing: refers to the genesis of globally ordered data structures out of local interactions.

Non-parametric density estimation: no prior knowledge is assumed about the nature or shape ofthe input data density. Non-parametric regression: no prior knowledge is assumed about the nature or shape ofthe function to be regressed.

Claims

1. A computer implemented method of extracting hidden relationships within an input data signal, comprising the steps of: developing from the input data signal an unsupervised, self-organizing, kernel- based, topographic map by maximizing the entropy ofthe map's output, the topographic map being executed by a neural network; during the development step receptive fields ofthe topographic map are truncated to receptive field regions, the receptive field regions being independently scaleable in the space ofthe input data signal; and estimating a data density using at least one kernel.

2. The computer implemented method according to claim 1, wherein the map is equiprobabilistic.

3. The computer implemented method according to any previous claim, wherein at least some ofthe receptive field regions overlap.

4. The computer implemented method according to claim 3, wherein the receptive fields are defined by the same kernel function.

5. The computer implemented method according to claim 4, wherein the receptive field regions are hyper-spheroidal.

6. The computer implemented method according to any of claims 3 to 5, in which at least some ofthe receptive fields regions are overlapping and the overlap is used to determine a degree of topology preservation ofthe topographic map.

7. The computer implemented method according to any previous claim, wherein a degree of topology preservation is used to determine when the topographic map has to be re-trained because of a change in input data distribution.

8. The computer implemented method according to any previous claim, wherein a degree of topology preservation is used to determine when training ofthe topographic map using the input data signal is stopped.

9. The computer implemented method according to any previous claim, further comprising the step of removing the truncation ofthe receptive field regions after training of the topographic map, each receptive field then being defined by a kernel.

10. The computer implemented method according to any previous claim, in which the data density estimation step comprises using a representation ofthe input data which approximates the input data density distribution.

11. The computer implemented method according to claim 10, wherein the representation is a principle manifold having less dimensions than the input data signal.

12. The computer implemented method according to claim 10 or 11, wherein the principle manifold is the topographic map.

13. The computer implemented method according to any previous claim, wherein the developing step comprises adapting the center location and the extent of each receptive field region to the input data density ofthe part ofthe input data signal data which activates the kernel thereof.

14. The computer implemented method according to any previous claim, in which a set of kernels of the topographic map is activated, wherein an activation distribution ofthe topographic map is determined in accordance with a membership function defined with respect to the set of activated kernels.

15. A computer implemented method of extracting hidden relationships within an input data signal, comprising the steps of: developing from the input data signal an unsupervised, self-organizing, kernel- based, topographic map executed by a neural network; during the development step receptive fields ofthe topographic map are truncated to receptive field regions, the receptive field regions being independently scaleable in the space ofthe input data signal and at least some ofthe receptive field regions being overlapped; and estimating a data density using at least one kernel

16. The computer implemented method according to claim 15, in which the overlap of the receptive field regions is used to determine a degree of topology preservation of the topographic map.

17. The computer implemented method according to claim 15 or 16, wherein a degree of topology preservation is used to determine when the topographic map has to be retrained because of a change in input data distribution.

18. The computer implemented method according to any of claims 15 to 17, wherein a degree of topology preservation is used to determine when training of the topographic map using the input data signal is stopped.

19. The computer implemented method according to any ofthe claims 15 to 18, in which the data estimation step comprises using a principle manifold which approximates the input data density distribution.

20. The computer implemented method according to claim 19, wherein the principle manifold is the topographic map.

21. The computer implemented method according to any of claims 15 to 20, wherein the developing step comprises the center location and extent of each receptive field region to the input data density ofthe part ofthe input data signal data which activates the kernel thereof.

22. The computer implemented method according to any ofthe claims 15 to 23, in which a set of kernels ofthe map is activated, wherein an activation distribution ofthe topographic map is determined in accordance with a membership function defined with respect to the set of activated kernels.

23. A computer implemented method of extracting hidden relationships within an input data signal of at least two dimensions, comprising the steps of: developing from the input data signal an unsupervised, self-organizing, kernel- based, topographic map executed by a neural network; and estimating a data density using at least one kernel, the estimation step comprising: using a principle manifold which approximates the input data density distribution.

23. The computer implemented method according to claim 22, wherein, during the development step, receptive fields ofthe topographic map are truncated to receptive field regions and at least some ofthe receptive field regions overlap.

24. The computer implemented method according to claim 23, wherein the overlap is used to determine a degree of topology preservation.

25. The computer implemented method according to any of claims 22 to 24, wherein a degree of topology preservation is used to determine when the topographic map has to be re-trained because of a change in input data distribution.

26. The computer implemented method according to any of claims 22 to 24, wherein a degree of topology preservation is used to determine when training ofthe topographic map using the input data signal is stopped.

27. The computer implemented method according to any of claims 22 to 26, wherein the principle manifold is the topographic map.

28. The computer implemented method according to any ofthe claims 23 to 27, wherein the developing step comprises adapting the center location and extent of each receptive field region to the input data density ofthe part ofthe input data signal data which activates the kernel thereof.

29. The computer implemented method according to any ofthe claims 22 to 28, in which a set of kernels of the map is activated, wherein an activation distribution of the topographic map is determined in accordance with a membership function defined with respect to the set of activated kernels.

30. A computer implemented method of extracting hidden relationships within an input data signal, comprising the steps of: developing from the input data signal an unsupervised, self-organizing, kernel- based, topographic map executed by a neural network, the developing step comprising: truncating receptive fields ofthe topographic map to receptive field regions; adapting the center location and extent of each receptive field region to the input data density ofthe part ofthe input data signal which activates the kernel thereof; and estimating a data density using at least one kernel.

31. The computer implemented method according to claim 30, wherein at least some ofthe receptive field regions overlap.

32. The computer implemented method according to claim 31, wherein the overlap is used to determine a degree of topology preservation of the topographic map.

33. The computer implemented method according to any of claims 30 to 32, wherein a degree of topology preservation is used to determine when the topographic map has to be re-trained because of a change in input data distribution.

34. The computer implemented method according to any of claims 30 to 32, wherein a degree of topology preservation is used to determine when training ofthe topographic map using the input data signal is stopped.

35. The computer implemented method according to any ofthe claims 30 to 34, in which the estimation step comprises using a principle manifold which approximates the input data density distribution.

36. The computer implemented method according to claim 35, wherein the principle manifold is the topographic map.

37. The computer implemented method according to any ofthe claims 30 to 36, in which a set of kernels ofthe map is activated, wherein an activation distribution ofthe topographic map is determined in accordance with a membership function defined with respect to the set of activated kernels.

38. A computer implemented method of extracting hidden relationships within an input data signal, comprising the steps of: developing from the input data signal an unsupervised, self-organizing, kernel- based, topographic map executed by a neural network in which a set of kernels is activated; determining an activation distribution ofthe topographic map in accordance with a membership function defined with respect to the set of activated kernels thereof; and estimating a data density using at least one kernel.

39. The computer implemented method according to claim 38, wherein during the development step receptive fields ofthe topographic map are truncated to receptive field regions and at least some ofthe receptive field regions overlap.

40. The computer implemented method according to claim 39 wherein the overlap is used to determine a degree of topology preservation.

41. The computer implemented method according to any of claims 38 to 40, wherein a degree of topology preservation is used to determine when the topographic map has to be re-trained because of a change in input data distribution.

42. The computer implemented method according to any of claims 38 to 40, wherein a degree of topology preservation is used to determine when training ofthe topographic map using the input data signal is stopped.

43. The computer implemented method according to any ofthe claims 38 to 42, in which the estimation step comprises using a principle manifold which approximates the input data density distribution.

44. The computer implemented method according to claim 43, wherein the principle manifold is the topographic map.

45. The computer implemented method according to any of the claims 39 to 44, wherein the development step comprises adapting the center location and extent of each receptive field region to the input data density ofthe part ofthe input data signal data which activates the kernel thereof.

46. The computer implemented method according to any previous claim, in which receptive fields ofthe topographic map are deterministic.

47. The computer implemented method according to any previous claim, wherein the data density estimation is non-parametric.

48. The computer implemented method according to any previous claim, further comprising the step of smoothing the topographic map.

49. The computer implemented method according to any previous claim, comprising the step of clustering.

50. The computer implemented method according to claim 49 wherein the clustering step, is carried out by hill-climbing.

51. The computer implemented method according to claims 49 or 50, comprising the step of unsupervised classification.

52. The computer implemented method according to any ofthe claims 49 to 51, comprising the step of further classifying the clusters into smaller clusters to form a tree structure.

53. The computer implemented method according to any previous claim, further comprising the step of execution of a physical process step external to the computer in response to the estimated data density.

54. A system for extracting hidden relationships within an input data signal, comprising: a processing engine for developing from the input data signal an unsupervised, self-organizing kernel-based, equiprobabilistic topographic map executed by a neural network, receptive fields ofthe topographic map being truncated to receptive field regions during the development step and the receptive field regions being independently scaleable in the space ofthe input signal; and a first processing tool for estimating the data density of at least one kernel.

55. A system for extracting hidden relationships within an input data signal, comprising: a processing engine for developing from the input data signal an unsupervised, self-organizing kernel-based, topographic map executed by a neural network in which, during the development step receptive fields ofthe topographic map are truncated to receptive field regions, the receptive field regions being independently scaleable in the space of the input data signal and at least some ofthe receptive field regions being overlapped; and a first processing tool for estimating the data density of at least one kernel.

56. A system for extracting hidden relationships within an input data signal, comprising: a processing engine for developing from the input data signal an unsupervised, self-organizing kernel-based, topographic map executed by a neural network; a first processing tool to generate a principle manifold which approximates the input data density distribution; and a second processing tool for estimating the data density of at least one kernel using the principle manifold.

57. A system for extracting hidden relationships within an input data signal, comprising: a processing engine for developing from the input data signal an unsupervised, self-organizing kernel-based, topographic map executed by a neural network, receptive field ofthe topographic map being truncated to receptive filed regions; a first processing tool for adapting the center location of each receptive field region to the input data density ofthe part ofthe input data signal which activates the kernel thereof; and a second processing tool for estimating the data density ofthe kernel.

58. A system for extracting hidden relationships within an input data signal, comprising: a processing engine for developing from the input data signal an unsupervised, self-organizing kernel-based, topographic map executed by a neural network in which a set of kernels is activated; a first processing tool for determining an activation distribution ofthe topographic map in accordance with a membership function defined with respect to a set of activated kernels; and a second processing tool for estimating the data density using the kernels.

59. A hierarchically organized data processing system having at least a first and a second lower level and implementing topographic maps executed by neural networks, each ofthe first and second levels comprising at least one topographic map having at least two neurons, and a topographic map ofthe second level overlapping with a topographic map ofthe first level.

60. The hierarchically organized data processing system according to claim 59, further comprising distributed data processing nodes and the topographic maps are distributed among the data processing nodes.

61. The hierarchically organized data processing system according to claim 59 or 60, wherein the topographic maps are those described in any of claims 1 to 53.