EP2396752A2

EP2396752A2 - System and method for computer-based analysis of large amounts of data

Info

Publication number: EP2396752A2
Application number: EP09827247A
Authority: EP
Inventors: Ansgar Dorneich
Original assignee: Optimining GmbH
Current assignee: Optimining GmbH
Priority date: 2008-11-19
Filing date: 2009-11-19
Publication date: 2011-12-21
Also published as: WO2010057642A1; DE102008058016A1; WO2010058299A3; WO2010058299A2

Abstract

For a computer system used for data analysis, the training time is to be significantly reduced through technical means; also, the storage space required is to be noticeably reduced through the use of technical measures. To this end, an electronic data processing system for analyzing data is proposed, comprising at least one analysis computer, wherein the analysis computer is adapted and programmed to implement a self-adapting neural network that is subjected to training by a plurality of data sets with many features, wherein the neurons of the neural net are assigned initial neuron weights, the neurons of the neural net are assigned neuron weights that are extracted from said plurality of data sets with said many features, a training involves a plurality of training phases, and wherein each training phase comprises a certain number of training cycles, wherein at the beginning of each training phase, either neurons whose neuron weights are made up of weights of existing neurons, at least partially, are added into the neural network, or neurons are removed from the neural net and the neuron weights of the remaining neurons are weighted with portions of the weights of the removed neurons, at least partially.

Description

System and method for computer-based analysis of large data sets


Background

Currently available, inexpensive computer programs for data analysis (e.g. DataCockpit® 1.04) are significantly slower in the analysis than competing data mining workbenches (SPSS and others), can process only significantly smaller amounts of data, and have other disadvantages (they are programmed as a monolithic block, and their architecture and data handling are unsuitable for a client-server architecture, etc.).

An established method for the segmentation of data (clustering) as well as for prediction is the method of self-organizing feature maps, SOM ('self-organizing maps'). For segmentation or prediction, this method maps the data onto a one-, two-, or higher-dimensional self-adapting neuron network. [T. Kohonen. Self-Organization and Associative Memory, vol. 8 of Springer Series in Information Science, 3rd edition, Springer-Verlag, Berlin, 1989].

In SOM-based data analysis, a distinction is made between so-called 'Kohonen clustering' and so-called SOM map analysis. Kohonen clustering works with very few neurons, typically between about 4 and about 20 neurons. Each of these neurons represents a 'cluster', i.e. a homogeneous group of data sets. This technique is used primarily for data segmentation and is implemented in many data mining software packages, such as SPSS Clementine or IBM DB2 Warehouse (see, for example, Ch. Ballard et al., Dynamic Warehousing: Data Mining Made Easy, IBM Redbook, 2007).

In contrast, SOM map analysis uses relatively large neuron networks of, for example, 30 × 40 neurons for data analysis. Homogeneous data segments are represented by local groups of neurons with similar characteristics. SOM maps are used for data exploration, segmentation, prediction, simulation, and optimization (see, for example, R. Otte, V. Otte, V. Kaiser, Data Mining for Industrial Practice, Hanser Verlag, Munich, 2004).

EP 97 11 56 54.2 and EP 97 12 0787.3 may be mentioned as examples of further technological background. To analyze a large collection of data compiled on a computer (for example, production data from a production plant with approximately 10⁴ to 10¹⁰ data sets and approximately 3 to 1000 features per data set) and, if necessary, to feed the results of the analysis back into the production process, the existing data sets are repeatedly presented to a learning, self-adapting neuron network.

This may be production data in the engineering, chemical, automotive, and supplier industries: for example, 10 million units produced, 10 nominal items of component and production line information, 10 binary items of component and equipment information, and 10 numerical production values (measured tolerance data, sensor data, recorded production times, machine data, ...). The aim of the SOM analysis here is quality assurance, error source analysis, early warning, and production process optimization.

Another example would be customer data in retail, financial or insurance companies: 10 million customers, 10 nominal demographic features (marital status, occupational group, region, type of dwelling, ...), 10 binary features covering interests and services/products used (sex; owns credit card, uses online banking, ...), and 10 numerical features (annual income, age, annual turnover, creditworthiness, ...). The aim of the SOM analysis here is customer segmentation, the prediction of customer value, creditworthiness, risk of damage, ..., as well as the optimization of marketing campaigns.

Each neuron of the self-adapting neuron network has as many signal inputs as each of the individual data sets has characteristics. If the neuron network has learned the data, the following tasks can be performed with the trained neuron network, among others:

• Visual interactive data exploration: interactive discovery of interesting subgroups, correlations between features, and general relationships using various visualizations of the data generated from self-organizing feature maps.

• Segmentation: division of all data into homogeneous groups.

• Prediction: prediction of previously unknown feature values in individual data sets.

• Simulation: how would certain feature values of a data set probably change if certain other feature values were changed in a targeted way?

• Optimization: if certain optimal values are to be achieved for a subset of the features, how should the other feature values be chosen?

Existing methods and implementations of SOM-map-based data analysis currently require training times of the neuron networks that are too long for commercial use. These training times exceed those of other data mining techniques on the same data by a factor of about one hundred and, with the computing power currently available, prevent such software packages from being applied to many existing data collections and problems.

For example, to train a SOM network of 30 × 40 neurons with the DataCockpit® software on a large database of 60,000,000 data sets with 100 features, a server with one to two 3 GHz Intel® CPUs and 64 gigabytes of RAM would need to compute continuously for about 2 to 3 months. This would be completely unacceptable in practice.

Technical problem

Thus, there is the technical requirement to significantly reduce this training time by means of technical precautions in order to allow the evaluation of large amounts of data in a short time. For example, the calculation time for the above example should be reduced to about 100 hours or less.

Summary

To solve the problem, an electronic data processing system for analyzing data with at least one analysis computer is proposed, wherein the analysis computer is set up and programmed to implement a self-adapting neuron network. The neuron network is subjected to training with a large number of data sets with many features, in which the neurons of the neuron network are assigned neuron weights derived from the multiplicity of data sets with their many features. A training can comprise several training phases, each training phase having a certain number of training runs. At the beginning of each training phase, either neurons are inserted into the neuron network whose neuron weights result at least in part from weights of existing neurons, or neurons are removed from the neuron network and the neuron weights of the remaining neurons are at least partially weighted with parts of the weights of the removed neurons.

Furthermore, a method for training a neuron network is proposed, which comprises the following steps:

• Saving the number of features (columns) in the training data in a first value.

• Performing the following steps for all neuron network size changes:

  o saving the number of neurons in the network in a second value,

  o storing initial neuron weights in a two-dimensional first field, a first dimension of the field being determined by the first value and a second dimension of the field by the second value,

  o performing the following steps for all iteration steps:

    ■ performing the following steps for all training data sets or a subset of the training data sets:

      • reserving a second field for storing distances between the current training data set and all neurons,

      • setting all values in this second field to a uniform initial value,

      • setting a value for a minimum distance to a predetermined value which is chosen to be greater than all actual distances between the current training data set and each neuron of the neuron network,

      • performing the following steps for all features that have a valid feature value in the current data set:

        o performing the following step for all neurons:

          ■ adding the distance value between the neuron weight stored at a location of the first field determined by the first value and the second value, and the valid feature value, to a value at a location of the second field determined by the second value,

      • performing the following steps for all neurons for which the value stored at the location of the second field determined by the second value is less than the minimum distance value:

        o setting the minimum distance to the value stored at the location of the second field determined by the second value,

        o setting the current neuron as the best neuron,

      • performing the following steps for all features m that have a valid value in the current training data set:

        o shifting the neuron weights of the best neuron stored in the first field, which correspond to features that have valid values in the current training data set, in the direction of the corresponding valid feature values of the current training data set, and

        o shifting the neuron weights of certain neighbor neurons of the best neuron stored in the first field, which correspond to features that have valid values in the current training data set, in the direction of the corresponding valid feature values of the current training data set.
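The steps for a single training data set can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of this document: the function name trainOnRecord, the use of NaN to mark an invalid feature value, the one-dimensional neighborhood of radius 1, and the halved learning rate for neighbors are all chosen for the example only. The first field is laid out feature by feature (index = feature * nbNeurons + neuron), so that the inner distance loop runs over a gapless memory sequence.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch of one SOM training step for a single data record.
// weights: "first field", feature-major layout: weights[f * nbNeurons + n].
// record:  one value per feature; NaN marks an invalid (missing) value.
// Returns the index of the best-matching neuron.
std::size_t trainOnRecord(std::vector<double>& weights,
                          std::size_t nbNeurons,
                          const std::vector<double>& record,
                          double learningRate)
{
    const std::size_t nbFeatures = record.size();
    assert(nbNeurons > 0);
    assert(weights.size() == nbFeatures * nbNeurons);

    // "second field": squared distance per neuron, uniform initial value 0
    std::vector<double> dist(nbNeurons, 0.0);

    // accumulate squared distances feature by feature (contiguous inner loop)
    for (std::size_t f = 0; f < nbFeatures; ++f) {
        if (std::isnan(record[f])) continue;      // skip invalid values
        const double* w = &weights[f * nbNeurons];
        for (std::size_t n = 0; n < nbNeurons; ++n) {
            const double d = w[n] - record[f];
            dist[n] += d * d;
        }
    }

    // find the best neuron; minDist starts above every actual distance
    double minDist = std::numeric_limits<double>::max();
    std::size_t best = 0;
    for (std::size_t n = 0; n < nbNeurons; ++n) {
        if (dist[n] < minDist) { minDist = dist[n]; best = n; }
    }

    // shift the best neuron at the full rate and its direct 1-D neighbors
    // at half the rate (the neighborhood rule is an assumption)
    for (std::size_t f = 0; f < nbFeatures; ++f) {
        if (std::isnan(record[f])) continue;
        double* w = &weights[f * nbNeurons];
        w[best] += learningRate * (record[f] - w[best]);
        if (best > 0)
            w[best - 1] += 0.5 * learningRate * (record[f] - w[best - 1]);
        if (best + 1 < nbNeurons)
            w[best + 1] += 0.5 * learningRate * (record[f] - w[best + 1]);
    }
    return best;
}
```

Because invalid features are skipped in both the distance loop and the update loop, a data set with missing values leaves the corresponding neuron weights untouched.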

The first loop over all features can be replaced by multiple loops.

One of the loops may iterate over numeric features, one over binary features, and/or one over textual features.

The first field may be laid out such that it consists of a number of gapless sequences given by the first value, each sequence containing a number of numerical field cells given by the second value.

The distances between the neuron weights and the feature values of the current training data set may be quadratic distances.

The training data may be compressed and indexed prior to the beginning of the procedure, textual values being replaced by integer value indices and/or floating point values being discretized into discrete intervals.

When the neuron network is enlarged (expansion step), the weights of the newly inserted neurons can be determined by linear, cubic or other interpolation if they are internal neurons, and/or by extrapolation if they are edge neurons. When the neuron network is reduced (reduction step), each remaining neuron can replace several adjacent existing neurons and inherit, in each of its neuron weights, the average of the corresponding neuron weights of the replaced neurons.

The neuron network size can be changed at the beginning of each training phase: it can either be increased by inserting neurons into the neuron network whose neuron weights result at least in part from weights of existing neurons (expansion step), or be reduced by removing neurons from the neuron network (reduction step), wherein on removal the neuron weights of the remaining neurons are at least partially weighted with parts of the weights of the removed neurons.

All distances between the predetermined number of neurons and the current training data set can be quadratic distances or the distances can each have a distance measure that has the properties of a metric.

For each training phase, at least a selection of the data sets may be used to adapt the neuron weights of the neurons of the neuron network. A different number of training runs may be chosen for each training phase, depending on the current size of the neuron network and on the features. The training runs are performed until the maximum predetermined number of training runs has been reached or until the training has converged, i.e. the feature weights of the neurons no longer change significantly.

It can further be provided that, between two training phases in which neurons are inserted into the network, at least one training phase is executed in which neurons are removed from the network. This procedure leads to a very fast convergence of the values, and thus to the end of the training.

In a further embodiment, the removal of a neuron may be such that only the remaining neurons immediately adjacent to the neuron to be removed are re-weighted, or such that the remaining neurons are re-weighted by linear, cubic or exponential spline interpolation or according to another interpolation rule involving several neighbor neurons. The neurons of the neuron network can be arranged as nodes of a multi-dimensional, for example two-dimensional, matrix. In such a case, when neurons are removed from or inserted into the neuron network, whole rows or columns can be removed from or inserted into the matrix.

The weights of all neurons for a particular feature may be structured such that they are to be stored in a contiguous memory area of an analysis computer.

The initial neuron weights of the neurons of the neuron network may be determined by a heuristic procedure. The method may be such that the features need to be read only once prior to the start of the training, and only once to be transformed to numeric features. The features may be stored compressed prior to training as training data.

An analysis computer can create an initial configuration of the neuron network and training parameters and send the initial configuration and the training parameters to at least one other analysis computer. The initial configuration of the neuron network and the training parameters can be read in by the at least one further analysis computer.

The analysis computer can send the neuron weights and/or a learning rate and/or a radius and/or the number of iteration steps to at least one further analysis computer for all training phases and/or for all training runs. The at least one further analysis computer can read in the information sent by the analysis computer and, for the number of training runs read in, calculate the distances to a multiplicity of neurons in each case, determine a winning neuron, save the winning neuron in a list in each case, and send the list to the analysis computer after the number of iteration steps. The analysis computer can then receive the list of winning neurons from the at least one further analysis computer and, based thereon, modify the weights of the winning neurons and their neighbors.

The training data can be divided by the analysis computer into data objects, and the data objects can be sent to at least one further analysis computer, wherein the data objects can be dimensioned prior to sending so that they fit completely into the memory of the at least one further analysis computer. In order to determine the distances between the neurons and the current training data set, the method can be designed in such a way that only gap-free sequences of memory fields are accessed in its most computationally intensive part.

To determine the distances between the neurons and the current training data set, long loops over a maximum of 2 memory field variables can be used.

For each nominal feature in the training data, the nominal values that occur may be stored in a directory in which each feature value is assigned a preliminary index and which additionally counts the frequency of occurrence of each feature value, and each nominal value may be replaced by its preliminary index.

The created directory can be sorted by frequency of occurrence, a number of common values can be assigned to a new index, and the preliminary indexes can be replaced by the new indexes.
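The two-pass indexing of nominal values could be sketched as follows. The function name encodeNominal and the parameter maxValues are hypothetical, and mapping all rarer values to one shared index is an assumption of this sketch; the text only states that a number of frequent values receive new indexes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: replace nominal (textual) values by integer indexes.
// The maxValues most frequent values get the indexes 0..maxValues-1 in order
// of descending frequency; all rarer values share the index maxValues
// (this shared "other" index is an assumption, not taken from the text).
std::vector<std::size_t> encodeNominal(const std::vector<std::string>& column,
                                       std::size_t maxValues)
{
    typedef std::pair<std::size_t, std::size_t> Entry;

    // pass 1: directory value -> (preliminary index, frequency)
    std::map<std::string, Entry> directory;
    std::vector<std::size_t> result;
    result.reserve(column.size());
    for (const std::string& v : column) {
        auto it = directory.find(v);
        if (it == directory.end())
            it = directory.emplace(v, Entry(directory.size(), 0)).first;
        ++it->second.second;                  // count the occurrence
        result.push_back(it->second.first);   // store the preliminary index
    }

    // sort by descending frequency; ties keep first-seen order
    std::vector<Entry> byFreq;                // (frequency, preliminary index)
    for (const auto& e : directory)
        byFreq.push_back(Entry(e.second.second, e.second.first));
    std::sort(byFreq.begin(), byFreq.end(), [](const Entry& a, const Entry& b) {
        return a.first > b.first || (a.first == b.first && a.second < b.second);
    });

    // pass 2: replace the preliminary indexes by the new indexes
    std::vector<std::size_t> newIndex(byFreq.size(), maxValues);
    for (std::size_t rank = 0; rank < byFreq.size() && rank < maxValues; ++rank)
        newIndex[byFreq[rank].second] = rank;
    for (std::size_t& p : result) p = newIndex[p];
    return result;
}
```

For the column b, a, a, c, a, b with maxValues = 2, the value a (3 occurrences) receives index 0, b (2 occurrences) receives index 1, and c falls into the shared index 2.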

Brief description of the drawings

Fig. 1 shows an electronic data processing system for analyzing data.

Fig. 2 shows schematically an expansion step of the multi-grid method.

Fig. 3 shows schematically a reduction step of the multi-grid method.

Fig. 4 shows a first part of the technique underlying the method.

Fig. 5 shows a second part of the technique underlying the method.

Fig. 6 shows a variant of the technique from Fig. 1.

Detailed description

The proposed embodiment has the technical effect of increasing the efficiency and security of the data analysis. Another technical effect is to reduce the requirements for the required computer resources over the conventional approach. Finally, the data transfer rate and the subsequent data processing are positively influenced.

This enables efficient analyses and evaluations, e.g. neuron network analyses, on analysis servers with relatively little main memory (RAM). In contrast, the previous implementation of the DataCockpit software, for example, can only process data up to about 200 to 400 megabytes in size because of the way the software is designed. These are limitations that significantly handicap SOM-map-based data analysis built on this traditional or comparable software technology. Thus, technical measures are described that make the benefits of SOM-map-based data analysis available for larger data volumes on smaller machines.

Referring to Fig. 1, an electronic data processing system is used to analyze data. The electronic data processing system has an analysis server 10 and one or more on-site client computers 12. The analysis server is, for example, a PC with several 3 GHz Intel® CPUs and 64 gigabytes of RAM as main memory. On it, a self-adapting neuron network is implemented as a data object that is to be trained on a large database with a multiplicity of data sets with many features. The on-site client computer 12 is set up and programmed to subject data supplied to it to data preprocessing and/or data compression before the data is sent to the analysis server 10 via an electronic network 14, for example the Internet. The analysis server 10 is also set up and programmed to train the self-adapting neuron network with the received preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network, and then to perform an analysis to create a self-adapting neuron network model. The analysis server then causes the self-adapting neuron network model to be sent from the analysis server 10 back to the on-site client computer 12, also via the network 14. Finally, the on-site client computer 12 is set up and programmed to decompress the data of the self-adapting neuron network model.

Training the SOM networks assumes a heuristically-chosen starting state of the network, which is then iteratively improved until the learning process converges. With SOM networks, different network sizes are suitable for different types of questions.

Relatively small networks of 10 to 100 neurons are sufficient to work out the coarse structures and clusters in the data, and to first move the initial heuristic solution into those areas of the data space that are actually filled with data points. In a data space with, for example, 50 features, each with 4 possible values, there are already 4⁵⁰ ≈ 10³⁰ points which in principle can be occupied by data sets. But if there are only 10⁷ or fewer data sets, only one out of 10²³ possible points in the data space is actually occupied by a data set. Much of the learning effort is spent on shifting the weights of the neurons into the vicinity of the data-occupied regions of the data space. Large SOM networks are required only in order, for example, to reproduce subtle differences within large data clusters properly, or to represent rarely occurring feature values accurately.

This is where the multi-grid approach proposed here comes in. In this approach, the neurons are first moved into the interesting, data-occupied regions of the data space on a small network with comparatively little computational effort, and therefore in a short time and/or with low hardware resources. Only the so-called 'fine-tuning' is then carried out on the fine SOM network. The small network has a twofold speed advantage: first, the computation time per iteration is proportional to the number of neurons; second, convergence is faster with fewer neurons because more data sets are assigned to each neuron, so that each neuron receives more 'impulses' per iteration which change its properties (weights) in the desired direction.

In carrying out a SOM expansion step, it is possible to halve the 'mesh width' of the network. In this case, for example for a network organized as a two-dimensional matrix, another neuron is inserted centrally between each two neurons adjacent in the x direction. Subsequently, another neuron is inserted centrally into the SOM network between each two (old or newly inserted) neurons adjacent in the y direction. The weights of the newly inserted neurons can be chosen as an interpolation of the weights of the two neighboring neurons.

In the simplest case, this may be a linear interpolation in which each feature weight of the new neuron is the average of the corresponding feature weights of the two neighboring neurons. Instead of the linear interpolation, a spline interpolation (cubic or exponential splines) involving several neighboring neurons can be carried out.

In order to flank the boundary neurons of the existing network on both sides with newly inserted neurons, the weights of the new edge neurons can be calculated, for example, by means of linear extrapolation. Since the new edge neuron lies half a mesh width outside its nearest neighbor, the extrapolated weight of the new edge neuron := 3/2 · (weight of the nearest neighbor) − 1/2 · (weight of the next-nearest neighbor). This is shown in Figure 1, with the newly added neurons hatched. When extrapolating binary or nominal feature values, it is also ensured that the extrapolated values lie within the permitted value range from 0 to 1.
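For a single row of neurons and a single feature, the expansion step with linear interpolation and linear edge extrapolation might be sketched as follows. The function name expandRow is hypothetical, and copying the weight when only one neuron exists is an assumption of this sketch. For a two-dimensional network, the same routine could be applied first along the x direction and then along the y direction.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1-D sketch of an expansion step for one feature:
// n neuron weights become 2n+1 weights. Old neuron i moves to position 2i+1.
std::vector<double> expandRow(const std::vector<double>& old)
{
    const std::size_t n = old.size();
    assert(n >= 1);
    std::vector<double> out(2 * n + 1);

    for (std::size_t i = 0; i < n; ++i)
        out[2 * i + 1] = old[i];                    // existing neurons
    for (std::size_t i = 1; i < n; ++i)
        out[2 * i] = 0.5 * (old[i - 1] + old[i]);   // inner: linear interpolation

    // edge neurons: linear extrapolation by half a mesh width,
    // 3/2 * nearest - 1/2 * next-nearest; a single neuron is copied (assumption)
    const double left  = (n > 1) ? 1.5 * old[0]     - 0.5 * old[1]     : old[0];
    const double right = (n > 1) ? 1.5 * old[n - 1] - 0.5 * old[n - 2] : old[0];

    // keep the normalized weights inside the permitted range [0, 1]
    out[0]     = std::min(1.0, std::max(0.0, left));
    out[2 * n] = std::min(1.0, std::max(0.0, right));
    return out;
}
```

Clamping is applied here to every feature for simplicity; the text requires it explicitly only for binary and nominal feature values.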

To reverse the network expansion, the 'mesh width' of the network can be doubled by removing every second row of neurons in the x and y directions from the matrix. Before the removal, the information contained in the neurons to be removed is passed on to the neurons that will remain in the network resulting from the removal of every second row of neurons. In this way, the new network inherits the properties of each neuron of the old network with the same weighting. Neurons to be removed that have 4 nearest remaining neighbors (in the x and y directions) pass on their properties with a weighting factor of 1/4 to each of the four remaining neighbors. Neurons to be removed that have 2 nearest remaining neighbors pass on their properties with a weighting factor of 1/2 to each of these two neighbors. Remaining neurons inherit their own properties with the weighting factor 1. FIG. 2 shows schematically a reduction step. The neurons to be removed are shown hatched in the figure.
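A one-dimensional reduction step for a single feature might look roughly like this. The function name shrinkRow is hypothetical. Two points are assumptions of this sketch and are not taken from the text: a removed edge neuron with only one remaining neighbor passes on its weight with factor 1, and the collected contributions are divided by the sum of their weighting factors so that the result is a weighted average.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1-D sketch of a reduction step for one feature:
// 2n+1 neuron weights become n weights. Neurons at odd positions remain.
std::vector<double> shrinkRow(const std::vector<double>& fine)
{
    assert(fine.size() >= 3 && fine.size() % 2 == 1);
    const std::size_t n = (fine.size() - 1) / 2;
    std::vector<double> sum(n, 0.0), factor(n, 0.0);

    // remaining neurons inherit their own weight with factor 1
    for (std::size_t k = 0; k < n; ++k) {
        sum[k] += fine[2 * k + 1];
        factor[k] += 1.0;
    }
    // removed neurons (even positions 2j) pass their weight on in equal
    // shares to their nearest remaining neighbors
    for (std::size_t j = 0; j <= n; ++j) {
        const bool hasLeft = (j > 0), hasRight = (j < n);
        const double share = (hasLeft && hasRight) ? 0.5 : 1.0;
        if (hasLeft)  { sum[j - 1] += share * fine[2 * j]; factor[j - 1] += share; }
        if (hasRight) { sum[j]     += share * fine[2 * j]; factor[j]     += share; }
    }
    // normalize to a weighted average (assumption of this sketch)
    for (std::size_t k = 0; k < n; ++k)
        sum[k] /= factor[k];
    return sum;
}
```

With this normalization, a network whose weights are all equal keeps that value after the reduction, and an inner remaining neuron on a linear ramp keeps its own weight.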

In the multi-grid method presented, a grid expansion scheme can be used, for example, which not only progresses from the coarsest to the finest grid but also returns at least once from a fine grid stage already reached to the next coarser grid stage. The technical advantage of this procedure is that a uniform, faster convergence of all solution vectors is achieved. The computational effort in the SOM network even decreases linearly with each grid coarsening until convergence is achieved. Therefore, in the SOM expansion, many iterations in all expansion stages except the last, as well as the intermediate return to the next coarser stage, cause virtually no increase in computing time. The total computing time is determined almost exclusively by the last, finest expansion stage.

The intended iterations per expansion step can represent upper limits. If the respective SOM network has already converged so far that only minimal changes to the neurons occur, the respective stage can be terminated prematurely.
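The early termination of a stage can be illustrated with a small driver loop. The function name runTrainingPhase, the callback runIteration, and the convergence criterion (largest absolute weight change below a tolerance) are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical sketch: run one training phase with an upper limit on the
// number of iterations and early termination on convergence.
// runIteration performs one pass over the training data and returns the
// largest absolute change of any neuron weight during that pass.
std::size_t runTrainingPhase(const std::function<double()>& runIteration,
                             std::size_t maxIterations,
                             double tolerance)
{
    std::size_t done = 0;
    while (done < maxIterations) {
        const double maxChange = runIteration();
        ++done;
        if (maxChange < tolerance) break;   // converged: only minimal changes
    }
    return done;                            // iterations actually performed
}
```

The returned count makes it possible to see whether a stage ended because it converged or because it reached its iteration limit.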

In an example implementation, which will be described in more detail below, a type of grid expansion step is implemented which comprises halving the mesh width and adding new edge neurons (extrapolation). SOM maps with open boundary conditions are implemented. This means that a variant of the SOM network is provided in which a neuron at the left edge is not the right neighbor of a neuron at the right edge, and a neuron at the bottom edge is not the upper neighbor of a neuron at the top edge.

It should be noted that SOM networks process only numeric characteristics with value ranges between 0 and 1. Therefore, before the actual start of SOM training, the original data is read once, and the original features are transformed to purely numerical, normalized features.
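One possible form of this transformation is sketched below. The function names normalizeNumeric and expandNominal are hypothetical. Linear min-max scaling for numeric features and the expansion of a nominal value into fields containing 0 or 1 are assumptions of this sketch; the text only requires that the resulting features are numeric and lie between 0 and 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: scale a numeric feature linearly to the range [0, 1].
std::vector<double> normalizeNumeric(const std::vector<double>& values)
{
    assert(!values.empty());
    const auto mm = std::minmax_element(values.begin(), values.end());
    const double lo = *mm.first, hi = *mm.second;
    std::vector<double> out(values.size(), 0.0);
    if (hi == lo) return out;            // constant feature: all 0 (assumption)
    for (std::size_t i = 0; i < values.size(); ++i)
        out[i] = (values[i] - lo) / (hi - lo);
    return out;
}

// Hypothetical sketch: expand a nominal value index into nbValues
// normalized fields, with 1 in the field of the value and 0 elsewhere.
std::vector<double> expandNominal(std::size_t valueIndex, std::size_t nbValues)
{
    assert(valueIndex < nbValues);
    std::vector<double> out(nbValues, 0.0);
    out[valueIndex] = 1.0;
    return out;
}
```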

The example implementation consists of at least one auxiliary class and at least one main class.

The class 'SOMParameters' is a helper class that reads all parameters of the SOM algorithm from a parameter file and provides them individually.

The class 'SOMTraining' is a main class that, after a parameter object and one or more data processing objects have been assigned to it, trains a SOM network by performing several network expansion steps and can output the trained network to a file.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

class SOMParameters
{
public: // public methods interface

  // the constructor reads a parameter file and stores the parameter settings in
  // the member variables of this class
  SOMParameters (const string& paramFile = "")
    : ivNbNeuronsX (4), ivNbNeuronsY (3), ivNbExpansions (3),
      ivMaxNeighborDist (2.1), ivLearningRate (0.3), ivMaxMemSizeInMB (512),
      ivModelName ("som"), ivTempDir ("c:\\")
  {
    if (paramFile == "") return;
    ifstream file (paramFile.c_str ());
    if (!file.is_open () || file.eof ()) {
      cout << "Unable to open parameter file '" << paramFile << "'" << endl;
      return;
    }
    string line, param, value;
    while (!file.eof ()) {
      getline (file, line);
      // split the line at '=' into parameter name and value
      size_t posi = line.find ('=');
      param = line.substr (0, posi);
      // remove all blanks from the parameter name
      while (param.find (' ') < param.length ())
        param.erase (param.find (' '), 1);
      value = line.substr (posi + 1, posi < line.length () ? line.length () - posi - 1 : 0);
      if (param.substr (0, 10) == "nbNeuronsX") ivNbNeuronsX = atoi (value.c_str ());
      else if (param.substr (0, 10) == "nbNeuronsY") ivNbNeuronsY = atoi (value.c_str ());
      else if (param.substr (0, 12) == "nbExpansions") ivNbExpansions = atoi (value.c_str ());
      else if (param.substr (0, 15) == "maxNeighborDist") ivMaxNeighborDist = atof (value.c_str ());
      else if (param.substr (0, 12) == "learningRate") ivLearningRate = atof (value.c_str ());
      else if (param.substr (0, 14) == "maxMemSizeInMB") ivMaxMemSizeInMB = atoi (value.c_str ());
      else if (param.substr (0, 9) == "modelName") ivModelName = value;
      else if (param.substr (0, 7) == "tempDir") ivTempDir = value;
      // lines starting with "//" or "#" are comments; empty lines are skipped
      else if (param.substr (0, 2) != "//" && param.substr (0, 1) != "#" && !param.empty ()) {
        cout << "Ignoring unknown SOM parameter '" << param << "'" << endl;
      }
    }
    file.close ();
  }

  // public functions for retrieving each parameter's value
  size_t getNbOfNeuronsX () const { return ivNbNeuronsX; }
  size_t getNbOfNeuronsY () const { return ivNbNeuronsY; }
  size_t getNbOfSOMExpansions () const { return ivNbExpansions; }
  double getMaxNeighborDist () const { return ivMaxNeighborDist; }
  double getLearningRate () const { return ivLearningRate; }
  size_t getMaxMemSizeInMB () const { return ivMaxMemSizeInMB; }
  const DCString& getModelName () const { return ivModelName; }
  const DCString& getTempDirectory () const { return ivTempDir; }

private: // private member variables

  size_t ivNbNeuronsX;
  size_t ivNbNeuronsY;
  size_t ivNbExpansions;
  double ivMaxNeighborDist;
  double ivLearningRate;
  size_t ivMaxMemSizeInMB;
  string ivModelName;
  string ivTempDir;
};

The SOMTraining class is a main class of the implementation with network expansion. After a parameter object and one or more data processing objects have been assigned to it, the class trains a SOM network by performing several network expansion steps (method trainSOM ()) and outputs the trained network to a file.

The SOMTraining class contains four internal methods for network expansion and reduction:

A first method, 'initializeNeighborhood ()', compiles the topology and neighborhood information for a given neuron network size. This can be used to determine which neurons are adjacent to which neuron and at what distance.

A second method, 'initializeSOMNetwork ()', uses a heuristic to select initial values for the neuron weights of the smallest, coarsest SOM network.

A third method, 'expandSOMNetwork ()', performs a network expansion step from nx · ny neurons to (2nx + 1) · (2ny + 1) neurons.

A fourth method, 'shrinkSOMNetwork ()', performs a network reduction step from (2nx + 1) · (2ny + 1) neurons to nx · ny neurons. All of the aforementioned methods, with the exception of the method 'trainSOM ()', are described in detail below. The method 'trainSOM ()' is shown below in its implementation.
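The growth of the network over several expansion steps follows directly from the rule n to 2n + 1 per axis. The helper name sizeAfterExpansions is hypothetical and only illustrates this arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Number of neurons along one axis after the given number of expansion
// steps; each step maps n to 2n + 1 (hypothetical helper for illustration).
std::size_t sizeAfterExpansions(std::size_t n, std::size_t nbExpansions)
{
    for (std::size_t i = 0; i < nbExpansions; ++i)
        n = 2 * n + 1;
    return n;
}
```

With the default values of SOMParameters (4 × 3 neurons, 3 expansions), the axes grow as 4, 9, 19, 39 and 3, 7, 15, 31, so the final network has 39 × 31 = 1209 neurons.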

class SOMTraining
{
public: // public methods interface

  // constructor
  SOMTraining (const SOMParameters& params, const vector<DataPage*>& data);

  // training of the SOM
  bool trainSOM ();

  // write the SOM, i.e. the neuron coordinates and weights, to a csv data file
  bool writeCSVFile () const;

private: // private methods

  // fill the array pivNeighborhood with topological neighborhood infos, using
  // the current values of nbNeuronsX, nbNeuronsY, and ivMaxNeighborDist.
  bool initializeNeighborhood ();

  // choose initial neuron values for each of the normalized fields
  void initializeSOMNetwork ();

  // increase the number of neurons by inserting new neurons between the
  // existing neurons
  bool expandSOMNetwork ();

private: // private member variables

  // const references to the external objects used in the constructor
  const DataDescription& ivDescr;
  const vector<DataPage*>& ivData;
  const SOMParameters& ivParams;

  // properties of the training data
  size_t ivNbRecords;
  size_t ivNbNumFlds; // numeric fields
  size_t ivNbBinFlds; // binary (boolean) fields
  size_t ivNbNomFlds; // nominal fields

  // in SOM, each nominal field is expanded into n normalized fields, where n is
  // the number of the field's valid field values. 'ivNbNormalizedFlds' is the
  // total number of normalized fields, including the numeric and binary fields
  size_t ivNbNormalizedFlds;

  // array of length 'ivNbNomFlds' which holds for each nominal field the
  // number of valid values of this field
  size_t* pivNbNomValues;

  // current number of neurons (in X and Y direction and total)
  size_t ivNbNeuronsX;
  size_t ivNbNeuronsY;
  size_t ivNbNeurons;

  // number of neural network expansion steps to be performed. Each expansion
  // step increases ivNbNeurons[X/Y] to 2 * ivNbNeurons[X/Y] + 1.
  size_t ivNbExpansions;

  // the total number of neurons after the last expansion step
  size_t ivMaxNbNeurons;

  // array of length ivMaxNbNeurons * ivNbNormalizedFlds, contains the properties
  // of all normalized fields in all neurons (neurons are the inner index)
  double* pivSOM;

  // index of a neighbor neuron (first) and its distance (second);
  // ordered by ascending distance, then by ascending neuron index
  struct NeighborDistance : public pair<size_t, double> {
    NeighborDistance (size_t neuron = 0, double dist = 0)
      : pair<size_t, double> (neuron, dist) {}
    bool operator< (const NeighborDistance& d) const
    { return second < d.second || (second == d.second && first < d.first); }
  };
  typedef pair<size_t, NeighborDistance*> NeighborData;
  NeighborData* pivNeighborhood;

  // array of length ivNbNeurons in which the distances between the current
  // data record and each of the neurons are calculated
  double* pivDistances;
};

The method SOMTraining::initializeNeighborhood():

bool SOMTraining::initializeNeighborhood()
{
    // some initializations
    size_t maxDist = (size_t)(ivParams.getMaxNeighborDist());
    size_t maxDistSqr = (size_t)(ivParams.getMaxNeighborDist() * ivParams.getMaxNeighborDist());
    size_t iNeuron = 0;
    vector<NeighborDistance> tmpNeigh((2 * maxDist + 1) * (2 * maxDist + 1));

    // allocate the array of neighborhood info (length ivNbNeurons)
    pivNeighborhood = new NeighborData[ivNbNeurons];
    if (!pivNeighborhood) return false; // error: out of memory

    // loop over all neuron y coordinates and determine all neighbors'
    // y coordinates within maxDist
    for (size_t iY = 0; iY < ivNbNeuronsY; iY++) {
        size_t yMin = (iY >= maxDist) ? iY - maxDist : 0;
        size_t yMax = (iY + maxDist < ivNbNeuronsY) ? iY + maxDist : ivNbNeuronsY - 1;

        // loop over all neuron x coordinates and determine all neighbors'
        // x coordinates within maxDist
        for (size_t iX = 0; iX < ivNbNeuronsX; iX++, iNeuron++) {
            size_t xMin = (iX >= maxDist) ? iX - maxDist : 0;
            size_t xMax = (iX + maxDist < ivNbNeuronsX) ? iX + maxDist : ivNbNeuronsX - 1;

            // determine the number of neighbors within maxDist and their neuron
            // indexes and store them in the preliminary array tmpNeigh.
            NeighborData& neighbors = pivNeighborhood[iNeuron];
            neighbors.first = 0;
            for (size_t nX = xMin; nX <= xMax; nX++) {
                for (size_t nY = yMin; nY <= yMax; nY++) {
                    size_t distSqr = (iX - nX) * (iX - nX) + (iY - nY) * (iY - nY);
                    if (distSqr > maxDistSqr || distSqr == 0) continue;
                    tmpNeigh[neighbors.first].first = nX + ivNbNeuronsX * nY;
                    tmpNeigh[neighbors.first].second = sqrt((double)distSqr);
                    neighbors.first++;
                }
            }

            // sort the temporary array of neighbors by ascending distance
            sort(tmpNeigh.begin(), tmpNeigh.begin() + neighbors.first);

            // copy the neighbors from the preliminary to the final neighbor array
            neighbors.second = new NeighborDistance[neighbors.first];
            if (!neighbors.second) return false; // out of memory
            for (size_t i = 0; i < neighbors.first; i++)
                neighbors.second[i] = tmpNeigh[i];
        } // end of loop over neuron coordinate x
    } // end of loop over neuron coordinate y

    return true;
}
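The neighborhood construction can be illustrated in isolation. The following is a minimal sketch, not the listing above: the function name neighborsOf and the type alias Neighbor are hypothetical, std::vector replaces the raw arrays, and the grid is assumed to contain at least one neuron in each direction. It collects, for a single neuron, all other neurons within the maximum neighbor distance and sorts them by ascending distance, as the method does for every neuron.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// (neuron index, distance); hypothetical stand-in for NeighborDistance.
using Neighbor = std::pair<std::size_t, double>;

// Returns all neurons other than (x, y) that lie within Euclidean distance
// maxDist on an nbX-by-nbY grid, sorted by ascending distance and then by
// ascending neuron index. Neuron index convention: index = x + nbX * y.
std::vector<Neighbor> neighborsOf(std::size_t x, std::size_t y,
                                  std::size_t nbX, std::size_t nbY,
                                  std::size_t maxDist)
{
    // clip the search window to the grid borders
    const std::size_t xMin = (x >= maxDist) ? x - maxDist : 0;
    const std::size_t xMax = (x + maxDist < nbX) ? x + maxDist : nbX - 1;
    const std::size_t yMin = (y >= maxDist) ? y - maxDist : 0;
    const std::size_t yMax = (y + maxDist < nbY) ? y + maxDist : nbY - 1;

    std::vector<Neighbor> result;
    for (std::size_t nY = yMin; nY <= yMax; ++nY) {
        for (std::size_t nX = xMin; nX <= xMax; ++nX) {
            const std::size_t dx = (nX > x) ? nX - x : x - nX;
            const std::size_t dy = (nY > y) ? nY - y : y - nY;
            const std::size_t distSqr = dx * dx + dy * dy;
            if (distSqr == 0 || distSqr > maxDist * maxDist) continue;
            result.emplace_back(nX + nbX * nY,
                                std::sqrt(static_cast<double>(distSqr)));
        }
    }
    std::sort(result.begin(), result.end(),
              [](const Neighbor& a, const Neighbor& b) {
                  return a.second < b.second ||
                         (a.second == b.second && a.first < b.first);
              });
    return result;
}
```

For example, on a 3 x 3 grid with maxDist = 1, the corner neuron (0, 0) has the two neighbors with indexes 1 and 3, each at distance 1.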

The function inverseErf(double c) is a function whose implementation is not listed here. This function calculates the inverse Gaussian error function $\mathrm{erf}^{-1}(c)$. That is, for a given confidence c (where 0 < c < 1), the function calculates the interval width w such that the integral of the Gaussian bell curve function $G(x) = \frac{1}{\sqrt{2\pi}\,s}\, e^{-(x-m)^2/(2s^2)}$ over the interval $[m - w \cdot s,\; m + w \cdot s]$ takes exactly the value c.

Special values of $\mathrm{erf}^{-1}(c)$ are:

• $\mathrm{erf}^{-1}(0.0) = 0.0$

• $\mathrm{erf}^{-1}(0.683)$ = $\mathrm{erf}^{-1}$(probability for x to be in $[m - 1 \cdot s,\; m + 1 \cdot s]$) = 1.0

• $\mathrm{erf}^{-1}(0.955)$ = $\mathrm{erf}^{-1}$(probability for x to be in $[m - 2 \cdot s,\; m + 2 \cdot s]$) = 2.0

• $\mathrm{erf}^{-1}(0.997)$ = $\mathrm{erf}^{-1}$(probability for x to be in $[m - 3 \cdot s,\; m + 3 \cdot s]$) = 3.0

• $\mathrm{erf}^{-1}(c \to 1) \to \infty$
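Since the implementation of inverseErf is not listed, the following is only one possible sketch that is consistent with the special values above. It assumes that the interval width w for confidence c satisfies erf(w / sqrt(2)) = c, which follows from the integral definition, and solves this equation by bisection using std::erf. The name inverseErfSketch is hypothetical; the original function may use a different numerical method.

```cpp
#include <cmath>
#include <limits>

// Returns the interval width w (in units of the standard deviation s) such
// that a normally distributed value lies in [m - w*s, m + w*s] with
// probability c. Assumption: solved by bisection on erf(w / sqrt(2)) = c.
double inverseErfSketch(double c)
{
    if (c <= 0.0) return 0.0;
    if (c >= 1.0) return std::numeric_limits<double>::infinity();

    const double invSqrt2 = 1.0 / std::sqrt(2.0);
    double lo = 0.0;
    double hi = 1.0;
    // enlarge the bracket until it contains the solution
    while (std::erf(hi * invSqrt2) < c) hi *= 2.0;
    // bisection: the coverage probability grows monotonically with w
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (std::erf(mid * invSqrt2) < c) lo = mid;
        else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

With this sketch, a confidence of about 0.6827 yields a width close to 1, and about 0.9545 yields a width close to 2.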

The method SOMTraining::initializeSOMNetwork():

void SOMTraining::initializeSOMNetwork()
{
    srand(0);
    const double rand_denom = 1. / RAND_MAX; // floating point division
    double* p = pivSOM;
    valarray<double> sumNom(1., ivNbNeurons);

    // cases 1 + 2: numeric and binary fields
    for (size_t fld = 0; fld < ivNbNumFlds + ivNbBinFlds; fld++) {
        for (size_t n = 0; n < ivNbNeurons; n++, p++) {
            int rnd_int = rand();
            double rnd = rand_denom * rnd_int;

            // case 1: numeric field: choose a normally distributed random number
            // with mean = 0 and stdDev = 0.25. This makes sure that the maximum
            // difference between two values, 4 * sigma, is 1.0 and equals the maximum
            // difference between binary and nominal field values.
            if (fld < ivNbNumFlds) {
                const GaussianCompress& stats = ivDescr.getNumericStats(fld);
                *p = ((rnd_int & 1) ? 0.25 : -0.25) * inverseErf(rnd);
            }
            // case 2: binary field: choose a random probability of the 'yes' value
            else if (fld < ivNbNumFlds + ivNbBinFlds) {
                *p = rnd;
            }
        }
    }

    // case 3: nominal fields: choose equally distributed random probabilities
    // between 0 and 2 / nbValues for all values, observing the requirement that
    // all values' probabilities must sum up to 1.
    for (size_t fld = 0; fld < ivNbNomFlds; fld++) {
        size_t i = 0;
        for (; i + 1 < pivNbNomValues[fld]; i++) {
            for (size_t n = 0; n < ivNbNeurons; n++, p++) {
                int rnd_int = rand();
                double rnd = rand_denom * rnd_int;
                *p = rand_denom * rand() * 2. * sumNom[n] / (pivNbNomValues[fld] - i);
                sumNom[n] -= *p;
            }
        }
        // the last value of the field receives the remaining probability
        for (size_t n = 0; n < ivNbNeurons; n++, p++) {
            *p = sumNom[n];
            sumNom[n] = 1;
        }
    }
}
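The initialization of nominal fields (case 3) can be sketched for a single neuron and a single nominal field as follows. This is an illustration under stated assumptions: the function name randomProbabilities is hypothetical, and std::mt19937 with a uniform distribution on [0, 1) is swapped in for rand(). Each value except the last receives a random share of the probability that is still unassigned, and the last value receives the remainder, so that the probabilities sum to 1.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draws random probabilities for the nbValues values of one nominal field.
// Value i (except the last) gets u * 2 * remaining / (nbValues - i) with u
// uniform in [0, 1); the last value gets whatever probability remains.
std::vector<double> randomProbabilities(std::size_t nbValues, std::mt19937& rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> probs(nbValues, 0.0);
    double remaining = 1.0;
    for (std::size_t i = 0; i + 1 < nbValues; ++i) {
        // nbValues - i >= 2 here, so the share never exceeds 'remaining'
        probs[i] = uniform(rng) * 2.0 * remaining
                   / static_cast<double>(nbValues - i);
        remaining -= probs[i];
    }
    if (nbValues > 0) probs[nbValues - 1] = remaining;
    return probs;
}
```

Because the factor 2 / (nbValues - i) is at most 1 for every value except the last, no probability becomes negative.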

The method SOMTraining::expandSOMNetwork():

bool SOMTraining::expandSOMNetwork()
{
    // deallocate the existing neighborhood info of the old, small network
    if (pivNeighborhood) {
        for (size_t i = 0; i < ivNbNeurons; i++) {
            NeighborData& p = pivNeighborhood[i];
            delete[] p.second;
        }
        delete[] pivNeighborhood; // allocated with new[]
    }

    // expand the network by inserting new neurons around each existing neuron at
    // half the old neuron distance both in x and in y direction
    size_t nbNeuronsXOld = ivNbNeuronsX;
    size_t nbNeuronsYOld = ivNbNeuronsY;
    size_t nbNeuronsOld = ivNbNeurons;
    ivNbNeuronsX = 2 * ivNbNeuronsX + 1;
    ivNbNeuronsY = 2 * ivNbNeuronsY + 1;
    ivNbNeurons = ivNbNeuronsX * ivNbNeuronsY;

    // update the neighborhood info
    if (initializeNeighborhood() != true) {
        return false; // an error occurred, e.g. out of memory
    }

    // create a new net of double neuron density. The newly added neurons'
    // properties are linear interpolations of the two nearest existing neurons'
    // properties.
    // First, we copy the existing neurons' properties into the larger net.
    // Note: we have to go through all neuron indexes and fields in reverse
    // direction, otherwise we would overwrite data which is needed later.
    for (int fld = ivNbNormalizedFlds - 1; fld >= 0; fld--) {
        for (int iY = nbNeuronsYOld - 1; iY >= 0; iY--) {
            for (int iX = nbNeuronsXOld - 1; iX >= 0; iX--) {
                size_t iOld = iX + nbNeuronsXOld * iY;
                size_t iNew = 2 * iX + 1 + ivNbNeuronsX * (2 * iY + 1);
                pivSOM[iNew + fld * ivNbNeurons] = pivSOM[iOld + fld * nbNeuronsOld];
            }
        }
    }

    // At this stage, the existing neurons are located at odd x and y coordinates.
    // Now we have to calculate the inserted neurons' properties, i.e. the
    // properties of the neurons with at least one even coordinate.
    // We start with the neurons with odd y and even x coordinate. These neurons
    // have existing neurons at positions (x-1, y) and (x+1, y) whose properties
    // we can interpolate.
    double* const pStop = pivSOM + ivNbNormalizedFlds * ivNbNeurons;
    double* const pBinStart = pivSOM + ivNbNumFlds * ivNbNeurons;
    double* const pNomStart = pivSOM + (ivNbNumFlds + ivNbBinFlds) * ivNbNeurons;
    for (int iY = 1; iY < ivNbNeuronsY; iY += 2) {
        double* p = pivSOM + iY * ivNbNeuronsX;

        // special case: neurons with x == 0.
        // Here, we have no left neighbor, therefore we extrapolate the properties of
        // the first and second neighbors to the right, i.e. (x+1, y) and (x+3, y):
        // properties(x, y) := 1.5 * properties(x+1, y) - 0.5 * properties(x+3, y).
        // First, we calculate the numeric field values
        for (; p < pBinStart; p += ivNbNeurons)
            *p = 1.5 * *(p + 1) - 0.5 * *(p + 3);

        // For binary and nominal fields, the above formula for properties(x, y) has to
        // be modified because we have the additional constraint that all results must
        // be between 0 and 1, whereas the above formula might produce values < 0 or > 1.
        // The correction for binary fields is simple: replace < 0 by 0 and > 1 by 1.
        for (; p < pNomStart; p += ivNbNeurons) {
            *p = 1.5 * *(p + 1) - 0.5 * *(p + 3);
            if (*p > 1.) *p = 1.;
            else if (*p < 0.) *p = 0;
        }

        // The correction for nominal fields is more difficult because we have the
        // additional constraint that all values' probabilities must sum up to 1.
        // Therefore, we start with the extrapolation formula with d := 0.5,
        // properties(x, y) := (1 + d) * properties(x+1, y) - d * properties(x+3, y),
        // and reduce d until all generated probabilities are in the range [0..1].
        // The constraint that all probabilities sum up to 1 is always fulfilled.
        size_t nomFld = 0;
        double d = 0.5;
        for (size_t nomVal = 0; p < pStop;) {
            // calculate the current extrapolated value. Check whether it is in [0,1]
            *p = (1. + d) * *(p + 1) - d * *(p + 3);
            if (*p > 1. || *p < 0.) {
                // find the maximum d which leaves all probabilities in the valid range
                d = (((*p > 1) ? 1. : 0.) - *(p + 1)) / (*(p + 1) + *(p + 3));
                // ... and restart recalculating all probabilities for the current field
                // by resetting the pointer to the first value of the current field.
                p -= nomVal * ivNbNeurons;
                nomVal = 0;
            } else {
                nomVal++;
                p += ivNbNeurons;
                if (nomVal == pivNbNomValues[nomFld]) {
                    // we are done with all values of the current field. Go to next field.
                    nomFld++;
                    nomVal = 0;
                    d = 0.5;
                }
            }
        }

        // general case: neurons with 0 < x < nbNeuronsX-1.
        // Here, we have two neighbors between which we can interpolate:
        // properties(x, y) := 0.5 * properties(x-1, y) + 0.5 * properties(x+1, y).
        // Therefore, the situation < 0 or > 1 can never occur for binary or nominal
        // fields, and no special treatment for these fields is needed.
        for (int iX = 2; iX + 1 < ivNbNeuronsX; iX += 2)
            for (p = pivSOM + iX + iY * ivNbNeuronsX; p < pStop; p += ivNbNeurons) {
                *p = 0.5 * (*(p - 1) + *(p + 1));
            }

        // special case: neurons with x == nbNeuronsX-1
        // This case is treated analogously to the special case x == 0.
        p = pivSOM + (iY + 1) * ivNbNeuronsX - 1;
        for (; p < pBinStart; p += ivNbNeurons) // numeric fields
            *p = 1.5 * *(p - 1) - 0.5 * *(p - 3);
        for (; p < pNomStart; p += ivNbNeurons) { // binary fields
            *p = 1.5 * *(p - 1) - 0.5 * *(p - 3);
            if (*p > 1.) *p = 1.;
            else if (*p < 0.) *p = 0;
        }
        nomFld = 0;
        d = 0.5;
        for (size_t nomVal = 0; p < pStop;) { // nominal fields
            *p = (1. + d) * *(p - 1) - d * *(p - 3);
            if (*p > 1. || *p < 0.) {
                // find the maximum d which leaves all probabilities in the valid range
                d = (((*p > 1) ? 1. : 0.) - *(p - 1)) / (*(p - 1) + *(p - 3));
                // ... and recalculate all probabilities for the current field
                p -= nomVal * ivNbNeurons;
                nomVal = 0;
            } else {
                nomVal++;
                p += ivNbNeurons;
                if (nomVal == pivNbNomValues[nomFld]) {
                    nomFld++;
                    nomVal = 0;
                    d = 0.5;
                }
            }
        }
    } // end of loop over neurons' y coordinates

    // Next, we calculate the new neurons' properties for even y coordinates.
    // At this stage, valid neuron properties exist for both even and odd x and for
    // odd y coordinates. That means, for each pair (x, y), y even, we can calculate:
    // properties(x, y) := 0.5 * properties(x, y-1) + 0.5 * properties(x, y+1), resp.
    // properties(x, 0) := 1.5 * properties(x, 1) - 0.5 * properties(x, 3), resp.
    // properties(x, nY-1) := 1.5 * properties(x, nY-2) - 0.5 * properties(x, nY-4).

    // special case: neurons with y == 0
    for (int iX = 0; iX < ivNbNeuronsX; iX++) {
        double* p = pivSOM + iX;
        for (; p < pBinStart; p += ivNbNeurons) // numeric fields
            *p = 1.5 * *(p + ivNbNeuronsX) - 0.5 * *(p + 3 * ivNbNeuronsX);
        for (; p < pNomStart; p += ivNbNeurons) { // binary fields
            *p = 1.5 * *(p + ivNbNeuronsX) - 0.5 * *(p + 3 * ivNbNeuronsX);
            if (*p > 1.) *p = 1.;
            else if (*p < 0.) *p = 0;
        }
        size_t nomFld = 0;
        double d = 0.5;
        for (size_t nomVal = 0; p < pStop;) { // nominal fields
            *p = (1. + d) * *(p + ivNbNeuronsX) - d * *(p + 3 * ivNbNeuronsX);
            if (*p > 1. || *p < 0.) {
                // find the maximum d which leaves all probabilities in the valid range
                d = (((*p > 1) ? 1. : 0.) - *(p + ivNbNeuronsX)) /
                    (*(p + ivNbNeuronsX) + *(p + 3 * ivNbNeuronsX));
                // ... and recalculate all probabilities for the current field
                p -= nomVal * ivNbNeurons;
                nomVal = 0;
            } else {
                nomVal++;
                p += ivNbNeurons;
                if (nomVal == pivNbNomValues[nomFld]) {
                    nomFld++;
                    nomVal = 0;
                    d = 0.5;
                }
            }
        }
    }

    // general case: neurons with 0 < y < nbNeuronsY-1
    for (int iY = 2; iY + 1 < ivNbNeuronsY; iY += 2)
        for (int iX = 0; iX < ivNbNeuronsX; iX++) {
            for (double* p = pivSOM + iX + iY * ivNbNeuronsX; p < pStop; p += ivNbNeurons)
                *p = 0.5 * (*(p + ivNbNeuronsX) + *(p - ivNbNeuronsX));
        }

    // special case: neurons with y == nbNeuronsY-1
    for (int iX = 0; iX < ivNbNeuronsX; iX++) {
        double* p = pivSOM + iX + ivNbNeuronsX * (ivNbNeuronsY - 1);
        for (; p < pBinStart; p += ivNbNeurons) // numeric fields
            *p = 1.5 * *(p - ivNbNeuronsX) - 0.5 * *(p - 3 * ivNbNeuronsX);
        for (; p < pNomStart; p += ivNbNeurons) { // binary fields
            *p = 1.5 * *(p - ivNbNeuronsX) - 0.5 * *(p - 3 * ivNbNeuronsX);
            if (*p > 1.) *p = 1.;
            else if (*p < 0.) *p = 0;
        }
        size_t nomFld = 0;
        double d = 0.5;
        for (size_t nomVal = 0; p < pStop;) { // nominal fields
            *p = (1. + d) * *(p - ivNbNeuronsX) - d * *(p - 3 * ivNbNeuronsX);
            if (*p > 1. || *p < 0.) {
                // find the maximum d which leaves all probabilities in the valid range
                d = (((*p > 1) ? 1. : 0.) - *(p - ivNbNeuronsX)) /
                    (*(p - ivNbNeuronsX) + *(p - 3 * ivNbNeuronsX));
                // ... and recalculate all probabilities for the current field
                p -= nomVal * ivNbNeurons;
                nomVal = 0;
            } else {
                nomVal++;
                p += ivNbNeurons;
                if (nomVal == pivNbNomValues[nomFld]) {
                    nomFld++;
                    nomVal = 0;
                    d = 0.5;
                }
            }
        }
    }

return true; }
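The interpolation scheme of the expansion step is easier to follow in one dimension. The following is a simplified sketch for a single row of numeric values only; the function name expandRow is hypothetical, std::vector replaces the in-place array handling, and the clamping for binary and nominal fields is omitted. The row must contain at least two old neurons so that the border extrapolation has two neighbors to work with.

```cpp
#include <cstddef>
#include <vector>

// Expands a row of nOld neuron values to 2 * nOld + 1 values.
// Old neurons land at odd positions; inner even positions are interpolated
// from their two neighbors; the two border positions are extrapolated.
// Precondition: oldRow.size() >= 2.
std::vector<double> expandRow(const std::vector<double>& oldRow)
{
    const std::size_t nOld = oldRow.size();
    const std::size_t nNew = 2 * nOld + 1;
    std::vector<double> row(nNew, 0.0);

    // copy the existing neurons to the odd positions
    for (std::size_t i = 0; i < nOld; ++i) row[2 * i + 1] = oldRow[i];

    // general case: 0 < x < nNew - 1, interpolate between both neighbors
    for (std::size_t x = 2; x + 1 < nNew; x += 2)
        row[x] = 0.5 * (row[x - 1] + row[x + 1]);

    // special cases: extrapolate from the first and second existing neighbor
    row[0] = 1.5 * row[1] - 0.5 * row[3];
    row[nNew - 1] = 1.5 * row[nNew - 2] - 0.5 * row[nNew - 4];
    return row;
}
```

For the old row {1, 3}, the sketch produces {0, 1, 2, 3, 4}: a linear trend in the old neurons is continued exactly by the inserted ones.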

The method SOMTraining::shrinkSOMNetwork():

bool SOMTraining::shrinkSOMNetwork()
{
    // deallocate the existing neighborhood info of the old, larger network
    if (pivNeighborhood) {
        for (size_t i = 0; i < ivNbNeurons; i++) {
            NeighborData& p = pivNeighborhood[i];
            delete[] p.second;
        }
        delete[] pivNeighborhood; // allocated with new[]
    }

    // halve the number of neurons both in x and in y direction
    size_t nbNeuronsXOld = ivNbNeuronsX;
    size_t nbNeuronsOld = ivNbNeurons;
    ivNbNeuronsX = ivNbNeuronsX / 2;
    ivNbNeuronsY = ivNbNeuronsY / 2;
    ivNbNeurons = ivNbNeuronsX * ivNbNeuronsY;

    // update the neighborhood info
    if (initializeNeighborhood() != true) {
        return false; // an error occurred, e.g. out of memory
    }

    // shrink the network. The neurons with odd x and y coordinates will remain;
    // all other neurons will be dropped, but before dropping them, their properties
    // are merged into the properties of the remaining neurons.
    for (size_t fld = 0; fld < ivNbNormalizedFlds; fld++) {
        for (size_t iY = 0; iY < ivNbNeuronsY; iY++) {
            for (size_t iX = 0; iX < ivNbNeuronsX; iX++) {
                size_t iOld = (2 * iX + 1) + nbNeuronsXOld * (2 * iY + 1);
                size_t iNew = iX + ivNbNeuronsX * iY;
                pivSOM[iNew + fld * ivNbNeurons] =
                      0.25   * pivSOM[iOld + fld * nbNeuronsOld]
                    + 0.125  * pivSOM[iOld - 1 + fld * nbNeuronsOld]
                    + 0.125  * pivSOM[iOld + 1 + fld * nbNeuronsOld]
                    + 0.125  * pivSOM[iOld - nbNeuronsXOld + fld * nbNeuronsOld]
                    + 0.125  * pivSOM[iOld + nbNeuronsXOld + fld * nbNeuronsOld]
                    + 0.0625 * pivSOM[iOld - 1 - nbNeuronsXOld + fld * nbNeuronsOld]
                    + 0.0625 * pivSOM[iOld - 1 + nbNeuronsXOld + fld * nbNeuronsOld]
                    + 0.0625 * pivSOM[iOld + 1 - nbNeuronsXOld + fld * nbNeuronsOld]
                    + 0.0625 * pivSOM[iOld + 1 + nbNeuronsXOld + fld * nbNeuronsOld];
            }
        }
    }

    return true;
}
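The weighting used when shrinking the network can be illustrated for one field of a small grid. The following is a minimal sketch, assuming an old grid whose side lengths are odd (2k + 1), so that every remaining neuron has all eight neighbors; the function name shrinkGrid is hypothetical, and the result is written to a new vector rather than in place. The weights 0.25, 4 x 0.125 and 4 x 0.0625 sum to 1, so a constant field stays constant.

```cpp
#include <cstddef>
#include <vector>

// Shrinks one field of an nbXOld-by-nbYOld grid (index = x + nbXOld * y) to
// (nbXOld / 2)-by-(nbYOld / 2). Each remaining neuron (odd x and y in the old
// grid) becomes a weighted average of itself and its eight old neighbors.
// Precondition: nbXOld and nbYOld are odd and at least 3.
std::vector<double> shrinkGrid(const std::vector<double>& oldGrid,
                               std::size_t nbXOld, std::size_t nbYOld)
{
    const std::size_t nbX = nbXOld / 2;
    const std::size_t nbY = nbYOld / 2;
    std::vector<double> grid(nbX * nbY, 0.0);
    for (std::size_t iY = 0; iY < nbY; ++iY) {
        for (std::size_t iX = 0; iX < nbX; ++iX) {
            const std::size_t c = (2 * iX + 1) + nbXOld * (2 * iY + 1);
            grid[iX + nbX * iY] =
                  0.25   * oldGrid[c]
                + 0.125  * (oldGrid[c - 1] + oldGrid[c + 1] +
                            oldGrid[c - nbXOld] + oldGrid[c + nbXOld])
                + 0.0625 * (oldGrid[c - 1 - nbXOld] + oldGrid[c - 1 + nbXOld] +
                            oldGrid[c + 1 - nbXOld] + oldGrid[c + 1 + nbXOld]);
        }
    }
    return grid;
}
```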

The method SOMTraining::writeCSVFile() outputs the trained network (neuron positions and neuron weights) to a file, here a CSV file.

bool SOMTraining::writeCSVFile() const
{
    // open the output file
    string fileName(ivParams.getModelName());
    fileName += ".csv";
    ofstream file(fileName.c_str());
    if (!file.is_open()) {
        cout << "Unable to open output file '" << fileName << "'" << endl;
        return false;
    }

    // write the header row of the output file
    file << "neuron_x" << "," << "neuron_y";
    for (size_t i = 0; i < ivNbNumFlds; i++)
        file << "," << ivDescr.getFieldName(i);
    for (size_t i = 0; i < ivNbBinFlds; i++) {
        file << "," << ivDescr.getFieldName(i + ivNbNumFlds);
        file << "_" << ivDescr.getFirstBinaryValue(i);
    }
    for (size_t i = 0; i < ivNbNomFlds; i++) {
        for (size_t j = 0; j < pivNbNomValues[i]; j++) {
            file << "," << ivDescr.getFieldName(i + ivNbNumFlds + ivNbBinFlds);
            file << "_" << ivDescr.getNominalValue(i, j);
        }
    }
    file << endl;

    // write the data rows of the output file
    size_t i = 0;
    for (size_t x = 0; x < ivNbNeuronsX; x++) {
        for (size_t y = 0; y < ivNbNeuronsY; y++, i++) {
            file << x << "," << y;
            double* pNeur = pivSOM + i;
            for (size_t j = 0; j < ivNbNormalizedFlds; j++, pNeur += ivNbNeurons) {
                double value = *pNeur;
                if (j < ivNbNumFlds) {
                    // denormalize numeric values
                    const GaussianCompress& stats = ivDescr.getNumericStats(j);
                    value = value * 4. * stats.getStdDev() + stats.getMean();
                }
                file << "," << value;
            }
            file << endl;
        }
    }

    return file.good();
}

Currently available software packages, for example DataCockpit 1.03, use the following principle to train a SOM network:

For all iterations (for example, about 200) {
    For all training data sets d (about 10⁴ - 10⁹) {
        minimum_distance := 10³⁰⁰
        best_Neuron := -1
        For all neurons n (about 1000) {
            distance_d_n := 0
            weight_n := weight[n]
            For all (normalized) features m (for example approx. 5 - 300) {
                If d[m] is present and valid {
                    distance_d_n += (weight_n[m] - d[m])² * weighting_factor[m]
                }
            }
            If distance_d_n < minimum_distance {
                minimum_distance := distance_d_n
                best_Neuron := n
            }
        }
        Shift the weights of best_Neuron and its neighbors toward d
    }
    Change learning rate and maximum neighborhood radius
}

Here, in each iteration step and for each record, all the neurons of the network are traversed, and for each neuron the Euclidean distance between its weights and the normalized feature values of the data set is calculated. Then the neuron with the lowest distance is determined, and this neuron and its neighbor neurons are adjusted in their weights towards the data set. The nested loops of the pseudo-code are the computationally critical parts; most of the computational time is consumed in the innermost loop over all normalized features.
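The search for the best neuron in this conventional scheme can be sketched as follows. This is an illustration, not the DataCockpit code: the function name bestNeuronBaseline, the use of std::vector, and the representation of missing values by a separate valid flag per feature are assumptions. The weights are stored neuron by neuron, and the innermost loop runs over the normalized features and contains the validity test.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Conventional scheme: weights are stored neuron-major, i.e. the weight of
// neuron n for normalized feature m is weights[n * nbFeatures + m].
// Returns the index of the neuron closest to record d, or -1 if there is none.
std::ptrdiff_t bestNeuronBaseline(const std::vector<double>& weights,
                                  const std::vector<double>& d,
                                  const std::vector<bool>& valid,
                                  const std::vector<double>& weightingFactor,
                                  std::size_t nbNeurons)
{
    const std::size_t nbFeatures = d.size();
    double minimumDistance = std::numeric_limits<double>::max();
    std::ptrdiff_t best = -1;
    for (std::size_t n = 0; n < nbNeurons; ++n) {
        const double* weightN = weights.data() + n * nbFeatures;
        double distance = 0.0;
        // innermost loop: three arrays plus a scalar, and a branch per pass
        for (std::size_t m = 0; m < nbFeatures; ++m) {
            if (valid[m]) {
                const double diff = weightN[m] - d[m];
                distance += diff * diff * weightingFactor[m];
            }
        }
        if (distance < minimumDistance) {
            minimumDistance = distance;
            best = static_cast<std::ptrdiff_t>(n);
        }
    }
    return best;
}
```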

The correction factor weighting_factor[m] is needed because a plurality of normalized features is created from one original nominal feature, namely as many as the original feature has valid values. If this number is N, then all N normalized features resulting from this feature must be multiplied by the weighting factor 1/N. Normalized numeric and binary features, on the other hand, have a weighting factor of 1. This measure is designed to ensure that the overall influence of a nominal field on the SOM network does not exceed that of a numerical or binary feature.

For example, a dataset of production data contains 10 million records, and each record contains 10 numerical, 10 binary, and 10 nominal features. The nominal features each contain 10 different values. This dataset requires approximately 10⁷ * (10 * 8 + 10 * 1 + 10 * 16) bytes of storage space, or about 2.5 gigabytes. Examples of such data known in practice include production data in the engineering, chemical, automotive, and supplier industries, or customer data in retail, finance, or insurance companies.

Out of the 30 original features, after normalization, 120 become normalized features. The algorithm presented in the example therefore requires approximately the following number of elementary operations (= CPU clock cycles) for 200 iterations and 30 * 40 neurons:

200 * 10 ⁷ * 30 * 40 * (120 * 15 + 20) = 4.37 * 10 ¹⁵ .

The expression in parentheses results from the fact that 11 elementary operations are performed for each of the 120 normalized features within the innermost loop: 4 calculations (+, -, *, *), 4 field accesses ([]), one comparison (<), one branch (if), as well as a loop counter increment and termination test. The loop accesses three different fields and a scalar quantity: weight_n[m], d[m], weighting_factor[m] and distance_d_n. Modern CPUs allow access to a field value [m] within one clock cycle. However, reading the field value unprepared from memory costs at least 4-5 clock cycles. In the present case, the three field accesses are in principle suitable for pipelining, since the access to each field is strictly sequential (i.e. in loop pass m at location m). However, with the simultaneous provision of 3 different field values from three different long fields, plus a scalar, the pipelining heuristic is likely to be overwhelmed, so that for at least one of the field accesses an access time of 4-5 clock cycles must be assumed. Together with the 11 elementary operations, this results in the value of 15 clock cycles per loop pass. The additional 20 clock cycles in the expression in parentheses are required for the program code outside the innermost loop and for initializing and starting the loop itself, which requires about 10 clock cycles.

On a PC with a 2 GHz Intel CPU it takes about 25 days to process 4.37 * 10 ¹⁵ clock cycles.

A computing time of 25 days is unacceptably long in practice, so that with the current state of the art, SOM analysis cannot be used in many important applications, even for only moderately large data sources.

Some factors, such as the number of data records and the number of neurons, are given and therefore not accessible to optimization. Also, the fact that 4 loops are needed (over the iterations, data sets, neurons, and features) cannot be changed. Acceleration approaches must therefore start with the other terms in the product 200 * 10⁷ * 30 * 40 * (120 * 15 + 20):

The factor 'number of iterations required for convergence' (200) can be reduced by the multi-grid approach. A reduction of the computational effort by about a factor of 4 is possible with the proposed technique.

The factor 'normalized features' (120) can be reduced by rearranging the instructions so that the innermost loop runs only over the original features.

The factor 'clock cycles per innermost loop pass' (15) can be reduced by changing the algorithm and the memory structure, whereby, for example, the if query can be moved out of the innermost loop. In addition, in the innermost loop, the number of different fields traversed can be reduced from 3 to 2.

For the factor 'overhead around the innermost loop and loop initialization' (20), it is advantageous to interchange the loop order. For data with few features and/or predominantly numeric and/or binary features, the innermost loop often consists of only 5-10 passes. This is numerically unfavorable. High computational throughput is made possible by long innermost loops.

In the following, a new technique is introduced which represents a considerable improvement over the known approaches and which implements the multi-grid approach. The proposed procedure is given below in pseudocode form. The numbers in parentheses are comment marks that are explained below with reference to the above-mentioned prior-art approach. The operation is shown schematically in Figs. 3 and 4.

For all iteration steps i (e.g. approx. 200) {
    For all training data sets d (approx. 10⁴ - 10⁹) {
        minimum_distance := 10³⁰⁰
        best_Neuron := -1
(1)     Set all distance_to_d[n] := 0
(2)     For all numerical features m (for example, about 0-50) {
(3)         If d[m] is present and valid {
(4)             value := d[m]
(5)             weight_m := weight[m]
(6)             For all neurons n (e.g., about 1000) {
(7)                 distance_to_d[n] += (weight_m[n] - value)²
                }
            }
        }
        For all binary features m (e.g., about 0-50) {
            If d[m] is present and valid {
(8)             value := (d[m] == 0)
                weight_m := weight[m]
                For all neurons n (about 1000) {
                    distance_to_d[n] += (weight_m[n] - value)²
                }
            }
        }
(9)     For all nominal features m (e.g., approximately 0-50) {
            If d[m] is present and valid {
(10)            weight_m := weight[normalized_index[m] + d[m]]
                For all neurons n (e.g., about 1000) {
(11)                distance_to_d[n] += (weight_m[n] - 1)²
                }
            }
        }
(12)    For all neurons n (for example, about 1000) {
            If distance_to_d[n] < minimum_distance {
                minimum_distance := distance_to_d[n]
                best_Neuron := n
            }
        }
        Shift the weights of best_Neuron and its neighbors towards d
    }
}

To (1): The two innermost loops are interchanged compared to the previous technique. Therefore, the cumulative distance between the neuron and the data set over all features is no longer stored in a scalar variable but in a field of the length 'number of neurons'.

To (2): The loop runs over the number of numeric features. Because of the interchanged loops, the three feature types can be treated separately, which additionally improves the performance of the new technique over the previous one. It also reduces the complexity of the innermost loop.

To (3): In the prior art, the test for the presence of the feature value in the data set causes a slowdown, since it costs at least 3 clock cycles in the innermost loop. In the new technique, the same test is significantly cheaper because it takes place outside of the inner loop and, if the value is missing, makes it unnecessary to run through the inner loop at all.

To (4): The current feature value from the record is an invariant of the innermost loop. The field access can be done once outside the loop. This saves clock cycles and at the same time reduces the number of different fields that must be worked with in the innermost loop from 3 to 2. The pipelining heuristic of the CPU can thus provide the field values of the remaining two fields more quickly.

To (5): One aspect of the new technique is that it works more efficiently due to a reorganized storage structure of the neuron weights. In previous SOM implementations, all weights of a neuron are stored in a contiguous memory area. In the new technique, the feature weights of a single neuron are stored "scattered" in memory, and the weights of all neurons for a given feature are stored in a contiguous memory area.

To (6), (7): The innermost loop is redesigned for efficient number processing by a CPU. Because of the length of the loop (≥ 1000, regardless of the data characteristics), the loop overhead and the entire computation outside the loop carry no significant weight. In addition, the code within the loop is kept very simple and consists of only one parallel sequential pass through two floating point number fields and a few elementary floating point operations on the field values. On vector computer architectures, the loop can be vectorized well, allowing the entire loop to be processed in a few clock cycles instead of several thousand clock cycles.

To (8): The treatment of the binary feature values within the loop over all binary features assumes that, in the compressed storage, a 'true' or 1 value has the value index 0 and a 'false' or 0 value has the value index 1. A neuron weight of x (0 < x < 1) indicates that the neuron expects a 'true' or 1 value with probability x.

To (9), (10): The loop runs over the number of original nominal features. However, since the matrix of neuron weights still contains several different weights per nominal feature, the normalized feature index associated with the current feature value index d[m] of the original feature m in (10) must be calculated using an auxiliary field. This auxiliary field, normalized_index[m], gives the position (index) of the normalized feature that corresponds to the value index 0 of the original feature m. If d[m] is added to this index, the result is the correct index of the normalized feature belonging to the original feature m and its value index d[m].
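The auxiliary field can be built once before training as a running sum over the value counts of the nominal features. The following sketch assumes, as in the class SOMTraining, that the normalized fields are ordered as numeric fields, then binary fields, then the expanded nominal fields; the function name buildNormalizedIndex is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// For each nominal feature m, returns the index of the normalized feature
// that corresponds to value index 0 of m. The normalized feature of value
// index v of feature m is then normalizedIndex[m] + v.
std::vector<std::size_t> buildNormalizedIndex(
    std::size_t nbNumFlds, std::size_t nbBinFlds,
    const std::vector<std::size_t>& nbNomValues)
{
    std::vector<std::size_t> normalizedIndex(nbNomValues.size());
    // nominal fields start after all numeric and binary fields
    std::size_t next = nbNumFlds + nbBinFlds;
    for (std::size_t m = 0; m < nbNomValues.size(); ++m) {
        normalizedIndex[m] = next;
        next += nbNomValues[m]; // one normalized field per valid value
    }
    return normalizedIndex;
}
```

With 2 numeric fields, 1 binary field, and two nominal fields having 3 and 2 values, the field is {3, 6}; value index 1 of the second nominal feature then maps to the normalized feature 7.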

To (11): Through the preliminary work in (9) and (10), even in the case of nominal features, the innermost loop over all neurons is simple. In particular, a corrective feature-dependent weighting factor is no longer required since only one loop pass and distance contribution are calculated for each nominal feature.

To (12): The determination of the neuron with the smallest squared distance to the data set now happens at a position outside of two inner loops, so that the computational effort applied for this does not matter.

The distances that are calculated and used do not have to be squared distances. Rather, it is only important that the distances each have the properties of a metric. A metric is a mathematical function that assigns to each pair of elements of a space a non-negative real value, which can be understood as the distance between the two elements.
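The reorganized search for the best neuron can be sketched as follows. This is an illustration under stated assumptions, not the method trainSOM(): the struct Record, the function name bestNeuron, and the valid flags are hypothetical, and the loop over binary features is omitted for brevity (it follows the numeric pattern with value := (d[m] == 0)). The weights are stored feature by feature, so that each inner loop passes sequentially through one contiguous row of weights and the array of distances. The comment marks (1) to (12) refer to the pseudocode above.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical record layout: numeric values and nominal value indexes,
// each with a flag telling whether the value is present and valid.
struct Record {
    std::vector<double> numeric;
    std::vector<bool> numericValid;
    std::vector<std::size_t> nominal;
    std::vector<bool> nominalValid;
};

// weights are feature-major: the weight of neuron n for normalized feature f
// is weights[f * nbNeurons + n]. Numeric features come first, followed by
// the expanded nominal features located via normalizedIndex.
std::ptrdiff_t bestNeuron(const std::vector<double>& weights,
                          std::size_t nbNeurons, const Record& d,
                          const std::vector<std::size_t>& normalizedIndex)
{
    std::vector<double> distanceToD(nbNeurons, 0.0);                 // (1)
    for (std::size_t m = 0; m < d.numeric.size(); ++m) {             // (2)
        if (!d.numericValid[m]) continue;                            // (3)
        const double value = d.numeric[m];                           // (4)
        const double* weightM = weights.data() + m * nbNeurons;      // (5)
        for (std::size_t n = 0; n < nbNeurons; ++n) {                // (6)
            const double diff = weightM[n] - value;
            distanceToD[n] += diff * diff;                           // (7)
        }
    }
    for (std::size_t m = 0; m < d.nominal.size(); ++m) {             // (9)
        if (!d.nominalValid[m]) continue;
        const double* weightM = weights.data()
            + (normalizedIndex[m] + d.nominal[m]) * nbNeurons;       // (10)
        for (std::size_t n = 0; n < nbNeurons; ++n) {
            const double diff = weightM[n] - 1.0;
            distanceToD[n] += diff * diff;                           // (11)
        }
    }
    double minimumDistance = std::numeric_limits<double>::max();
    std::ptrdiff_t best = -1;
    for (std::size_t n = 0; n < nbNeurons; ++n) {                    // (12)
        if (distanceToD[n] < minimumDistance) {
            minimumDistance = distanceToD[n];
            best = static_cast<std::ptrdiff_t>(n);
        }
    }
    return best;
}
```

The inner loops contain no branch and touch only two arrays, which is the property the notes on (3), (4), (6) and (7) rely on.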

In comparison to the considerations for the previous technique and assuming that 5% of all feature values in the data are not present or invalid, the new algorithm yields the following estimate of the required clock cycles:

200 * 10 ⁷ * (3 * 9.5 * (15 + 1200 * 6) + 200 + 1200 * 4 + 10000) = 4.4 * 10 ¹⁴ .

The following is an explanation of some components of the calculation:

• 200: Estimated cost of zeroing the field of distances, for example by fast memory block copying, and of all other non-loop operations performed once per record.

• 1200 * 4: The final loop (12) to find the best neuron.

• 3: The three feature types are processed one after the other.

• 9.5: There are 10 features for each feature type. If the feature value is missing in the current data record, which is the case with 5% probability, the further code need not be run through; see condition (3).

• 15: Effort for preparing and starting the loops across all neurons ((4), (5), (6)).

• 1200 * 6: 6 clock cycles per loop pass through the neurons (7): two sequential field accesses, three elementary arithmetic operations (+, -, *), one loop counter increment.

• 10000: Estimated effort needed to modify the weights of the best neuron and its neighbors.

Thus, the computing time for the example is reduced from about 25 days to about 2.5 days, i.e. by about a factor of 10.

In addition, if the multi-grid approach is applied and this technical measure results in an acceleration factor of 4, then the calculation time of the above example is approximately 14 hours. That is, an analysis may, for example, be done in one night, and the result can be fed back into the production process the very next day.

In the above example, the fact that a third of the features are nominal features that have relatively many different values contributed a factor of 4 to the overall factor of 40. For purely numerical and binary data, the speed gain therefore decreases at first to about a factor of 10-12. However, another technical effect then often occurs: in the absence of nominal features, the loop over all normalized features is quite short, about 5 - 20 passes. As a result, the overhead around the innermost loop in the previous procedure takes up a considerable part of the computing time. This negative effect of the short inner loop is eliminated in the new technique, so that the overall speed gain increases to a factor of 15 or more.

Thus, the proposed technique is expected to improve the speed by a factor of between about 15 (data without nominal features) and up to 100 (purely nominal data with many values per feature). The comparison calculations between the technique proposed here and the state of the art in the form of the DataCockpit 1.03 implementation confirm this.

The new technique can also be represented in a pseudo-code formulation as follows:

■ Define m := number of features (columns) in the training data
■ For all changes in neuron network size
    o Define n := number of neurons in the network (e.g. 100-1000)
    o Write the neuron weights into a 2-dimensional numeric field called 'weights' with the dimensions m and n.
    o For all iteration steps (e.g. approx. 10 - 200)
        ■ For all data records or a subset of the training data sets (e.g. 10000 - all)
            • Define ds := index (position) of the current data set in the selected subset
            • Allocate a field variable 'distances' of length n, which will contain the distances between the current data record and all neurons. Set all initial values in this field to 0.
            • Define a numeric variable minimum_distance and set it to a very large value, for example the maximum representable floating point number.
            • For all features m that have a valid value w[m] in the dataset ds
                o For all neurons n: Add the distance between weights[m][n] and w[m] to the field element distances[n].
            • For all neurons n for which distances[n] < minimum_distance applies
                o minimum_distance := distances[n]
                o best_Neuron := n
            • For all features m that have a valid value w[m] in the dataset ds
                o Shift the weights weights[m][best_Neuron] and the weights of certain neighbor neurons of best_Neuron, weights[m][neighbor_Neuron], in the direction of w[m].

The first loop over all features in the pseudo-code can be replaced by multiple loops, each of which iterates over only a portion of the features. For example, one loop can iterate over the numeric features, one over the binary features and/or one over the textual features. This may be advantageous because the distance calculation between weights[m][n] and w[m] may be defined differently for different feature types, and time-consuming branches (if-then queries) can thus be avoided in the innermost, most computationally intensive loop.

The field of neuron weights, called 'weights' in the above pseudo-code, can in particular be implemented such that it consists of m gapless sequences of n numeric field cells each. This is particularly efficient because all of the computation-intensive inner loops that access weights[m][n] then iterate over a gapless sequence of memory addresses. This approach is optimal for modern CPUs with a pipelining architecture. Under certain circumstances, several field accesses can even be performed in one CPU clock cycle ('vectorization'). By applying this principle, the distances can also be calculated particularly efficiently on at least one computer based on a pipelining architecture.
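The memory layout described above can be sketched as follows. This is a minimal illustration, not the implementation from the listing: the function names squaredDistances and bestNeuron, the use of std::vector and the 'valid' flags for missing values are assumptions made for the example. The weights are stored feature-major, so the inner loop over all neurons walks a gapless run of memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (name and signature are assumptions for this sketch).
// 'weights' is stored feature-major: the n weights of feature 0, then the
// n weights of feature 1, and so on. 'valid[f]' tells whether the record
// has a value for feature f.
std::vector<double> squaredDistances(const std::vector<double>& weights,
                                     const std::vector<double>& record,
                                     const std::vector<bool>& valid,
                                     std::size_t nbNeurons)
{
    assert(weights.size() == record.size() * nbNeurons);
    assert(valid.size() == record.size());
    std::vector<double> dist(nbNeurons, 0.0);
    for (std::size_t f = 0; f < record.size(); ++f) {
        if (!valid[f]) continue;                           // skip missing values
        const double* w = weights.data() + f * nbNeurons;  // gapless run of n cells
        const double v = record[f];
        for (std::size_t j = 0; j < nbNeurons; ++j) {      // no branches inside
            const double d = v - w[j];
            dist[j] += d * d;
        }
    }
    return dist;
}

// Returns the index of the neuron with the smallest accumulated distance.
std::size_t bestNeuron(const std::vector<double>& dist)
{
    assert(!dist.empty());
    std::size_t best = 0;
    for (std::size_t j = 1; j < dist.size(); ++j)
        if (dist[j] < dist[best]) best = j;
    return best;
}
```

The inner loop touches only two arrays (one run of weights and the distance buffer) and no iteration depends on the previous one, which is the property the text relies on for pipelining and vectorization.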

The distance between the neuron weights and the feature values of the current training data set can be calculated, for example, via the least-squares distance (Euclidean distance). However, any other distance measure can also be used.
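One way to keep the distance measure exchangeable without putting a branch into the inner loop is to pass the per-feature distance term as a compile-time parameter. The sketch below is an assumption for illustration only; SquaredTerm, AbsoluteTerm and accumulateFeature are hypothetical names and do not appear in the example implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-feature contribution to the squared Euclidean distance.
struct SquaredTerm {
    double operator()(double a, double b) const { const double d = a - b; return d * d; }
};

// Per-feature contribution to the city-block (Manhattan) distance,
// shown as one example of an alternative distance measure.
struct AbsoluteTerm {
    double operator()(double a, double b) const { return std::fabs(a - b); }
};

// Adds the contribution of one feature to the distances of all neurons.
// 'neuronWeights' points to the gapless run of weights for this feature,
// one entry per neuron; 'distances' has one entry per neuron.
template <typename Term>
void accumulateFeature(const double* neuronWeights, double value,
                       std::vector<double>& distances, Term term)
{
    assert(neuronWeights != nullptr);
    for (std::size_t j = 0; j < distances.size(); ++j)
        distances[j] += term(value, neuronWeights[j]);
}
```

Because the term is a template parameter, the compiler can inline it, so the choice of measure costs no if-then query per loop pass.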

The training data may have been compressed and indexed for faster access before entering the above procedure. For example, such preprocessing can replace textual values with integer value indices or discretize floating point values into discrete intervals. In the example implementation, the presented storage and control flow organization is implemented, together with the multi-grid approach, in the method 'trainSOM()' of the already introduced class 'SOMTraining'.
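The two preprocessing steps mentioned above could look roughly as follows. This is a sketch under stated assumptions: the function names encodeNominal and discretize are hypothetical, the frequency-based ordering of the indices follows the directory described in the claims, and equal-width intervals are only one possible way of discretizing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Replaces each textual value by an integer index. The most frequent value
// receives index 0; ties keep alphabetical order. 'dictionary' returns the
// value belonging to each index.
std::vector<std::size_t> encodeNominal(const std::vector<std::string>& column,
                                       std::vector<std::string>& dictionary)
{
    typedef std::pair<std::string, std::size_t> Entry;
    std::map<std::string, std::size_t> counts;
    for (const std::string& v : column) ++counts[v];

    std::vector<Entry> byFreq(counts.begin(), counts.end());
    std::stable_sort(byFreq.begin(), byFreq.end(),
                     [](const Entry& a, const Entry& b) { return a.second > b.second; });

    std::map<std::string, std::size_t> index;
    dictionary.clear();
    for (const Entry& e : byFreq) {
        index[e.first] = dictionary.size();
        dictionary.push_back(e.first);
    }

    std::vector<std::size_t> encoded;
    encoded.reserve(column.size());
    for (const std::string& v : column) encoded.push_back(index[v]);
    return encoded;
}

// Maps a floating point value to one of nbBins equal-width intervals over
// [lo, hi]; values outside the range are clamped to the first or last bin.
std::size_t discretize(double value, double lo, double hi, std::size_t nbBins)
{
    assert(nbBins > 0 && hi > lo);
    if (value <= lo) return 0;
    if (value >= hi) return nbBins - 1;
    return static_cast<std::size_t>((value - lo) / (hi - lo) * nbBins);
}
```

After this step the training loop only sees small integers and normalized numbers, so no string comparison is needed inside the distance calculation.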

The method SOMTraining::trainSOM():

bool SOMTraining::trainSOM()
{
    DataRecord record(ivDescr.getNbOfNumericFields(),
                      ivDescr.getNbOfCategoricalFields());
    bool ok = true;

    // loop over all neural network expansion steps
    while (ivNbNeurons <= ivMaxNbNeurons && ok == true) {

        // loop over all iterations for a fixed network size
        size_t iteration = 0;
        double maxNeighborDist = ivParams.getMaxNeighborDist();
        if (maxNeighborDist < 1.) maxNeighborDist = 2.1;
        double invMaxNeighborDist = 1. / maxNeighborDist;
        double learningRate = ivParams.getLearningRate();
        double changeInIteration = DBL_MAX;

        while (changeInIteration > 0.01 * ivNbRecords && iteration < 50 && ok) {
            iteration++;
            changeInIteration = 0.;

            // loop over all training data records
            vector<DataPage*>::const_iterator pageIt = ivData.begin();
            (*pageIt)->initRetrievalMechanism();
            for (size_t iRec = 0; iRec < ivNbRecords; iRec++) {
                ok = (*pageIt)->retrieveNextDataRecord(record);
                if (!ok) {
                    // current page is exhausted: continue with the next page
                    pageIt++;
                    if (pageIt == ivData.end()) {
                        if (iRec + 1 < ivNbRecords) ok = false;
                        else ok = true;
                    } else {
                        (*pageIt)->initRetrievalMechanism();
                        ok = (*pageIt)->retrieveNextDataRecord(record);
                    }
                    if (!ok) break;
                }

                // clean up the squared distance buffer
                double* const pDistStop = pivDistances + ivNbNeurons;
                for (double* pDist = pivDistances; pDist < pDistStop; pDist++)
                    *pDist = 0.;

                // calculate squared distances between record and neurons for all
                // numeric fields
                double* pNeur = pivSOM;
                for (size_t iFld = 0; iFld < ivNbNumFlds; iFld++) {
                    double value = record.getNormalizedNumericValue(iFld);
                    if (value == DBL_MAX)
                        pNeur += ivNbNeurons;  // missing value: skip this field
                    else
                        for (double* pDist = pivDistances; pDist < pDistStop; pDist++, pNeur++) {
                            double dist = value - *pNeur;
                            *pDist += dist * dist;
                        }
                }

                // calculate squared distances between record and neurons for all
                // binary fields
                for (size_t iFld = 0; iFld < ivNbBinFlds; iFld++) {
                    double value = 1. - record.getCategoricalIndex(iFld);
                    if (value < -0.01)
                        pNeur += ivNbNeurons;  // index 255 marks a missing value
                    else
                        for (double* pDist = pivDistances; pDist < pDistStop; pDist++, pNeur++) {
                            double dist = value - *pNeur;
                            *pDist += dist * dist;
                        }
                }

                // calculate squared distances between record and neurons for all
                // nominal fields
                for (size_t iFld = 0; iFld < ivNbNomFlds; iFld++) {
                    unsigned char index = record.getCategoricalIndex(ivNbBinFlds + iFld);
                    if (index != 255) {
                        pNeur += ivNbNeurons * index;
                        for (double* pDist = pivDistances; pDist < pDistStop; pDist++, pNeur++) {
                            double dist = 1. - *pNeur;
                            *pDist += dist * dist;
                        }
                        pNeur -= ivNbNeurons * (index + 1);
                    }
                    pNeur += ivNbNeurons * pivNbNomValues[iFld];
                }

                // find the neuron which minimizes the squared distance
                double minDistSqr = pivDistances[0];
                double* pMinDist = pivDistances;
                for (double* pDist = pivDistances + 1; pDist < pDistStop; pDist++)
                    if (*pDist < minDistSqr) {
                        minDistSqr = *pDist;
                        pMinDist = pDist;
                    }
                size_t iBest = pMinDist - pivDistances;

                // move the winning neuron towards the record
                pNeur = pivSOM + iBest;
                for (size_t i = 0; i < ivNbNumFlds; i++, pNeur += ivNbNeurons) {
                    double normVal = record.getNormalizedNumericValue(i);
                    if (normVal != DBL_MAX) {
                        normVal -= *pNeur;
                        normVal *= learningRate;
                        *pNeur += normVal;
                        changeInIteration += fabs(normVal);
                    }
                }
                for (size_t i = 0; i < ivNbBinFlds; i++, pNeur += ivNbNeurons)
                    if (record.getCategoricalIndex(i) != 255) {
                        double change = learningRate * (1. - record.getCategoricalIndex(i) - *pNeur);
                        *pNeur += change;
                        changeInIteration += fabs(change);
                    }
                for (size_t i = 0; i < ivNbNomFlds; i++) {
                    unsigned char index = record.getCategoricalIndex(i + ivNbBinFlds);
                    if (index != 255) {
                        double* pNeurValue = pNeur + ivNbNeurons * index;
                        for (size_t j = 0; j < pivNbNomValues[i]; j++, pNeur += ivNbNeurons) {
                            *pNeur += learningRate * ((pNeur == pNeurValue ? 1. : 0.) - *pNeur);
                        }
                        changeInIteration += fabs(learningRate * (1. - *pNeurValue));
                    } else {
                        // missing value: skip the weights of this nominal field
                        pNeur += ivNbNeurons * pivNbNomValues[i];
                    }
                }

                // move the winning neuron's neighbors towards the record
                size_t nbNeighbors = pivNeighborhood[iBest].first;
                const NeighborDistance* const neighbors = pivNeighborhood[iBest].second;
                for (size_t n = 0; n < nbNeighbors; n++) {
                    const pair<size_t, double>& neigh = neighbors[n];
                    if (neigh.second >= maxNeighborDist) break;
                    double factor = (1. - neigh.second * invMaxNeighborDist);
                    factor *= factor * learningRate;
                    pNeur = pivSOM + neigh.first;

                    for (size_t i = 0; i < ivNbNumFlds; i++, pNeur += ivNbNeurons) {
                        double normVal = record.getNormalizedNumericValue(i);
                        if (normVal != DBL_MAX) {
                            *pNeur += factor * (normVal - *pNeur);
                        }
                    }
                    for (size_t i = 0; i < ivNbBinFlds; i++, pNeur += ivNbNeurons)
                        if (record.getCategoricalIndex(i) != 255) {
                            // same target value as for the winning neuron
                            *pNeur += factor * (1. - record.getCategoricalIndex(i) - *pNeur);
                        }
                    for (size_t i = 0; i < ivNbNomFlds; i++) {
                        unsigned char index = record.getCategoricalIndex(i + ivNbBinFlds);
                        if (index != 255) {
                            double* pNeurValue = pNeur + ivNbNeurons * index;
                            for (size_t j = 0; j < pivNbNomValues[i]; j++, pNeur += ivNbNeurons) {
                                *pNeur += factor * ((pNeur == pNeurValue ? 1. : 0.) - *pNeur);
                            }
                        } else {
                            // missing value: skip the weights of this nominal field
                            pNeur += ivNbNeurons * pivNbNomValues[i];
                        }
                    }
                } // end of loop across all neighbors
            } // end of loop over all training data records

            // This sample code reduces both the learning rate and the maximum
            // distance (up to which neighbors of the winning neuron are modified)
            // by a factor of 0.88 after each 8-th iteration.
            // Many other modification schemes are possible here, e.g. smaller
            // adjustments after each single iteration, or different adjustment
            // factors for learning rate and maximum neighbor distance.
            if ((iteration % 8) == 0) {
                maxNeighborDist *= 0.88;
                invMaxNeighborDist /= 0.88;
                learningRate *= 0.88;
            }
        } // end of loop over all iterations for a fixed network size

        // expand the neural network.
        // This sample code implements the most elementary multi-grid approach: we
        // start with the smallest, coarsest grid and perform a series of expansion
        // steps until the largest, finest grid of neurons is reached. No grid
        // shrinking steps are performed.
        // In the field of multi-grid approaches for solving large linear equation
        // systems, it has been shown that intermixing grid expansion and grid
        // shrinking steps can increase the overall speed of the method.
        // The same might hold for the multi-grid approach to training SOM networks.
        if (ivNbNeurons < ivMaxNbNeurons && ok) {
            ok = expandSOMNetwork();
        } else break;
    } // end of loop over all network expansion steps
    return ok;
}
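The listing calls an expansion method whose body is not shown here. As a rough sketch of what one expansion step might do, the following function inserts a new column of neurons between every pair of adjacent columns of a two-dimensional grid and initializes each new weight by linear interpolation of its two horizontal neighbors. The name insertColumns, the row-major storage of one feature's weights and the restriction to interior columns are assumptions for this example; the actual method may use a different scheme (e.g. cubic interpolation, or extrapolation for peripheral neurons).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of one grid expansion step for the weights of a single
// feature. 'grid' holds rows * cols weights in row-major order. The result
// has 2 * cols - 1 columns: existing neurons keep their weights, each newly
// inserted neuron receives the mean of its left and right neighbor.
std::vector<double> insertColumns(const std::vector<double>& grid,
                                  std::size_t rows, std::size_t cols)
{
    assert(cols >= 1);
    assert(grid.size() == rows * cols);
    const std::size_t newCols = 2 * cols - 1;
    std::vector<double> out(rows * newCols);
    for (std::size_t r = 0; r < rows; ++r) {
        const double* row = grid.data() + r * cols;
        for (std::size_t c = 0; c < newCols; ++c) {
            if (c % 2 == 0)
                out[r * newCols + c] = row[c / 2];                            // existing neuron
            else
                out[r * newCols + c] = 0.5 * (row[c / 2] + row[c / 2 + 1]);  // inserted neuron
        }
    }
    return out;
}
```

Because the new neurons start from interpolated weights rather than random ones, the training on the finer grid continues from the state reached on the coarser grid, which is the idea behind the multi-grid approach.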

With the presented data preparation technique and SOM analysis, low-cost, readily available hardware can be used to build SOM analyses for data volumes of up to about 10-20 GB in a reasonable amount of time, even if the computers have only one CPU core.

In addition, the proposed technique also allows the SOM analysis process to be parallelized for even better response times. The following approaches are particularly suitable for parallelizing the presented technique:

• Vector computers

• SMP computers (shared-memory parallelism)

• MPP computers (massively parallel shared-nothing architectures)

• Networked computer clusters of several analysis computers (10, 20)

Vector computers are computers with a single CPU which, however, has several vector registers for e.g. 128, 256 or 512 floating-point numbers (e.g. the SX supercomputer series from NEC). With the help of these vector registers, elementary numerical operations between number fields (vectors) can be performed in one clock cycle. The presented SOM training technique is designed so that the main part of the computational work takes place inside long loops in the form of elementary arithmetic operations between numeric arrays and scalar values, and so that the computations of one loop pass do not depend on the results of previous loop passes. Therefore, the proposed technique can be used virtually unchanged on vector computers and will provide an almost linear speed increase there (by a factor of 100-200).

Modern multi-core computing architectures such as Intel multi-core processors or IBM Cell processors can likewise already achieve a speed increase through parallel processing. On a parallel computer with distributed memory or on a network of computers, the presented SOM training technique can be parallelized with the help of a message passing interface such as MPI-1 or MPI-2.

The following modification is one way to accomplish this. The lines of code with [Master:] at the beginning are executed only by the master or coordinator process, all other lines of code by all parallel processes:

[Master:] Divide the compressed training data into many data processing objects.
[Master:] Send the data processing objects to different slaves
Read the data processing objects sent by the master into the main memory
[Master:] Construct initial configuration of the SOM network and training parameters
Read the initial configuration and parameters sent by the master
For all SOM network expansion steps (approx. 4 - 10) {
  For all iteration steps (approx. 20 - 200) {
    As long as there are still training data records available (approx. 10 - 1000 repetitions) {
      [Master:] Send SOM network (weights), learning rate, radius, recordsPerStep
      Read the information sent by the master
      For recordsPerStep data records d (approx. 10^5 - 10^6) {
        (calculate distances between d and all neurons as in the serial case)
        (find best_neuron as in the serial case)
        Save the index of the winner neuron in the list 'winners'
      }
      Send the list 'winners' to the master
(1)   [Master:] Collect the lists 'winners' of all slaves
      [Master:] For all slave processes (approx. 4 - 128) {
      [Master:]   For all entries n in the winner list (approx. 10^5 - 10^6) {
      [Master:]     Read the corresponding training data record
      [Master:]     Modify the neuron weights (winner neuron + neighbors)
      [Master:]   }
(2)   [Master:] }
    }
  }
}

Here, the adaptation of the SOM network is no longer synchronous (i.e. immediately after determining the winning neuron for a record), but asynchronous. The individual slave processes determine the winner neurons for a given number of records and return the identifiers of those neurons to the master process. The master process collects the winner lists of all slave processes, then carries out all modifications of the neuron weights and sends the new SOM network back to the slave processes.
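The division of work between slave and master processes can be illustrated without any real message passing. In the sketch below, findWinners plays the role of a slave (it only reads a fixed snapshot of the network) and applyWinners plays the role of the master (it alone modifies the weights). Both names, the Record type and the omission of the neighbor update are assumptions made to keep the example short; in an MPI setting the function arguments and return values would correspond to the messages exchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Record;  // one normalized value per feature

// Slave side: determines the winner neuron for every record of a tranche,
// using a snapshot of the feature-major weights that is never modified here.
std::vector<std::size_t> findWinners(const std::vector<double>& snapshot,
                                     std::size_t nbNeurons,
                                     const std::vector<Record>& tranche)
{
    std::vector<std::size_t> winners;
    winners.reserve(tranche.size());
    std::vector<double> dist(nbNeurons);
    for (const Record& rec : tranche) {
        assert(snapshot.size() == rec.size() * nbNeurons);
        dist.assign(nbNeurons, 0.0);
        for (std::size_t f = 0; f < rec.size(); ++f)
            for (std::size_t j = 0; j < nbNeurons; ++j) {
                const double d = rec[f] - snapshot[f * nbNeurons + j];
                dist[j] += d * d;
            }
        std::size_t best = 0;
        for (std::size_t j = 1; j < nbNeurons; ++j)
            if (dist[j] < dist[best]) best = j;
        winners.push_back(best);
    }
    return winners;
}

// Master side: applies the collected winner list of one slave to the network.
// Only the winner neuron is moved; the neighbor update is omitted for brevity.
void applyWinners(std::vector<double>& weights, std::size_t nbNeurons,
                  const std::vector<Record>& tranche,
                  const std::vector<std::size_t>& winners, double learningRate)
{
    assert(tranche.size() == winners.size());
    for (std::size_t i = 0; i < tranche.size(); ++i)
        for (std::size_t f = 0; f < tranche[i].size(); ++f) {
            double& w = weights[f * nbNeurons + winners[i]];
            w += learningRate * (tranche[i][f] - w);
        }
}
```

Only the winner indices travel from slave to master, one integer per record, which is why the communication volume stays small compared with the computation done on each side.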

To keep the asynchrony from becoming arbitrarily large, the technique shown above provides that the slave processes do not work through their entire data before sending the winner lists to the master process, but only a certain number of records (e.g. 10,000, 100,000 or 1 million). As a result, despite the asynchronicity, the latest SOM network is communicated to all processes several times within an iteration and used as the basis for further calculations.

The need for communication between the individual processes is so low that this type of parallelization even works on a network of computers that are connected only via an intranet or an Internet connection of DSL quality (6 Mbit/s).

With the process presented above, for a small number of parallel processes, an approximately linear speed gain with the number of processes is possible. For a larger number of parallel processes, the speed gain converges to a constant factor between 20 and 30.

If this ceiling of about 20-30 on the speed gain is to be overcome, the algorithm can, for example, be modified so that the slave processes already work on the next tranche of data records while the master process is still modifying the SOM network with the help of the winner lists collected for the previous tranche. This means that the slave processes always determine their winning neurons on the second most current network instead of the most current one.

It is clear that the processes (slave and master processes) can be carried out on different analysis computers (10, 20) which are interconnected, for example, by a network or another data connection (14). In this case, at least one analysis computer (10), which executes the at least one master process, cooperates with at least one further analysis computer (20), on which at least one slave process is executed. This is shown schematically in FIG. 6.

Claims


1. An electronic data processing system for analyzing data, comprising at least one analysis computer (10), wherein the analysis computer (10) is adapted and programmed to implement a self-adapting neuron network that is to be subjected to training with a plurality of data sets having many features, wherein initial neuron weights are to be assigned to the neurons of the neuron network, neuron weights obtained from the plurality of data sets with their many features are to be assigned to the neurons of the neuron network, the training comprises several training phases, and each training phase has a certain number of training runs, wherein at the beginning of each training phase either

neurons are to be inserted into the neuron network whose neuron weights result at least in part from weights of existing neurons, or

neurons are to be removed from the neuron network and the neuron weights of the remaining neurons are to be weighted at least partially with parts of the weights of the removed neurons.

2. The electronic data processing system according to claim 1, characterized in that for each training phase at least a selection of the data sets is to be used to weight the neuron weights of the neurons of the neuron network, wherein for each training phase, depending on the current size of the neuron network, a different number of training runs is selected for training the neurons with the features, the training being executed until either the maximum predetermined number of training runs is reached or the training converges in that the feature weights of the neurons no longer change significantly.

3. The electronic data processing system according to claim 1 or 2, characterized in that between two training phases for which neurons are to be inserted into the network, at least one training phase is to be performed for which neurons are to be removed from the network.

4. The electronic data processing system according to any one of claims 1-3, characterized in that, upon removal of a neuron, either only the remaining neurons immediately adjacent to the neuron to be removed are to be re-weighted, or the remaining neurons are to be re-weighted by means of a linear, cubic or exponential spline interpolation or any other interpolation rule involving multiple neighbor neurons.

5. The electronic data processing system according to any one of claims 1-4, characterized in that the neurons of the neuron network are to be arranged as nodes of a multi-dimensional, preferably two-dimensional, matrix.

6. The electronic data processing system according to the preceding claim, characterized in that, when neurons are removed from or inserted into the neuron network, rows or columns are to be removed from or inserted into the matrix.

7. The electronic data processing system according to any one of claims 1-6, characterized in that the weights of all neurons for a particular feature are to be stored in a contiguous storage area of the analysis computer (10).

8. The electronic data processing system according to any one of claims 1-7, characterized in that the initial neuron weights of the neurons of the neuron network are to be determined by a heuristic method.

9. The electronic data processing system according to any one of claims 1-8, characterized in that the features are read once before the start of the training and the original features are transformed to purely numerical normalized features.

10. The electronic data processing system according to any one of claims 1-9, characterized in that the features are to be compressed and stored as training data before the training.

11. The electronic data processing system according to any one of claims 1-10, characterized in that the analysis computer (10) creates an initial configuration of the neuron network and training parameters and sends the initial configuration and the training parameters to at least one further analysis computer (20).

12. The electronic data processing system according to the preceding claim, characterized in that the initial configuration of the neuron network and the training parameters are read in by the at least one further analysis computer (20).

13. The electronic data processing system according to any one of claims 1-12, characterized in that the analysis computer (10), for all training phases and/or for all training runs, sends the neuron weights and/or a learning rate and/or a radius and/or the number of iteration steps to the at least one further analysis computer (20), the at least one further analysis computer (20) reads in the information sent by the analysis computer (10) and, for the number of training runs read in, in each case calculates distances between the training data sets and the weights of the neurons, determines a winner neuron, stores the winner neuron in each case in a list and sends the list to the analysis computer (10) after the number of iteration steps, and the analysis computer (10) receives the list of winner neurons from the at least one further analysis computer (20) and, based thereon, modifies the weights of the winner neurons and their neighbors.

14. The electronic data processing system according to any one of claims 1-13, characterized in that the training data are divided into data objects by the analysis computer (10) and the data objects are sent to the at least one further analysis computer (20), wherein the data objects are to be dimensioned before sending so that they fit completely into the main memory of the at least one further analysis computer (20).

15. A method of training a neuron network, comprising the steps of:

■ storing the number of features (columns) in the training data in a first value,
■ performing the following steps for all changes in neuron network size:
  o storing the number of neurons in the network in a second value,
  o storing initial neuron weights in a two-dimensional first field, with a first dimension of the field determined by the first value and a second dimension of the field determined by the second value,
  o performing the following steps for all iteration steps:
    ■ performing the following steps for all data sets or a subset of the training data sets:
      • reserving a second field for storing distances between the current training data set and all neurons,
      • setting all values in this second field to a uniform initial value,
      • setting a value for a minimum distance to a predetermined value which is selected so large that it is certainly greater than any actual distance between the current training data set and each neuron of the neuron network,
      • performing the following steps for all features that have a valid feature value in the current data set:
        o performing the following step for all neurons:
          ■ adding the distance value between the neuron weight located at a position of the first field determined by the first value and the second value and the valid feature value to a value at a position of the second field determined by the second value,
      • performing the following steps for all neurons for which the value stored at the position of the second field determined by the second value is less than the value of the minimum distance:
        o setting the minimum distance to the value stored at the position of the second field determined by the second value,
        o setting the current neuron as the best neuron,
      • performing the following step for all features m that have a valid value in the current training data set:
        o shifting the neuron weights of the best neuron stored in the first field which correspond to features that are valid in the current training data set.

16. The method of claim 15, characterized in that, at the beginning of each training phase, the neuron network size is changed either by enlarging the neuron network through the insertion of neurons whose neuron weights result at least in part from weights of existing neurons, or by reducing the neuron network through the removal of neurons, wherein upon removal the neuron weights of the remaining neurons are weighted at least partially with parts of the weights of the removed neurons.

17. The method of claim 15 or 16, characterized in that the features are numerical, Boolean and / or nominal features and the type of distance calculation for each of the feature types is done in different ways.

18. The method according to any one of claims 15-17, characterized in that the first loop over all features is replaced by a plurality of loops.

19. The method according to any one of claims 15 - 18, characterized in that one loop iterates over numerical features and/or one loop over binary features and/or one loop over textual features.

20. The method according to any one of claims 15-18, characterized in that the first field is arranged so that it consists of a number, corresponding to the first value, of gapless sequences each comprising a number, corresponding to the second value, of numeric field cells.

21. The method of any one of claims 15-20, characterized in that the distances between the neuron weights and the feature values of the current training data set are squared distances.

22. The method according to claim 15, characterized in that the training data are compressed and indexed before the beginning of the method, wherein textual values are replaced by integer value indices and/or floating point values are discretized into discrete intervals.

23. The method according to any one of claims 15 - 22, characterized in that, upon enlargement of the neuron network, the weights of the newly inserted neurons are determined by linear, cubic or other interpolation if they are interior neurons, and/or by extrapolation if they are peripheral neurons.

24. The method of any one of claims 15-23, characterized in that, when the neuron network is reduced, each neuron replaces several adjacent existing neurons and inherits, in each of its neuron weights, the mean of the corresponding neuron weights of the replaced neurons.

25. The method of claim 15, wherein the distance between a neuron and the current training data set is the squared distance or is determined by a distance measure that has the properties of a metric.

26. The method of claim 15, wherein for each training phase at least a selection of the training data sets is used to weight the neuron weights of the neurons of the neuron network, wherein for each training phase, depending on the current size of the neuron network, a different number of training runs is selected for training the neurons with the features, the training being executed until either the maximum predetermined number of training runs is reached or the training converges in that the feature weights of the neurons no longer change significantly.

27. The method according to any one of claims 15-26, characterized in that between two training phases for which neurons are to be inserted into the network, at least one training phase is to be executed for which neurons are to be removed from the network.

28. The method according to claim 15, characterized in that, upon removal of a neuron, either only the remaining neurons immediately adjacent to the neuron to be removed are to be re-weighted, or the remaining neurons are to be re-weighted by means of a linear, cubic or exponential spline interpolation or another interpolation rule involving multiple neighbor neurons.

29. The method according to any one of claims 15-28, characterized in that the neurons of the neuron network are to be arranged as nodes of a multi-dimensional, preferably two-dimensional, matrix.

30. The method according to claim 29, characterized in that, when neurons are removed from or inserted into the neuron network, rows or columns are to be removed from or inserted into the matrix.

31. The method according to any one of claims 15 - 30, characterized in that the weights of all neurons for a particular feature in a contiguous storage area of an analysis computer (10) are to be stored.

32. The method according to any one of claims 15 - 31, characterized in that the initial neuron weights of the neurons of the neuron network are to be determined by a heuristic method.

33. The method according to one of claims 15-32, characterized in that the features are read once before the start of the training and the features are transformed only once to numerical features.

34. The method according to any one of claims 15 - 33, characterized in that the features are to be compressed and stored as training data before the training.

35. The method according to any one of claims 15-34, characterized in that the analysis computer (10) creates an initial configuration of the neuron network and training parameters and sends the initial configuration and the training parameters to at least one further analysis computer (20).

36. The method according to any one of claims 15 - 35, characterized in that the initial configuration of the neuron network and the training parameters are read in by the at least one further analysis computer (20).

37. The method according to any one of claims 15-36, characterized in that the analysis computer (10), for all training phases and/or for all training runs, sends the neuron weights and/or a learning rate and/or a radius and/or the number of iteration steps to the at least one further analysis computer (20), the at least one further analysis computer (20) reads in the information sent by the analysis computer (10) and, for the number of training runs read in, in each case calculates distances between the training data sets and the weights of the neurons, determines a winner neuron, stores the winner neuron in each case in a list and sends the list to the analysis computer (10) after the number of iteration steps, and the analysis computer (10) receives the list of winner neurons from the at least one further analysis computer (20) and, based thereon, modifies the weights of the winner neurons and their neighbors.

38. The method according to any one of claims 15 - 37, characterized in that the training data are divided into data objects by the analysis computer (10) and the data objects are sent to the at least one further analysis computer (20), wherein the data objects are to be dimensioned before sending so that they fit completely into the main memory of the at least one further analysis computer (20).

39. The method according to any one of claims 15-38, characterized in that, when determining the distances between the neurons and the current training data set in the most computationally intensive part of the method, only gapless sequences of memory fields are accessed.

40. The method according to claim 39, characterized in that for determining the distances between the neurons and the current training data set long loops over at most 2 memory field variables are used.

41. The method of claim 15, wherein, for each nominal feature, the nominal values occurring in the training data are stored in a directory in which a provisional index is assigned to each feature value and which additionally counts the frequency of occurrence of each feature value, and each nominal value is replaced by its provisional index.

42. The method of claim 41, characterized in that the created directory is sorted by frequency of occurrence, a number of the most frequent values are each assigned a new index, and the provisional indices are replaced by the new indices.
