DE102008027605B4

DE102008027605B4 - System and method for computer-based analysis of large amounts of data

Info

Publication number: DE102008027605B4
Application number: DE102008027605A
Authority: DE
Inventors: Ansgar Dr. Dorneich
Original assignee: OPTIMING GmbH
Current assignee: Amyam De GmbH
Priority date: 2008-06-10
Filing date: 2008-06-10
Publication date: 2010-04-08
Anticipated expiration: 2028-06-11
Also published as: WO2009149926A3; WO2009149926A2; DE102008027605A1

Abstract

Electronic data processing system for the analysis of data, comprising
- at least one analysis server (10) and
- at least one on-site client computer (12), wherein
- the analysis server (10) is set up and programmed to implement a self-adapting neural network that is to be trained on a large database containing a large number of records with many features, wherein
- the on-site client computer (12) is set up and programmed to subject data supplied to it to
- data preprocessing and/or
- data compression
before the data are sent from the on-site client computer (12) to the analysis server (10) via an electronic network (14), wherein
- a data-type-dependent compression of the data is carried out, which anonymizes the data by transforming the data into a confidential part and a non-confidential part, the non-confidential part of the data being transmitted to the analysis server (10) and the confidential part of the data being stored separately on the on-site client ...

Description

Background

Currently available, inexpensive computer programs for data analysis (for example DataCockpit® 1.04) are noticeably slower in analysis than competing data mining workbenches (SPSS and others), can process only considerably smaller amounts of data, and have other disadvantages (they are programmed as a monolithic block, and their architecture and data handling are unsuitable for a client-server architecture, etc.).

For segmentation or prediction, data are mapped onto a one-, two- or three-dimensional self-adapting neural network (self-organizing map, 'SOM') [T. Kohonen, Self-Organization and Associative Memory, vol. 8 of Springer Series in Information Science, 3rd edition, Springer-Verlag, Berlin, 1989].

In SOM-based data analysis, a distinction is made between so-called 'Kohonen clustering' and so-called SOM map analysis. Kohonen clustering works with only very few neurons, typically between about 4 and about 20. Each of these neurons represents a 'cluster', i.e. a homogeneous group of records. This technique is used mainly for data segmentation and is implemented in many data mining software packages, for example SPSS Clementine or IBM DB2 Warehouse (see, for example, Ch. Ballard et al., Dynamic Warehousing: Data Mining Made Easy, IBM Redbook, 2007).

SOM map analysis, by contrast, uses relatively large neural networks of, for example, 30 · 40 neurons for data analysis. Here, homogeneous data segments are represented by local groups of neurons with similar feature values. SOM maps are used for data exploration, segmentation, prediction, simulation and optimization (see, for example, R. Otte, V. Otte, V. Kaiser, Data Mining für die industrielle Praxis, Hanser Verlag, Munich, 2004).

EP 0829809 B1 and EP 0845720 B1 may be cited as examples of further technological background.

To analyze an extensive data collection gathered on a computer (for example, production data from a manufacturing plant with about 10⁴ to 10¹⁰ records and about 3 to 1000 features per record) and, where appropriate, to feed the results of the analysis back into the production process, the existing records are presented again and again to a learning, self-adapting neural network.

These may be production data from the mechanical engineering, chemical, automotive or supplier industries: for example, 10 million produced units, 10 nominal items of component and production line information, 10 binary items of component and equipment information, and 10 numeric production data items (measured tolerance data, sensor data, recorded production times, machine data, ...). The aim of the SOM analysis here is quality assurance, fault source analysis, early warning and production process optimization. Another example would be customer data in retail, financial or insurance companies: 10 million customers, 10 nominal demographic features (marital status, occupational group, region, type of dwelling, ...), 10 binary features concerning interests and services/products used (gender; owns a credit card; uses online banking, ...), and 10 numeric features (annual income, age, annual turnover, creditworthiness, ...). The aim of the SOM analysis here is customer segmentation, the prediction of customer value, creditworthiness, claims risk, ... and the optimization of marketing campaigns.

Each neuron of the self-adapting neural network has as many signal inputs as each individual record has features. Once the neural network has 'learned' the data, the trained network can be used to carry out the following tasks, among others:

• Visual interactive data exploration: interactive discovery of interesting subgroups, correlations between features and general relationships with the aid of various visualizations of the data that are generated from self-organizing feature maps.
• Segmentation: dividing the entire data into homogeneous groups.
• Prediction: prediction of previously unknown feature values in individual records.
• Simulation: how would certain feature values of a record probably change if certain other feature values were deliberately changed?
• Optimization: if certain optimal values are to be achieved for a subset of the features, how should the remaining feature values be chosen?
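
The training and prediction tasks listed above can be illustrated with a minimal sketch. It is not the implementation described in this document: the function names (`train_som`, `best_matching_unit`, `predict_feature`), the linear decay of the learning rate, the Gaussian neighbourhood and the tiny 3 · 3 map are all assumptions chosen for brevity. The sketch only shows that each neuron holds one weight per feature, that records are presented repeatedly, and that an unknown feature value can be read from the neuron that best matches the known features.

```python
import math
import random

def best_matching_unit(weights, x, known=None):
    """Grid position of the neuron closest to x, compared on the known features only."""
    idx = range(len(x)) if known is None else known
    return min(weights, key=lambda pos: sum((weights[pos][i] - x[i]) ** 2 for i in idx))

def train_som(records, rows, cols, epochs=20, seed=0):
    """Train a small 2-D self-organizing map (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    n_features = len(records[0])
    # One weight vector per neuron, with one weight ("signal input") per feature.
    weights = {(r, c): [rng.random() for _ in range(n_features)]
               for r in range(rows) for c in range(cols)}
    total, step = epochs * len(records), 0
    for _ in range(epochs):
        for x in records:  # the records are presented to the map again and again
            frac = step / total
            lr = 0.5 * (1.0 - frac)                            # assumed: linear decay
            radius = 0.5 + (max(rows, cols) / 2) * (1.0 - frac)  # shrinking neighbourhood
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
                for i in range(n_features):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

def predict_feature(weights, x, missing):
    """Read the unknown feature from the neuron that best matches the known ones."""
    known = [i for i in range(len(x)) if i != missing]
    return weights[best_matching_unit(weights, x, known)][missing]

# Synthetic demo data (invented): two homogeneous groups of records with 3 features.
demo_rng = random.Random(1)
records = []
for _ in range(20):
    records.append([demo_rng.uniform(0.0, 0.1) for _ in range(3)])
    records.append([demo_rng.uniform(0.9, 1.0) for _ in range(3)])

som = train_som(records, rows=3, cols=3)
low = predict_feature(som, [0.05, 0.05, None], missing=2)
high = predict_feature(som, [0.95, 0.95, None], missing=2)
print(low, high)
```

Segmentation follows from the same structure: records that share a best-matching neuron, or neighbouring neurons, form a homogeneous group.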

Existing methods and implementations of SOM-map-based data analysis currently require training times for the neural networks that are too long for commercial use. These exceed the training times of other data mining techniques on the same data by a factor of about one hundred and hinder the application of such existing software packages to many existing data collections and questions with the computing power currently available. For example, to train a SOM network of 30 · 40 neurons on a large database of 60,000,000 records with 100 features using the DataCockpit software, a server (with one to two 3 GHz Intel® CPUs and 64 gigabytes of RAM) would have to compute continuously for about 2 to 3 months; in practice this would be completely unacceptable.

Document US 2007/0118399 A1 describes an information system that stores data in an integrated knowledge base and permits access to the data and their evaluation. The information system integrates data from various data sources. The data obtained are then processed by logical components, for example with the aid of neural networks. After processing, the processed data are available for retrieval by human or machine users.

The document entitled "Knowledge Discovery in Databases" by G. Stumme et al. (lecture notes, University of Kassel, 2004) describes generally known data mining methods. Among other things, it mentions the use of neural networks for data segmentation, attribute prediction and deviation analysis.

Technical problem

There is thus a technical need to reduce this training time significantly by technical measures. In addition, the memory required should decrease appreciably through the use of technical measures; the example mentioned above should require a main memory of only a few gigabytes of RAM to run.

Summary

To solve this problem, an electronic data processing system for the analysis of data having the features of claim 1 is proposed.

This arrangement has the technical effect of increasing the efficiency and security of the data analysis. A further technical effect is that the demands on the required computer resources are lowered compared with the conventional approach. Finally, the data transmission speed and the subsequent data processing are positively influenced.

The type of data compression can be adapted to the structure of the data (Boolean, numeric, textual, etc.). This makes it possible to transform source data that are structured differently or captured in different ways (e.g. flat files, database tables, Excel spreadsheets) into a compressed form that requires only about 5% to about 12% of the storage space of the original data. Since only this compressed form of the data is sent from the on-site client computer to the analysis server, a further technical advantage is faster data transfer with lower demands on the data channel. The data-type-dependent compression of the original data simultaneously anonymizes the data. The compression can also be carried out in such a way that the precision of the data after compression/decompression does not unduly distort the result of the analysis.
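
As a rough plausibility check, the 5% to 12% figure can be applied to the example from the background section (60,000,000 records with 100 features). The calculation below assumes that every uncompressed value occupies 8 bytes; this size is an assumption made here for illustration and is not stated in the document.

```python
RECORDS = 60_000_000   # from the background example
FEATURES = 100         # from the background example
BYTES_PER_VALUE = 8    # assumption: each raw value is held as a 64-bit number

def gigabytes(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9

raw_bytes = RECORDS * FEATURES * BYTES_PER_VALUE
compressed_low = 0.05 * raw_bytes    # about 5% of the original size
compressed_high = 0.12 * raw_bytes   # about 12% of the original size

print(f"raw:        {gigabytes(raw_bytes):.1f} GB")         # raw:        48.0 GB
print(f"compressed: {gigabytes(compressed_low):.1f} to "
      f"{gigabytes(compressed_high):.1f} GB")               # compressed: 2.4 to 5.8 GB
```

Under this assumption the compressed data would occupy a few gigabytes, which is the order of magnitude named in the technical problem.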

The compressed data form is very well suited to neural network analyses, but also to fast interactive data exploration, e.g. by means of multivariate statistics, in which the results are to be available in real time or near real time (with a short waiting time of less than a few tens of seconds).

With reference to Fig. 1, an electronic data processing system serves to analyze data. The electronic data processing system has an analysis server 10 and one or more on-site client computers 12. The analysis server is, for example, a PC with several 3 GHz Intel® CPUs and 64 gigabytes of RAM as main memory. A self-adapting neural network is to be implemented on it as a data object, and this network is to be trained on a large database containing a large number of records with many features. The on-site client computer 12 is set up and programmed to subject data supplied to it to data preprocessing and/or data compression before the data are sent to the analysis server 10 via an electronic network 14, for example the Internet. The analysis server 10 is also set up and programmed to train the self-adapting neural network with the received, preprocessed/compressed data by repeatedly presenting the data to the self-adapting neural network, and then to perform an analysis in order to create a self-adapting neural network model. The analysis server then causes the self-adapting neural network model to be sent from the analysis server 10 to the on-site client computer 12, likewise via the network 14. Finally, the on-site client computer 12 is set up and programmed to decompress the data of the self-adapting neural network model.

Data compression can be used for several kinds of powerful interactive data analysis and data exploration techniques that still permit interactive work in real time even on large data sources of more than one gigabyte in size. In interactive multivariate statistics, value distribution diagrams (histograms) for several or all features of a data collection are displayed side by side on the screen. If some of the histogram bars are selected in one or more of the diagrams, the remaining frequencies are immediately displayed in the other diagrams. This permits a very flexible 'drill down' into the data without the laborious task of building and maintaining a multidimensional OLAP cube. The problem is that response times slow down considerably on large data sets. The IBM DB2 Data Warehouse Edition software takes this into account by performing the multivariate analysis by default on only 1000 records. With so few records, however, one cannot expect to obtain correct and reliable results on tables of, e.g., 10⁶ to 10⁸ records. The data compression presented here offers an elegant way out. Once the data collection has been compressed and loaded into main memory in compressed form, the multivariate analysis can be carried out in real time, without sampling, even on data collections of several gigabytes in original size.
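
The drill-down behaviour described above can be sketched on a small in-memory table. The sketch assumes a simple dictionary encoding (each distinct value replaced by a small integer code) as the compressed form; the document does not specify its compression format, and the names `encode_column` and `histograms` as well as the sample data are invented for illustration.

```python
from collections import Counter

def encode_column(values):
    """Dictionary-encode a column: each distinct value becomes a small integer code."""
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return codes, list(dictionary)  # labels[code] gives back the original value

def histograms(columns, selection=None):
    """Value frequencies per feature, counted over the rows that match the selection.

    columns   maps feature name -> (codes, labels) as returned by encode_column
    selection maps feature name -> set of selected labels (the chosen histogram bars)
    """
    selection = selection or {}
    # Translate the selected labels into integer codes once, before scanning rows.
    wanted = {name: {columns[name][1].index(label) for label in chosen}
              for name, chosen in selection.items()}
    n_rows = len(next(iter(columns.values()))[0])
    counts = {name: Counter() for name in columns}
    for row in range(n_rows):
        if all(columns[name][0][row] in allowed for name, allowed in wanted.items()):
            for name, (col_codes, labels) in columns.items():
                counts[name][labels[col_codes[row]]] += 1
    return counts

# Invented sample data: two nominal features for six customers.
raw = {
    "region": ["north", "south", "north", "west", "south", "north"],
    "card":   ["yes",   "no",    "yes",   "yes",  "no",    "no"],
}
columns = {name: encode_column(values) for name, values in raw.items()}

all_counts = histograms(columns)
filtered = histograms(columns, {"region": {"north"}})
print(dict(filtered["card"]))  # {'yes': 2, 'no': 1}
```

Because the scan compares only small integer codes held in main memory, every change of selection can be answered by one pass over the full table instead of a sample.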

The procedure described permits the handling of large amounts of data, with a significant increase in the data volume that can be analyzed. For textual (nominal) data, a drastic reduction in the storage space required is possible, and the analysis speed for predominantly non-numeric data increases significantly. The analysis speed also increases because the training of the SOM models is accelerated. Data anonymization is achieved by dividing the data into a confidential part and a non-confidential part. Only the non-confidential part is transmitted to the analysis server and analyzed by it. The confidential part remains on the on-site client. The client can be set up and programmed so that, when the anonymized analysis result arrives from the analysis server, it merges this anonymized result with the confidential part of the data to form a de-anonymized plain-text analysis result.

When confidential data are anonymized, a flat file in the broadest sense (i.e., for example, a comma-, semicolon- or tab-separated text file with variable column width, a text file with fixed column width, a table from a spreadsheet program such as Microsoft® Excel® or OpenOffice®, a table in a relational, object-oriented or XML database, etc.) is compressed and, at the same time, all potentially confidential information is removed from the data. The confidential information is stored in a separate data description. The information filtered out is, for example, the following: first, feature names, which are replaced by anonymized names such as C0, C1, ..., D0, D1, ..., B0, B1, ..., N0, N1, ..., where C stands for 'continuous numeric', D for 'discrete numeric', B for binary (two-valued) and N for nominal (textual); second, textual feature values, which are replaced by anonymized values such as V, V1, ... or VALUE0, VALUE1, ...; and third, numeric values, whose actual value ranges are transformed to a normalized distribution with mean m = 0 and spread s = 1. Further possible anonymizations for numeric features involve not only the first two moments of the distribution, m and s, but also higher moments such as skewness or kurtosis.
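
The three replacements described above (feature names, textual values, numeric value ranges) can be sketched as follows. This is an illustrative assumption rather than the patented implementation: the function name `anonymize`, the layout of the separate data description and the sample table are invented, and the spread s is interpreted here as the population standard deviation.

```python
import statistics

def anonymize(table, kinds):
    """Split a table into an anonymized part and a confidential data description.

    table maps feature name -> list of values
    kinds maps feature name -> 'C', 'D', 'B' or 'N' (the type letters named above)
    """
    anonymous, description, counters = {}, {}, {}
    for name, values in table.items():
        kind = kinds[name]
        alias = f"{kind}{counters.get(kind, 0)}"       # e.g. C0, N0, B0
        counters[kind] = counters.get(kind, 0) + 1
        if kind in ("C", "D"):
            m = statistics.fmean(values)
            s = statistics.pstdev(values) or 1.0       # avoid dividing by zero
            anonymous[alias] = [(v - m) / s for v in values]  # mean 0, spread 1
            description[alias] = {"name": name, "mean": m, "spread": s}
        else:
            codes = {}
            for v in values:
                codes.setdefault(v, f"VALUE{len(codes)}")
            anonymous[alias] = [codes[v] for v in values]
            # Store the reverse mapping so the client can restore the plain text.
            description[alias] = {"name": name,
                                  "values": {code: v for v, code in codes.items()}}
    return anonymous, description

# Invented sample data.
table = {
    "income":   [30000.0, 50000.0, 70000.0],
    "region":   ["north", "south", "north"],
    "has_card": ["yes", "no", "yes"],
}
kinds = {"income": "C", "region": "N", "has_card": "B"}

anonymous, description = anonymize(table, kinds)
print(anonymous["N0"])  # ['VALUE0', 'VALUE1', 'VALUE0']
```

Only `anonymous` would leave the on-site client in this sketch; `description` holds the feature names, value labels and moments and stays local.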

Only the anonymized, compressed data are transmitted to the analysis server, which creates an anonymized SOM model from them and sends it back. An anonymized SOM model is de-anonymized again by recombining the confidential information from the separate data description with the SOM model. Although the analysis server has been assumed above to be separate from the on-site client computer, it is also possible to combine the computing power and software program components provided in the two units in a single computer unit.

The anonymization of the data during compression, as described above, makes a client-server architecture usable for data analysis even with confidential data. Because a software program component for compressing the data is provided on the client, a production plant that wants to analyze or improve the quality of its production processes can, for example, first preprocess and compress (and thereby anonymize) the data on its own on-site client computer before sending them to the analysis server. The operator of the analysis server, who provides the software program component for training the self-adapting neuron network with the received compressed data and the software program component for running an analysis, then carries out the analysis on the preprocessed, compressed, anonymized data. The analysis server sends back the anonymized neuron network model, and the on-site client computer replaces the anonymized values in it with the original values.
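
The round trip described above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the helper names `anonymize` and `deanonymize`, the feature name, the dictionary layout of the confidential description, and the use of a plain mean/standard-deviation normalization as the transformation are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev

def anonymize(feature_name, values):
    """Split a numeric column into a confidential and a non-confidential part.

    Hypothetical helper: the confidential part (name, mean, spread) stays on
    the client; only the normalized numbers would be sent to the server.
    """
    m, s = mean(values), pstdev(values)
    confidential = {"name": feature_name, "mean": m, "std": s}
    # Assumed transformation: normalize to mean 0.5 and spread 0.25.
    normalized = [0.5 + 0.25 * (v - m) / s for v in values]
    return confidential, normalized

def deanonymize(confidential, model_values):
    """Map normalized values in a returned model back to original units."""
    m, s = confidential["mean"], confidential["std"]
    return [m + (z - 0.5) / 0.25 * s for z in model_values]

secret, public = anonymize("furnace_temperature_C", [800.0, 820.0, 840.0, 860.0])
# Stand-in for a model returned by the server: one normalized neuron weight.
model = [sum(public) / len(public)]
restored = deanonymize(secret, model)
```

Without `secret`, the numbers in `public` and `model` carry neither the feature name nor the original scale, which is the property the text relies on.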

In addition, the anonymized data can be encrypted while being transmitted over a network. The data anonymization described above is fully compatible with conventional secure transmission protocols and encryption methods such as PGP, https or scp. Encrypting the previously anonymized data is, however, not strictly necessary. The technical advantage of anonymization is that, besides reducing the data volume (relative to the original data recorded), it preserves the confidentiality of the data not only during transfer over the network but also throughout the entire analysis on the analysis server. Further resource-consuming encryption and decryption for the transmission between the analysis server and the on-site client can therefore in principle be omitted, because the anonymized data are unintelligible without their correlation to the plaintext part. Insight into the confidential data is thus denied not only to a potential eavesdropper during transmission but also to the operator of the analysis server.

The analysis server can be set up and programmed to train the self-adapting neuron network with the received anonymized data repeatedly until a converged network state results that adequately represents the data. Preferably, the data are presented to the self-adapting neuron network about 100 to about 200 times.

The on-site client computer can be set up and programmed to subject the data supplied to it, in volumes ranging from about 10 gigabytes to several terabytes, to the data preprocessing and the data compression. These data sizes refer to typical large database tables and the computer technology available in 2008. If general computing performance and database sizes continue to grow exponentially ('Moore's Law'), the stated data sizes will grow proportionally.

The analysis server can also be set up and programmed to provide, in addition to or instead of SOM modeling, further data mining or data analysis methods, for example association rule methods, decision tree methods, Bayesian methods, regression methods or other neural analysis methods besides the SOM method. These methods, and many others, can operate directly on the compressed data format described above and thereby run significantly faster.

The on-site client computer can be set up and programmed to read the data supplied to it once during data preprocessing and to transform the original features they contain into purely numerical, normalized features.

The on-site client computer can be set up and programmed, during data compression, to store the normalized numerical feature values of the data supplied to it in compressed form, so that on average only between two bits and about one byte of storage is needed per feature value.

On the one hand, this allows the compressed data to be loaded completely into the main memory of the analysis server in a single loading operation and then used for many successive analyses, possibly with different analysis methods, by the software program component for training the self-adapting neuron network (and/or another analysis method) and by the software program component for running an analysis. Repeated, time-consuming loading of the data for each analysis step thus becomes unnecessary. On the other hand, the adapted, high compression rate also allows the normalized, compressed data to be kept persistently on mass storage in addition to the original data. The otherwise usual 100 to 200 passes of reading the data record by record, parsing them and bringing them into a format suitable for the analysis can thus be dispensed with, even if the compressed data do not fit completely into the main memory of the computer used.

The combination of reading a data record from hard disk as a character string and then parsing it, including mapping character strings to numerical values, can take about 10,000 times as long as accessing a compressed, already parsed and normalized data record in main memory (a factor of about 1000 results from the higher access speed, and a factor of about 10 from the size of the data, which compression reduces to about 5% to about 12%).

The on-site client computer can further be set up and programmed to determine the data compression method to be used as a function of the compression error acceptable for the respective analysis task. The compression method to be used and the compression rate to be achieved are to be chosen depending on the different feature types (Boolean, numerical, nominal (textual)) and on the accuracy requirements of the selected analysis technique.

When the SOM analysis technique is used, the on-site client computer can be set up and programmed so that the mean prediction error for numerical features, that is, the difference between the actual normalized feature value of a data record and the normalized value that the neuron best matching the data record as a whole predicts for that feature, usually lies between 0.01 and 0.1 for sensibly trained SOM networks. The advantage is that the network does not try to reproduce exactly every feature value that happens to occur in a particular data record, so that random fluctuations and coincidences in the training records are not learned as general regularities in the data (so-called 'overtraining' of the network). Instead, when an error is permitted, the self-adapting neuron network also delivers useful statements when it is applied to new data records that were not contained in the training data.

So if a 'generalization error' of the self-adapting neuron network of 0.01 to 0.1 per feature is normal, then an additional mean error of 0.00001 per feature caused by compressing the training data is negligible and tolerable, because in that case the feature values to be found by the self-adapting neuron network would be unaffected by the discretization error to 3 to 4 digits of accuracy.

If, for example, about 10,000 data records are mapped onto one neuron (which is the case on average for, e.g., a network of 30·40 neurons and about 12 million data records), then by the law of large numbers the 10,000 independent and random discretization errors produce a total influence of 0.00001 exactly when each individual discretization error is on average 0.001 = 0.00001·√10000. A mean discretization error of 0.001 for a single data record in a normalized numerical feature is therefore acceptable.
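
The arithmetic of this estimate can be retraced in a few lines. The variable names are chosen for this illustration only; the numbers are the ones given in the text.

```python
import math

neurons = 30 * 40                 # network size from the example
records = 12_000_000              # data records from the example
records_per_neuron = records / neurons

target_total_error = 0.00001      # tolerated aggregate error per feature
# Independent random errors average out with 1/sqrt(N), so each single
# error may be sqrt(N) times larger than the aggregate target.
per_record_error = target_total_error * math.sqrt(records_per_neuron)
```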

The on-site client computer can be set up and programmed to discretize the normalized numerical feature values into a number of discrete intervals such that the mean discretization error |value – discrete value| is approximately 0.001. In other words, a mapping 'value → interval index' (and, for the back-calculation, the inverse mapping 'interval index → discrete value ≔ midpoint of the interval') is carried out such that the mean discretization error |value – discrete value| is about 0.001.

The on-site client computer can be set up and programmed to store the discretized numerical value in 8 bits (= 1 byte) of storage, which provides 254 interval indices plus 2 indices for 'value not present' and 'invalid value'.

When used with other data mining or data analysis methods, the on-site client can also be set up and programmed to store the discretized numerical value in more or fewer than 8 bits. If an analysis method requires higher accuracy than the SOM method, 1022 different interval indices (plus 2 indices for 'value not present' and 'invalid value') could, for example, be stored in 10 bits, which would reduce the discretization error by a factor of 4 compared with storage in 8 bits.
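
The relation between storage width and the number of usable interval indices can be checked with a short sketch. The function name `interval_indices` is hypothetical; the two reserved codes are the ones named in the text.

```python
def interval_indices(bits, reserved=2):
    """Number of interval indices left after reserving the special codes."""
    return 2 ** bits - reserved

counts = {bits: interval_indices(bits) for bits in (8, 9, 10)}
# With four times as many intervals over the same range, each interval is
# about four times narrower, and so is the mean discretization error.
error_reduction = counts[10] / counts[8]
```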

The on-site client computer can also be set up and programmed so that the interval division is non-equidistant and depends on the value distribution. The widths of the discrete subintervals are preferably set particularly small in regions of high value density. In this way the mean discretization error can be kept below about 0.001 for most practically relevant value distributions (e.g. normal distribution, exponential distribution, Weibull distribution) even when only 8 bits are used for storage.

The on-site client computer can be set up and programmed so that, for numerical data value distributions that follow the Gaussian or normal distribution with mean m and standard deviation s, the following intervals are defined:

• about 64 intervals of width s/64 in each of the ranges [m – s, m[ and [m, m + s[;
• about 32 intervals of width s/32 in each of the ranges [m – 2s, m – s[ and [m + s, m + 2s[;
• about 16 intervals of width s/16 in each of the ranges [m – 3s, m – 2s[ and [m + 2s, m + 3s[;
• about 8 intervals of width s/8 in each of the ranges [m – 4s, m – 3s[ and [m + 3s, m + 4s[;
• about 4 intervals of width s/4 in each of the ranges [m – 5s, m – 4s[ and [m + 4s, m + 5s[;
• about 2 intervals of width s/2 in each of the ranges [m – 6s, m – 5s[ and [m + 5s, m + 6s[; and
• about 1 interval of infinite width for each of ]–∞, m – 6s[ and [m + 6s, ∞[.
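
The interval counts in the list above add up to the 254 interval indices available in one byte. A small sketch makes this explicit; the function name `interval_bands` and the value s = 0.25 are chosen for illustration only.

```python
def interval_bands(s):
    """List (count, width) per side of the mean, from the mean outwards."""
    bands = []
    count = 64
    while count >= 2:
        bands.append((count, s / count))   # each band spans exactly one s
        count //= 2
    bands.append((1, float("inf")))        # open end interval beyond 6s
    return bands

bands = interval_bands(0.25)
per_side = sum(count for count, _ in bands)
total_intervals = 2 * per_side             # both sides of the mean
```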

For this distribution, the value density is high in the range [m – s, m + s], falls off sharply at distances of 1s to 2s from the mean, and rapidly approaches 0 at even greater distances from the mean.

The interval division described above is optimized for storage in 8 bits. If more (fewer) than 8 bits are used per value, the numbers given above are to be multiplied (divided) by a factor of 2 for each additional (omitted) bit.

The on-site client computer can be set up and programmed so that the discrete intervals are distributed symmetrically around the mean m as a function of m and s, where the interval [m – s/64, m[ has interval position 127 and the interval [m, m + s/64[ has position 128, where the numerical value assigned to each interval is the interval midpoint, and where interval positions 0 and 255 are reserved for invalid and missing values, respectively. These numbers are optimized for storage in 8 bits. If more (fewer) than 8 bits are used per value, the numbers given above are to be multiplied (divided) by a factor of 2 for each additional (omitted) bit.
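
A minimal sketch of the two mappings 'value → interval position' and 'interval position → midpoint' for the 8-bit case could look as follows. The function names are hypothetical, and two details are assumptions not stated in the text: NaN is treated as the 'invalid' case, and the two open end intervals, which have no midpoint, are decoded to their finite boundary.

```python
from bisect import bisect_right

INVALID, MISSING = 0, 255   # reserved positions named in the text

def build_edges(m, s):
    """Finite boundaries: 64 steps of s/64, 32 of s/32, ... 2 of s/2 per side."""
    offsets, x, count = [0.0], 0.0, 64
    while count >= 2:
        for _ in range(count):
            x += s / count
            offsets.append(x)
        count //= 2
    return [m - o for o in reversed(offsets[1:])] + [m + o for o in offsets]

def encode(value, edges):
    """Map a value to its interval position 1..254."""
    if value is None:
        return MISSING
    if value != value:          # NaN, assumed to count as 'invalid'
        return INVALID
    return bisect_right(edges, value) + 1

def decode(position, edges):
    """Return the interval midpoint for positions 2..253."""
    if position in (INVALID, MISSING):
        return None
    if position == 1:           # open lower end: assumed fallback
        return edges[0]
    if position == 254:         # open upper end: assumed fallback
        return edges[-1]
    return (edges[position - 2] + edges[position - 1]) / 2

edges = build_edges(0.5, 0.25)
code = encode(0.3, edges)
approx = decode(code, edges)
```

Using `bisect_right` on the sorted boundary list makes each interval closed on the left and open on the right, which matches the [a, b[ notation used in the text.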

In this way, a discretization with a mean discretization error of only 0.005·s at a storage width of 8 bits per value is achievable for at least approximately normally distributed data, because for values in the interval [m – s, m + s] (68% of all values) the mean discretization error is s/256, for values in [m – 2s, m – s[ and ]m + s, m + 2s] (28% of all values) it is s/128, and for values in [m – 3s, m – 2s[ and ]m + 2s, m + 3s] (4% of all values) it is s/64. This mean discretization error of about 0.005·s is thus smaller by a factor of about 3 than the mean discretization error achievable with a corresponding equidistant discretization of the range [m – 7s, m + 7s] into 254 intervals.
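
The weighted average behind this figure can be retraced as follows. The sketch assumes, as a simplification, that values are spread uniformly within each interval, so that the mean distance to the midpoint is a quarter of the interval width; the shares 68%, 28% and 4% are taken from the text.

```python
s = 1.0  # work in units of the standard deviation

# (share of values, mean error = quarter of the interval width)
bands = [(0.68, s / 256), (0.28, s / 128), (0.04, s / 64)]
graded_error = sum(share * err for share, err in bands)

# Equidistant alternative: [m - 7s, m + 7s] split into 254 intervals.
equidistant_width = 14 * s / 254
equidistant_error = equidistant_width / 4   # mean |x - midpoint| for uniform x

ratio = equidistant_error / graded_error
```

Under these assumptions `graded_error` comes out slightly above 0.005·s and `ratio` at roughly 2.5, which is in line with the rounded figures quoted in the text.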

The on-site client computer can be set up and programmed so that numerical features are first mapped onto normalized features with mean m = 0.5 and standard deviation s = 0.25, so that nearly all values (96%) lie in the range between 0 and 1. The normalized numerical features are thus comparable with normalized Boolean or nominal features, whose values likewise lie in the range between 0 and 1.
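
For an exactly normal distribution, the share of values falling between 0 and 1 after this normalization can be computed with Python's standard library. This is only a plausibility check; real data are at best approximately normal.

```python
from statistics import NormalDist

# Normalized feature: mean 0.5, standard deviation 0.25.
normalized = NormalDist(mu=0.5, sigma=0.25)
# The range 0..1 corresponds to the mean plus or minus two standard deviations.
share_in_unit_range = normalized.cdf(1.0) - normalized.cdf(0.0)
```

The result is about 0.95, close to the rounded 96% quoted in the text.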

The mean discretization error for the normalized numerical features is thus about 0.0013, which is sufficiently accurate for SOM networks.

The on-site client computer can be set up and programmed, in order to determine the mean (m) and the standard deviation (s) of the feature to be discretized, to read a small fraction of about 1 per mille to about 10% of the data records and to calculate from them, for each numerical feature in the data, a provisional mean (m^(vorl)) and a provisional spread (s^(vorl)).
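
A provisional estimate from a small fraction of the records might be sketched like this. The function name `provisional_parameters`, the random sampling strategy and the synthetic test data are assumptions for illustration; the text only says that a small fraction of the records is read.

```python
import random
from statistics import mean, pstdev

def provisional_parameters(values, fraction=0.01, seed=0):
    """Estimate a provisional mean and spread from a small random sample."""
    rng = random.Random(seed)
    k = max(2, int(len(values) * fraction))
    sample = rng.sample(values, k)          # assumed: random subset of records
    return mean(sample), pstdev(sample)

# Synthetic stand-in for one numerical feature column.
data_rng = random.Random(1)
data = [data_rng.gauss(100.0, 15.0) for _ in range(100_000)]
m_prel, s_prel = provisional_parameters(data, fraction=0.01)
```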

The on-site client computer can further be set up and programmed to read all data records and to carry out, for all numerical features, a provisional discretization based on the provisional mean (m^(vorl)) and the provisional spread (s^(vorl)).

For this purpose, the on-site client computer can be set up and programmed to provide
65532 equidistant intervals of width s^(vorl)/256 centered on the provisional mean (m^(vorl)), as well as
two open end intervals ]–∞, m^(vorl) – 32766/256·s^(vorl)[ and [m^(vorl) + 32766/256·s^(vorl), ∞[ and two interval indices representing 'value not present' and 'invalid numerical value', and
to record, for all intervals, the frequencies with which a value falls into the respective interval.
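
The provisional 16-bit discretization with frequency logging can be sketched as follows. The text fixes the number and width of the intervals but not how the 65536 codes are assigned, so the code layout below (`MISSING`, `INVALID`, `LOW_END`, `HIGH_END` and the offset of the equidistant intervals) is an assumption, as is the function name.

```python
import math
from collections import Counter

MISSING, INVALID = 0, 65535      # assumed codes for the two special indices
LOW_END, HIGH_END = 1, 65534     # assumed codes for the two open end intervals
HALF = 32766                     # equidistant intervals on each side of m_prel

def provisional_index(value, m_prel, s_prel):
    """Map a raw value to a 16-bit provisional interval index."""
    if value is None:
        return MISSING
    if isinstance(value, float) and math.isnan(value):
        return INVALID
    width = s_prel / 256
    if value < m_prel - HALF * width:
        return LOW_END
    if value >= m_prel + HALF * width:
        return HIGH_END
    k = math.floor((value - m_prel) / width)    # signed interval number
    k = min(max(k, -HALF), HALF - 1)            # guard against rounding at the ends
    return k + HALF + 2                         # codes 2 .. 65533

values = [99.0, 100.0, 100.01, None, 1e9]
histogram = Counter(provisional_index(v, 100.0, 15.0) for v in values)
```

The `Counter` plays the role of the frequency log: after one pass over the data it holds the occupancy of every provisional interval.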

This procedure can be used as a pre-discretization before the actual discretization in order to save computing time. Its technical advantage is that, on the one hand, this provisional discretization needs only 16 bits (2 bytes) of storage per numerical value, reducing the data volume to about a quarter compared with a 'double' floating-point number. On the other hand, the provisional discretization can be produced in a single pass through the entire original data, even if the mean and other distribution parameters (standard deviation, skewness, distribution shape) are not yet known at the start. Direct compression into an 8-bit format, by contrast, is possible only once the distribution parameters have been determined exactly, which normally requires a (time-consuming) separate read pass through the entire original data. This additional read pass for the exact determination of mean and standard deviation (and possibly further distribution parameters) can be avoided here because the compression scheme is granular enough that the mean and standard deviation need only be roughly estimated or guessed at first (or determined approximately on a small portion of the data), and the following effects can be absorbed afterwards:

• Shift of the mean (m ≠ m^(vorl)). The provisional discretization covers 128 standard deviations to the right and left of the provisional mean m^(vorl) and can therefore cope with considerable later shifts of the mean.
• Change of the spread (s ≠ s^(vorl)). The provisional discretization is so fine over the entire covered range of 256·s^(vorl) that the later, final discretization can still be derived from it even with a marked deviation of the spread (s = 0.25·s^(vorl) ... 16·s^(vorl)).
• No normal distribution is present. The provisional discretization uses narrow, equidistant intervals and can finely reproduce clusters of values at any point of the distribution.

The on-site client computer can further be set up and programmed to analyze the recorded frequencies of the interval occupancies and to derive from them the shape of the value distribution and the matching final discretization.

For this purpose, the on-site client computer can be set up and programmed, in the case of an at least approximately uniform distribution, to form two open end intervals and, between them, 2ⁿ – 4 equidistant intervals. One of the provisional interval boundaries is chosen as the upper boundary of the lower end interval such that the lower end interval contains about 1/(2ⁿ – 2) of all valid values in total, and the width of the 2ⁿ – 4 equidistant intervals is set to the smallest multiple of the provisional interval width that lets the occupancy of the remaining upper end interval grow to no more than 1/(2ⁿ – 2) of all valid values. Here 1 < n ≤ 64.
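
One way to read this rule as an algorithm over the logged frequencies is sketched below. The function name `uniform_final_grid` and its parameters are hypothetical, and interpreting 'about 1/(2ⁿ – 2)' as 'the provisional boundary whose lower tail is closest to that share' is an assumption.

```python
from itertools import accumulate

def uniform_final_grid(counts, width, lower_bound, bits):
    """Derive the first boundary and the inner interval width.

    counts[j] is the occupancy of the provisional interval
    [lower_bound + j*width, lower_bound + (j+1)*width[.
    """
    total = sum(counts)
    target = total / (2 ** bits - 2)           # desired share per end interval
    inner = 2 ** bits - 4                      # number of equidistant intervals
    cum = [0] + list(accumulate(counts))       # cum[j] = values below boundary j
    # Assumed reading of "about": boundary whose lower tail is closest to target.
    j1 = min(range(len(cum)), key=lambda j: abs(cum[j] - target))
    k = 1
    while True:
        j_end = j1 + inner * k                 # boundary index where the upper end starts
        upper_tail = total - cum[min(j_end, len(counts))]
        if upper_tail <= target:               # smallest multiple that satisfies the rule
            return lower_bound + j1 * width, k * width
        k += 1

# Toy example: 1000 provisional intervals of width 0.001 with 10 values each.
g1, final_width = uniform_final_grid([10] * 1000, 0.001, 0.0, bits=4)
```

The loop always terminates, because once the inner intervals reach past the last provisional interval the upper tail is empty.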

The on-site client computer can furthermore be set up and programmed, in the case of an at least approximately exponential distribution (density function d(x) = λ·e^(−λx)), to define two open end intervals and between them 2ⁿ − 4 intervals of decreasing width. One of the provisional interval boundaries is chosen as the upper bound (g₁) of the lower end interval such that about 1/(2ⁿ − 2) of all valid values lie in the lower end interval, and the interval boundary g_end is determined above which about 1/(2ⁿ − 2) of all valid values lie. λ is determined from g₁ and g_end as λ ≔ ln(2ⁿ − 2)/(g_end − g₁), and the desired width (b) of the first intermediate interval is determined as b ≔ ln((2ⁿ − 3)/(2ⁿ − 2))/λ. For an exponential distribution, exactly 1/(2ⁿ − 2) of all valid values then lie in this interval. Here 1 < n ≤ 64. With a storage width of 8 bits (1 byte) per value, the resulting numerical factors are 252, 254, 1/254, ln 254 and ln(253/254); with 9 bits they are 508, 1/510, ln 510 and ln(509/510).
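As a quick arithmetic check, the numerical factors named for 8 and 9 bits follow from substituting n into the expressions of this paragraph. The decimal values are rounded approximations added here for orientation only.

```latex
\begin{aligned}
n = 8:\quad & 2^{n} - 4 = 252, \qquad 2^{n} - 2 = 254, \qquad \frac{1}{2^{n} - 2} = \frac{1}{254},\\
            & \ln(2^{n} - 2) = \ln 254 \approx 5.5373, \qquad \ln\frac{2^{n} - 3}{2^{n} - 2} = \ln\frac{253}{254} \approx -0.00394,\\[4pt]
n = 9:\quad & 2^{n} - 4 = 508, \qquad 2^{n} - 2 = 510, \qquad \frac{1}{2^{n} - 2} = \frac{1}{510},\\
            & \ln(2^{n} - 2) = \ln 510 \approx 6.2344, \qquad \ln\frac{2^{n} - 3}{2^{n} - 2} = \ln\frac{509}{510} \approx -0.00196.
\end{aligned}
```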

Furthermore, in the case of an at least approximately exponential distribution, the on-site client computer can be set up and programmed to choose an existing provisional interval boundary as the next interval boundary (g₂) such that the absolute value of the next interval boundary (g₂) minus the upper bound (g₁) of the lower end interval minus the desired width (b), i.e. |g₂ − g₁ − b|, is minimal, to multiply the desired width (b) of the first intermediate interval by the factor e^(λb), and to compute the subsequent intervals accordingly (i.e. minimize |g₃ − g₂ − b|, multiply b by e^(λb), ...).

For other distributions, e.g. the Weibull distribution, logarithmic distribution, Poisson distribution, etc., comparable special procedures are to be applied to discretize the intervals.

For distributions that match none of the individually treated distribution forms, an at least approximately normal distribution can be assumed. In this case the on-site client computer can be set up and programmed to take the provisional interval boundary closest to the true mean as the center (m), and to choose the standard deviation (s) such that it is as close as possible to the true standard deviation and s/64 is a multiple of the existing interval width. The existing provisional intervals are then merged into larger new intervals whose widths follow the sequence s/64, s/32, s/16, .... For storage widths per value other than 8 bits, correspondingly different numerical factors result.
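The choice of m and s described above can be sketched as follows. This is a minimal illustration, assuming the provisional intervals are equidistant with width w and boundaries at origin + k·w; the function names snapMean and snapStdDev are hypothetical and do not come from the example implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: move the true mean to the nearest provisional interval
// boundary, assuming boundaries lie at origin + k * w for integer k.
double snapMean(double trueMean, double origin, double w) {
    assert(w > 0.0);
    return origin + std::round((trueMean - origin) / w) * w;
}

// Hypothetical helper: choose s as close as possible to the true standard
// deviation such that s / 64 is a whole multiple (at least 1) of w.
double snapStdDev(double trueStdDev, double w) {
    assert(w > 0.0 && trueStdDev > 0.0);
    const double multiple = std::max(1.0, std::round(trueStdDev / (64.0 * w)));
    return multiple * 64.0 * w;
}
```

With w = 0.05 and a true standard deviation of 10, for example, s/64 becomes three provisional widths, so s is about 9.6.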

The methods described so far, including a possible pre-discretization into a 16-bit format, make it possible to obtain a sufficiently accurate discretization, adapted to the actual value distribution of each numerical feature, that needs only 8 bits of storage per feature, or another storage width tailored to the analysis methods to be applied. The original data need to be read completely only once.

The on-site client computer can furthermore be set up and programmed to store Boolean features in 2 bits of storage in the following form: 0: first valid value, 1: second valid value, 2: 'value not present', 3: 'invalid value'. If a Boolean (= two-valued) feature has no missing or invalid values, 1 bit of storage suffices (0: first valid value, 1: second valid value).
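A minimal sketch of this coding, assuming that an empty field means 'value not present' and that any text other than the two valid values is invalid. The names BoolCode, encodeBoolean and bitsPerValue are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// The four states listed above, stored in 2 bits.
enum BoolCode : std::uint8_t {
    FIRST_VALUE = 0,
    SECOND_VALUE = 1,
    VALUE_MISSING = 2,
    VALUE_INVALID = 3
};

// Map the raw text of a two-valued feature to its code.
// Assumption: an empty string stands for a missing value.
BoolCode encodeBoolean(const std::string& raw, const std::string& first,
                       const std::string& second) {
    assert(first != second);
    if (raw.empty()) return VALUE_MISSING;
    if (raw == first) return FIRST_VALUE;
    if (raw == second) return SECOND_VALUE;
    return VALUE_INVALID;
}

// 1 bit suffices if the column has no missing or invalid entries, else 2 bits.
unsigned bitsPerValue(const std::vector<BoolCode>& column) {
    for (BoolCode code : column) {
        if (code >= VALUE_MISSING) return 2;
    }
    return 1;
}
```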

The on-site client computer can furthermore be set up and programmed to split nominal (textual) features, for use in the SOM map analysis, into several Boolean or numerical 0/1 features, one for each valid nominal value of the nominal feature. In any given data record, at most one of these 0/1 features can have the value 1. The data can therefore be stored such that, instead of the nominal value of a feature, the position (index) of this value in the list of all valid values of the feature is stored. In addition, two index values can be maintained that represent 'no value present' and 'value does not occur in the list of valid values'.
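The relation between the stored index and the 0/1 features can be illustrated with a short sketch. It assumes that the two special index values are placed directly after the valid ones (validCount and validCount + 1); the function name expandToIndicators is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand a stored index back into the 0/1 features used by the SOM analysis.
// Assumed layout: indices 0 .. validCount - 1 are valid values, the next two
// indices mean 'no value present' and 'value not in the list of valid values'.
std::vector<std::uint8_t> expandToIndicators(std::size_t index, std::size_t validCount) {
    assert(index < validCount + 2);
    std::vector<std::uint8_t> indicators(validCount, 0);
    if (index < validCount) {
        indicators[index] = 1;  // at most one feature is ever set to 1
    }
    return indicators;
}
```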

The on-site client computer can furthermore be set up and programmed, in the case of nominal features with very many different values, to combine the less frequent values into one or more groups. For example, for the SOM method it is normally not useful to treat more than about 15 nominal feature values per feature individually. In this case the 14 most frequent values can be treated as individual values and all further values combined into the group 'other'.

For this purpose, the on-site client computer can be set up and programmed to carry out the following steps:
All original data records are read, and for each nominal feature in the original data, each value occurring for the first time is stored in a 'dictionary' in which every occurring value is assigned an index number and its frequency of occurrence is recorded. Once a user-defined limit of different values, e.g. about 30, 1000 or 65534, has been entered in the dictionary, the insertion of new values ends. All subsequently occurring values that do not appear in the dictionary are recorded under the penultimate index position, which represents 'other value', while the last index position represents 'no value present'.
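The dictionary step can be sketched as below. This is an illustration under stated assumptions, not the patent's listing: the class name NominalDictionary is hypothetical, an empty string is taken to mean 'no value present', and with a limit of L distinct values the index L is used for 'other value' and L + 1 for 'no value present'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the capped 'dictionary' for one nominal feature.
class NominalDictionary {
public:
    explicit NominalDictionary(std::size_t limit) : ivLimit(limit) {
        assert(limit >= 1 && limit <= 65534);  // indices must fit into 16 bits
    }

    std::uint16_t otherIndex() const { return static_cast<std::uint16_t>(ivLimit); }
    std::uint16_t missingIndex() const { return static_cast<std::uint16_t>(ivLimit + 1); }

    // Return the provisional index of a value and record its occurrence.
    std::uint16_t encode(const std::string& value) {
        if (value.empty()) return missingIndex();
        auto it = ivIndex.find(value);
        if (it == ivIndex.end()) {
            if (ivCounts.size() >= ivLimit) return otherIndex();  // dictionary is full
            it = ivIndex.emplace(value, static_cast<std::uint16_t>(ivCounts.size())).first;
            ivCounts.push_back(0);
        }
        ++ivCounts[it->second];
        return it->second;
    }

    // Occurrence count per provisional index, in order of first appearance.
    const std::vector<std::size_t>& counts() const { return ivCounts; }

private:
    std::size_t ivLimit;
    std::unordered_map<std::string, std::uint16_t> ivIndex;
    std::vector<std::size_t> ivCounts;
};
```

With a limit of 65534 the largest index is 65535, so every index still fits into a 16-bit integer.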

The nominal values are replaced by their index numbers from the dictionary (8-bit or 16-bit integers). This already reduces the data volume considerably compared with the previously required storage of character strings.

The dictionary is sorted by descending frequency of occurrence.

The 14 most frequent values are treated as separate values and are assigned the index numbers 0 to 13. All other values in the dictionary, as well as the previous index 'other value', are assigned the value 14. The previous index 'value not present' becomes index 15. In the provisionally compressed data, the provisional index numbers are replaced by the new index numbers. Each index number can then be stored in no more than 4 bits.
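The sorting and re-indexing steps above can be sketched as a translation table from provisional to final index numbers. The sketch assumes that the provisional 'other value' and 'value not present' indices directly follow the dictionary entries; the names buildFinalIndexTable and packNibbles are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

const std::uint8_t OTHER_INDEX = 14;
const std::uint8_t MISSING_INDEX = 15;

// counts[i] is the frequency of provisional index i. The returned table has
// counts.size() + 2 entries; the last two translate the provisional
// 'other value' and 'value not present' indices (assumed layout).
std::vector<std::uint8_t> buildFinalIndexTable(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> order(counts.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Sort provisional indices by descending frequency.
    std::stable_sort(order.begin(), order.end(),
                     [&counts](std::size_t a, std::size_t b) { return counts[a] > counts[b]; });

    std::vector<std::uint8_t> table(counts.size() + 2, OTHER_INDEX);
    for (std::size_t rank = 0; rank < order.size() && rank < OTHER_INDEX; ++rank) {
        table[order[rank]] = static_cast<std::uint8_t>(rank);  // 14 most frequent: 0..13
    }
    table[counts.size() + 1] = MISSING_INDEX;
    return table;
}

// Store two 4-bit index numbers in one byte (low nibble first).
std::uint8_t packNibbles(std::uint8_t low, std::uint8_t high) {
    assert(low < 16 && high < 16);
    return static_cast<std::uint8_t>(low | (high << 4));
}
```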

Overall, this method achieves a sufficiently accurate compression, adapted to the actual value frequencies of each nominal feature, that needs only 4 bits of storage per nominal feature. The original data had to be read completely only once.

If more than 14 individual values are to be considered, the new index can occupy 8 bits of storage instead of 4. Up to 254 different individual values can then be represented.

This procedure is suitable for parallelization on a multiprocessor computer or a computer network. The provisional compression can be parallelized over partitioned data. A data exchange then takes place between the parallel threads in order to determine global statistics (mean, standard deviation) or value frequencies. After this information has been communicated between the individual threads, the final compression is carried out in parallel.
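One way to obtain the global mean and standard deviation from per-partition results is to merge running moments pairwise. The sketch below uses the well-known pairwise update for count, mean and sum of squared deviations; this technique is swapped in for illustration and is not stated in the text, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct PartialStats {
    std::size_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // sum of squared deviations from the mean
};

// Statistics of one partition, computed independently by one thread.
PartialStats summarize(const std::vector<double>& values) {
    PartialStats st;
    for (double v : values) {
        ++st.count;
        const double delta = v - st.mean;
        st.mean += delta / static_cast<double>(st.count);
        st.m2 += delta * (v - st.mean);
    }
    return st;
}

// Combine the statistics of two partitions into one.
PartialStats mergeStats(const PartialStats& a, const PartialStats& b) {
    if (a.count == 0) return b;
    if (b.count == 0) return a;
    PartialStats out;
    out.count = a.count + b.count;
    const double na = static_cast<double>(a.count);
    const double nb = static_cast<double>(b.count);
    const double delta = b.mean - a.mean;
    out.mean = a.mean + delta * nb / (na + nb);
    out.m2 = a.m2 + b.m2 + delta * delta * na * nb / (na + nb);
    return out;
}

// Population standard deviation of the merged data.
double stdDev(const PartialStats& st) {
    assert(st.count > 0);
    return std::sqrt(st.m2 / static_cast<double>(st.count));
}
```

Each thread would call summarize on its own partition; the merged result then drives the final, again parallel, compression pass.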

The compression techniques described also anonymize the data. If, for example, a user of a data analysis server does not want to provide the server with unencrypted data for reasons of data protection or confidentiality, the user can perform the data compression on his or her own computer (the on-site client computer). The compressed data (which contain only interval indices for numerical data and value indices for binary and nominal data) are transmitted to the data analysis server. The 'decompression information' remains on the on-site client computer: for numerical data, this is the means, standard deviations, distribution form, minimum, maximum and any further information used for the discretization; for binary and nominal data, it is the value 'dictionaries' that allow the actual value to be inferred from the value index. An unauthorized viewer of the compressed data therefore cannot make use of them. The analysis server produces analyses, evaluations, SOM models, etc. on the basis of the interval and value indices of the compressed data and sends these results back to the on-site client computer. The latter is set up and programmed to combine the results with the decompression information, so that the results are available with the original information.

The automatic discretization of numerical values and the combination of rarely occurring nominal values into the group 'other' automatically ensure compliance with legal regulations that prohibit evaluations and analyses when the result groups are so small that conclusions could be drawn about individual persons.

The following example implementation contains the following restrictions/simplifications compared with the general concept:
In the compression of numerical features, the special discretizations for particular distribution forms were not implemented, only the basic method, which assumes an approximately normal distribution.

Binary and nominal features are not distinguished but are compressed as 'categorical' features to a storage size of 4 bits.

The example implementation is written in the programming language C++. The following style conventions were followed: variable names and function names begin with lowercase letters, types and classes with uppercase letters. Constants consist only of uppercase letters. Instance variables of classes have the prefix 'iv', or 'piv' if the variable is a pointer.

The implementation consists of one enumeration type and 4 main classes:
enum FieldType {CONTINUOUS, DISCR_NUMERIC, BINARY, NOMINAL} describes the feature types (floating-point number, integer, binary, nominal).

The class GaussianCompress performs the compression and decompression of a numerical feature (under the assumption that the value distribution is approximately normal).

The class DataDescription describes the training data: feature names, feature types, number of different valid values of the nominal and discrete numerical features, and means and standard deviations of the numerical features.

The class DataDescription contains an object of type GaussianCompress for each numerical feature. The class DataRecord contains the data of a single data record. The class can read the binary compressed data format from an object of type DataPage and decompress it (using the objects of type GaussianCompress that it finds in the DataDescription of the DataPage object).

The class DataPage contains a series of compressed data records, which it can read in and compress using the method appendDataRecord, and read out again, automatically decompressed, as an object of type DataRecord using the method retrieveNextDataRecord. Each class instance thus contains part or all of the training data, in compressed form, for training a SOM network. The method readRecordFromDataPage decompresses the data record stored starting at the memory address pData.

The decompressed numerical values are written to the array pivNumValues, the decompressed binary and nominal field values to the array pivCatValues.

The method appendDataRecord compresses the data record dataRecord, which is supplied as a character string in which the individual feature values are separated by the separator character ivDataDescr.getSeparator().
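Before any value can be compressed, such a record string has to be split at the separator character. A minimal sketch of that first step, assuming no quoting or escaping of the separator inside fields; the function name splitRecord is hypothetical and not part of the listed classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split one record string into its feature values. Empty fields are kept,
// so that missing values stay aligned with their feature position.
std::vector<std::string> splitRecord(const std::string& dataRecord, char separator) {
    assert(separator != '\0');
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = dataRecord.find(separator, start);
        if (pos == std::string::npos) {
            fields.push_back(dataRecord.substr(start));  // last field
            break;
        }
        fields.push_back(dataRecord.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```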

The class GaussianCompress stores the mean m and standard deviation s of a distribution of numerical feature values. In addition, the class maps any numerical value to one of 256 discrete intervals, i.e. to a discrete value between 0 and 255, and conversely returns the numerical value of the interval midpoint for a given interval index.

The discrete intervals are defined as a function of m and s as follows:

• 64 intervals of width s/64 in each of the ranges [m − s, m[ and [m, m + s[
• 32 intervals of width s/32 in each of the ranges [m − 2s, m − s[ and [m + s, m + 2s[
• 16 intervals of width s/16 in each of the ranges [m − 3s, m − 2s[ and [m + 2s, m + 3s[
• 8 intervals of width s/8 in each of the ranges [m − 4s, m − 3s[ and [m + 3s, m + 4s[
• 4 intervals of width s/4 in each of the ranges [m − 5s, m − 4s[ and [m + 4s, m + 5s[
• 2 intervals of width s/2 in each of the ranges [m − 6s, m − 5s[ and [m + 5s, m + 6s[
• 1 interval of infinite width for each of ]−∞, m − 6s[ and [m + 6s, ∞[.

The intervals are distributed symmetrically about the mean m. That is, the interval [m − s/64, m[ has interval position 127, and [m, m + s/64[ has position 128. The numerical value assigned to each interval is the interval midpoint. This means that interval 127 is assigned the value m − s/128 and interval 128 the value m + s/128. Interval 1 is assigned the value m − 6.5s, interval 254 the value m + 6.5s. The interval positions 0 and 255 are reserved for invalid and missing (SQL NULL) values, respectively. The numerical value assigned to these positions is DBL_MAX.

class GaussianCompress

class DataDescription

class DataRecord

void DataRecord::readRecordFromDataPage(const DataDescription& descr, const unsigned char* const pData)

class DataPage

bool DataPage::appendDataRecord(const string& dataRecord)
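The interval layout described above can be turned into a compact mapping between values and interval positions. The following is a sketch written from that description only, not the listing of the example implementation; the class name GaussianCompressSketch and the treatment of NaN as a missing value are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>

// Hypothetical sketch: 254 intervals around mean m with widths s/64 ... s/2,
// positions 0 and 255 reserved for invalid and missing values.
class GaussianCompressSketch {
public:
    static constexpr unsigned char INVALID = 0;
    static constexpr unsigned char MISSING = 255;

    GaussianCompressSketch(double mean, double stdDev) : ivMean(mean), ivStdDev(stdDev) {
        assert(stdDev > 0.0);
    }

    // Map a value to its interval position 1..254 (assumption: NaN means missing).
    unsigned char compress(double value) const {
        if (std::isnan(value)) return MISSING;
        const double z = (value - ivMean) / ivStdDev;
        if (z < -6.0) return 1;
        if (z >= 6.0) return 254;
        if (z >= 0.0) {
            const int band = static_cast<int>(z);  // 0..5, i.e. [m + band*s, m + (band+1)*s[
            const int count = 64 >> band;          // number of intervals in this band
            const int pos = std::min(count - 1, static_cast<int>((z - band) * count));
            return static_cast<unsigned char>(256 - 2 * count + pos);
        }
        const int band = static_cast<int>(std::ceil(-z)) - 1;  // 0..5, mirrored bands
        const int count = 64 >> band;
        const int pos = std::min(count - 1, static_cast<int>((z + band + 1) * count));
        return static_cast<unsigned char>(count + pos);
    }

    // Return the value assigned to an interval position (the interval midpoint).
    double decompress(unsigned char index) const {
        if (index == INVALID || index == MISSING) return DBL_MAX;
        if (index == 1) return ivMean - 6.5 * ivStdDev;
        if (index == 254) return ivMean + 6.5 * ivStdDev;
        const bool upper = index >= 128;
        const int mirrored = upper ? 255 - index : index;  // 2..127, lower-half position
        int band = 0;
        while ((64 >> band) > mirrored) ++band;            // count <= mirrored < 2 * count
        const int count = 64 >> band;
        const double zLower = -(band + 1) + (mirrored - count + 0.5) / count;
        return upper ? ivMean - zLower * ivStdDev : ivMean + zLower * ivStdDev;
    }

private:
    double ivMean;
    double ivStdDev;
};
```

The sketch relies on the symmetry of the layout: position i in the upper half mirrors position 255 − i in the lower half, so only the lower-half midpoints need to be computed explicitly.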

These computer program objects are intended for execution in an electronic data processing system with at least one analysis server and at least one on-site client computer. The analysis server has one or more computer program objects for implementing a self-adapting neuron network that is to be trained on a database with a plurality of data records having many features. The on-site client computer has one or more computer program objects for subjecting data supplied to it to data preprocessing and/or data compression and for sending the data from the on-site client computer to the analysis server. One or more computer program objects of the analysis server train the self-adapting neuron network with the received preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network. One or more computer program objects of the analysis server then perform an analysis in order to produce a self-adapting neuron network model or another data mining analysis result. One or more computer program objects of the analysis server cause the self-adapting neuron network model or other data mining analysis result to be sent from the analysis server to the on-site client computer. One or more computer program objects of the on-site client computer subject the data of the self-adapting neuron network model or other data mining analysis result to decompression.

One or more computer program objects of the on-site client computer can adapt the type of data compression to the type or structure of the data.

One or more computer program objects of the analysis server can train the self-adapting neuron network with the received anonymized data until a converged network state results that adequately represents the data.

One or more computer program objects of the analysis server can present the data to the self-adapting neuron network about 100 to about 200 times.

One or more computer program objects of the on-site client computer can subject the data supplied to it, in a volume ranging from about 10 gigabytes up to several terabytes, to the data preprocessing and the data compression.

One or more computer program objects of the on-site client computer can read the data supplied to it once during the data preprocessing and transform the original features contained therein into purely numerical, normalized features.

One or more computer program objects of the on-site client computer can, during the data compression, compress the normalized numerical feature values of the data supplied to it so that between 2 bits and about 8 to 10 bits of storage are needed per feature value.

One or more computer program objects of the analysis server can keep at least one sub-object of the compressed data in main memory for processing at any given time, while other sub-objects of the compressed data are kept as binary data objects on the mass storage of the analysis server, from where they are read into main memory for processing by block-wise read operations.

One or more computer program objects of the on-site client computer can determine a data compression method to be used depending on a compression error that is acceptable for the respective analysis task, the compression method used and the compression rate to be achieved being determined depending on the different feature types (Boolean, numerical, nominal (textual)).

One or more computer program objects of the on-site client computer can set the mean prediction error for numerical features between about 0.01 and about 0.1.

One or more computer program objects of the on-site client computer can discretize the normalized numerical feature values into a number of discrete intervals such that the mean discretization error (|value − discrete value|) is about 0.001.

One or more computer program objects of the on-site client computer may map the discretized numerical value onto 2ⁿ - 2 interval indices plus the states 'value not present' and 'invalid value', and store it in n bits of memory, where 1 < n ≤ 64.
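As an illustration only, storing such indices in n bits each could be sketched as follows. The bit layout (first index in the most significant bits, zero padding in the last byte) and the function names pack_indices and unpack_indices are assumptions and are not taken from the description.

```python
def pack_indices(indices, n):
    """Pack n-bit indices into bytes, first index in the most significant bits."""
    if not 1 < n <= 64:
        raise ValueError("n must satisfy 1 < n <= 64")
    acc = 0
    for idx in indices:
        if not 0 <= idx < (1 << n):
            raise ValueError("index does not fit into n bits")
        acc = (acc << n) | idx
    total_bits = n * len(indices)
    pad = (-total_bits) % 8            # zero bits appended to fill the last byte
    acc <<= pad
    return acc.to_bytes((total_bits + pad) // 8, "big")


def unpack_indices(data, n, count):
    """Recover `count` n-bit indices from bytes written by pack_indices."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << n) - 1
    return [(acc >> (total_bits - n * (i + 1))) & mask for i in range(count)]
```

With n = 4, three indices occupy 12 bits and are stored in two bytes. The number of stored indices has to be kept separately in this sketch, because the padding bits cannot be told apart from data.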

One or more computer program objects of the on-site client computer may define the interval division in a value-distribution-dependent, non-equidistant manner, wherein preferably the widths of the discrete sub-intervals are made particularly small in regions of high value density.

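One or more computer program objects of the on-site client computer may, for numerical data value distributions that follow the Gaussian or normal distribution with a mean (m) and a standard deviation (s), define

- about 64 intervals of width s/64 in the ranges [m - s, m[ and [m, m + s[;
- about 32 intervals of width s/32 in the ranges [m - 2s, m - s[ and [m + s, m + 2s[;
- about 16 intervals of width s/16 in the ranges [m - 3s, m - 2s[ and [m + 2s, m + 3s[;
- about 8 intervals of width s/8 in the ranges [m - 4s, m - 3s[ and [m + 3s, m + 4s[;
- about 4 intervals of width s/4 in the ranges [m - 5s, m - 4s[ and [m + 4s, m + 5s[;
- about 2 intervals of width s/2 in the ranges [m - 6s, m - 5s[ and [m + 5s, m + 6s[; and
- about 1 interval of infinite width for ]-∞, m - 6s[ and [m + 6s, ∞[.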

One or more computer program objects of the on-site client computer may distribute the discrete intervals symmetrically about the mean m as a function of the mean (m) and the standard deviation (s), wherein preferably

- the interval [m - s/64, m[ has the interval position 127, and
- the interval [m, m + s/64[ has the position 128, and wherein preferably
- the numerical value assigned to each interval is the interval midpoint,

and the interval positions 0 and 255 are reserved for invalid and missing values, respectively.
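The symmetric layout can be sketched for n = 8 as follows. The sketch assumes the band widths s/64, s/32, ..., s/2 per standard deviation that the corresponding claim lists for Gaussian data, which gives 2 × 127 = 254 intervals at positions 1 to 254. Treating None as a missing value, treating NaN or non-numeric input as invalid, and returning the finite edge for the two open end intervals are illustrative conventions, as are all function names.

```python
import bisect
import math

INVALID, MISSING = 0, 255  # reserved positions named in the text


def gaussian_edges(m, s):
    """Return the 253 finite boundaries separating the 254 intervals."""
    lower, upper = [], []
    for band in range(6):              # band k spans [m + k*s, m + (k+1)*s[
        count = 64 >> band             # 64, 32, 16, 8, 4, 2 intervals per band
        width = s / count              # s/64, s/32, ..., s/2
        for j in range(1, count + 1):
            upper.append(m + band * s + j * width)
            lower.append(m - band * s - j * width)
    return sorted(lower) + [m] + upper


def encode(value, edges):
    """Map a value to an interval position 1..254, or to a reserved position."""
    if value is None:
        return MISSING
    if not isinstance(value, (int, float)) or math.isnan(value):
        return INVALID
    # Intervals are half-open [a, b[, so a value on an edge goes to the upper interval.
    return 1 + bisect.bisect_right(edges, value)


def decode(position, edges):
    """Return the interval midpoint; open end intervals fall back to their finite edge."""
    if position in (INVALID, MISSING):
        return None
    if position == 1:
        return edges[0]
    if position == len(edges) + 1:
        return edges[-1]
    return (edges[position - 2] + edges[position - 1]) / 2
```

For m = 0.5 and s = 0.25, a value of 0.499 falls into [m - s/64, m[ and is encoded as position 127, while the value 0.5 itself is encoded as position 128.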

One or more computer program objects of the on-site client computer may first map numerical features onto normalized features with mean m = 0.5 and standard deviation s = 0.25.
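One plausible reading of this mapping is a linear rescaling x' = 0.5 + 0.25 · (x - mean) / std. The following sketch assumes that reading, uses the population standard deviation, and maps a constant feature to 0.5; the function name normalize is hypothetical.

```python
import math


def normalize(values, target_mean=0.5, target_std=0.25):
    """Linearly rescale values to the target mean and standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [target_mean] * n       # constant feature: no spread to rescale
    return [target_mean + target_std * (v - mean) / std for v in values]
```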

One or more computer program objects of the on-site client computer may, in order to determine the mean (m) and the standard deviation (s) of the feature to be discretized, read a fraction of about 1 per mille to about 10% of the data records and calculate from it, for each numerical feature in the data, a preliminary mean (m^(prel)) and a preliminary spread (s^(prel)).
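A minimal sketch of this sampling step is shown below. The text does not say how the fraction of records is selected; independent random selection with a fixed seed is assumed here, and the function name preliminary_stats is hypothetical.

```python
import math
import random


def preliminary_stats(records, fraction=0.01, seed=0):
    """Estimate a preliminary mean and spread from a random fraction of the records."""
    if not 0.001 <= fraction <= 0.1:
        raise ValueError("fraction should lie between 1 per mille and 10%")
    rng = random.Random(seed)
    sample = [x for x in records if rng.random() < fraction]
    if len(sample) < 2:
        raise ValueError("sample too small")
    m_prel = sum(sample) / len(sample)
    s_prel = math.sqrt(sum((x - m_prel) ** 2 for x in sample) / (len(sample) - 1))
    return m_prel, s_prel
```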

One or more computer program objects of the on-site client computer may read all data records and carry out, for all numerical features, a preliminary discretization based on the preliminary mean (m^(prel)) and the preliminary spread (s^(prel)).

One or more computer program objects of the on-site client computer may define 65532 equidistant intervals of width s^(prel)/256 centered about the preliminary mean (m^(prel)), as well as two open end intervals ]-∞, m^(prel) - 32766/256·s^(prel)[ and [m^(prel) + 32766/256·s^(prel), ∞[ and two interval indices representing 'value not present' and 'invalid numerical value', and may log, for all intervals, the frequencies with which a value falls into the respective interval.
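The preliminary discretization uses 65532 + 2 + 2 = 65536 indices, so each index fits into 16 bits. A sketch under assumptions: the text does not fix which index number belongs to which interval, so the numbering below (0 for the lower end interval, 1 to 65532 for the equidistant intervals, 65533 for the upper end interval, 65534 and 65535 for the two special states) is hypothetical, as are the function names.

```python
import math
from collections import Counter

N_EQUI = 65532                 # equidistant intervals of width s_prel/256
LOWER_END = 0                  # ]-inf, m_prel - 32766/256*s_prel[  (numbering assumed)
UPPER_END = N_EQUI + 1         # [m_prel + 32766/256*s_prel, inf[
MISSING = N_EQUI + 2           # 'value not present'
INVALID = N_EQUI + 3           # 'invalid numerical value'


def preliminary_index(value, m_prel, s_prel):
    """Return the 16-bit preliminary interval index of a single value."""
    if value is None:
        return MISSING
    if not isinstance(value, (int, float)) or math.isnan(value):
        return INVALID
    width = s_prel / 256
    lo = m_prel - 32766 * width
    hi = m_prel + 32766 * width
    if value < lo:
        return LOWER_END
    if value >= hi:
        return UPPER_END
    # min() guards against floating-point rounding at the upper boundary.
    return 1 + min(int((value - lo) / width), N_EQUI - 1)


def interval_frequencies(values, m_prel, s_prel):
    """Log how often each preliminary interval is hit."""
    return Counter(preliminary_index(v, m_prel, s_prel) for v in values)
```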

One or more computer program objects of the on-site client computer may analyze the logged frequencies of the interval occupancies and derive from them the shape of the value distribution and the matching final discretization.

One or more computer program objects of the on-site client computer may, in the case of an at least approximately uniform distribution, form two open end intervals and 2ⁿ - 4 equidistant intervals between them, wherein one of the preliminary interval boundaries is set as the upper limit of the lower end interval such that about 1/2ⁿ of all valid values lie in the lower end interval, and the width of the 2ⁿ - 4 equidistant intervals is set to the smallest multiple of the preliminary interval width that lets the occupancy of the remaining upper end interval grow to no more than 1/2ⁿ of all valid values, where 1 < n ≤ 64.
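The rule for the uniform case can be sketched on a logged histogram as follows. The input format is an assumption: counts[0] and counts[-1] hold the occupancies of the two preliminary open end intervals, counts[1:-1] those of the equidistant preliminary intervals, which start at lo and have width w. The function name and the choice of the boundary whose lower occupancy is closest to 1/2ⁿ are likewise illustrative.

```python
from itertools import accumulate


def uniform_discretization(counts, lo, w, n):
    """Return the 2**n - 3 boundaries of the final intervals for uniform data."""
    inner = 2 ** n - 4                      # number of equidistant final intervals
    if inner < 1:
        raise ValueError("n must be at least 3 for this sketch")
    total = sum(counts)
    target = total / 2 ** n
    # below[i] = number of valid values below the preliminary boundary lo + i*w
    below = list(accumulate(counts))[:-1]
    i1 = min(range(len(below)), key=lambda i: abs(below[i] - target))
    k = 1                                   # final width as a multiple of w
    while i1 + k * inner < len(below):
        if total - below[i1 + k * inner] <= target:
            g1 = lo + i1 * w                # upper limit of the lower end interval
            return [g1 + j * k * w for j in range(inner + 1)]
        k += 1
    raise ValueError("preliminary range too narrow for this n")
```

With n = 3 and 16 preliminary intervals of 10 values each, the lower end interval ends at the second preliminary boundary (20 of 160 values, i.e. 1/8), and the smallest admissible width is three preliminary widths, which leaves 20 values in the upper end interval.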

One or more computer program objects of the on-site client computer may, in the case of an at least approximately exponential distribution (density function d(x) = λ·e^(-λx)), define two open end intervals and 2ⁿ - 4 intervals of decreasing width between them, such that one of the preliminary interval boundaries is set as the upper limit (g₁) of the lower end interval so that about 1/(2ⁿ - 2) of all valid values lie in the lower end interval, and the interval boundary g_end is determined above which about 1/(2ⁿ - 2) of all valid values lie, wherein λ is determined from g₁ and g_end as λ ≔ ln(2ⁿ - 2)/(g_end - g₁), and the desired width (b) of the first intermediate interval is determined as b ≔ ln((2ⁿ - 3)/(2ⁿ - 2))/λ, where 1 < n ≤ 64.

One or more computer program objects of the on-site client computer may set, as the next interval boundary (g₂), an existing preliminary interval boundary such that the absolute value of the next interval boundary (g₂) minus the upper limit (g₁) of the lower end interval minus the desired width (b), i.e. |g₂ - g₁ - b|, becomes minimal, whereupon the desired width (b) of the first intermediate interval is multiplied by the factor e^(λb) and the subsequent intervals are calculated correspondingly.

One or more computer program objects of the on-site client computer may set the preliminary interval boundary closest to the true mean (m) as the center (m), and set the standard deviation (s) such that it comes as close as possible to the true standard deviation and s/64 is a multiple of the existing interval width, wherein the existing preliminary intervals are combined into larger new intervals that have a decreasing width distribution of s/64, s/32, s/16, ....
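Snapping the center and the standard deviation to the preliminary grid could look like the following sketch. It assumes the preliminary boundaries lie at lo + i·w for integer i; the function name snap_to_grid and the use of rounding to the nearest multiple are assumptions.

```python
def snap_to_grid(true_mean, true_std, lo, w):
    """Return (m, s) with m on a preliminary boundary and s/64 a multiple of w."""
    m = lo + round((true_mean - lo) / w) * w       # nearest preliminary boundary
    k = max(1, round(true_std / (64 * w)))         # s/64 = k * w, at least one width
    return m, 64 * k * w
```

For a preliminary width w = 1/1024, a true mean of 0.5003 and a true standard deviation of 0.26, this yields m = 0.5 and s = 0.25, so that s/64 equals four preliminary interval widths.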

One or more computer program objects of the on-site client computer may store Boolean features as distinct feature values ('first valid value', 'second valid value', 'value not present', 'invalid value').
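Four states fit into two bits. The numbering of the states and the treatment of None or an empty string as 'value not present' in the following sketch are assumptions; the text only names the four states.

```python
FIRST, SECOND, MISSING, INVALID = 0, 1, 2, 3   # hypothetical 2-bit codes


def encode_boolean(value, first="yes", second="no"):
    """Map a raw Boolean feature value to one of four 2-bit states."""
    if value is None or value == "":
        return MISSING
    if value == first:
        return FIRST
    if value == second:
        return SECOND
    return INVALID
```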

One or more computer program objects of the on-site client computer may split nominal features, for use in the SOM map analysis, into several Boolean or numerical 0/1 features.
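Such a split corresponds to what is commonly called one-hot encoding. A minimal sketch, in which the list of categories is supplied by the caller and the function name split_nominal is hypothetical:

```python
def split_nominal(values, categories):
    """Turn one nominal feature into one 0/1 feature per listed category."""
    return [[1 if v == c else 0 for c in categories] for v in values]
```

A value that is not in the category list yields a row of zeros in this sketch.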

One or more computer program objects of the on-site client computer may store nominal features as the position of the feature value in the list of all valid values of this feature, wherein two index values are maintained that represent 'no value present' and 'value does not occur in the list of valid values'.

One or more computer program objects of the on-site client computer may index up to about 100, preferably up to about 60, different nominal values individually or as groups of individual nominal values.

About 10 to about 20 or 30, preferably about 15, of the most frequent values may be selected by one or more computer program objects of the on-site client computer for evaluation as individual values, and all other values may be combined into a single value group under an index 'other'.

One or more computer program objects of the on-site client computer may perform the following steps:

- reading all original data records into the main memory of the on-site client computer (12), wherein for each nominal feature in the original data a value occurring for the first time is stored in a 'dictionary' file in which each occurring value is assigned an index number and its frequency of occurrence is recorded;
- as soon as a user-defined limit of different values, e.g. about 30, 1000 or 65534, has been entered in the dictionary,
- stopping the insertion of new values, and
- entering all subsequently occurring values that do not appear in the dictionary under the penultimate index position, which represents 'other value', while the last index position represents 'no value present';
- replacing the nominal values in the dictionary by the index number as an 8-bit or 16-bit integer.
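The dictionary steps above can be sketched as follows. The sketch keeps the dictionary in memory rather than in a file, treats None as 'no value present', and uses the hypothetical function name build_dictionary. With a limit of 65534, the largest index is 65535 and fits into a 16-bit integer.

```python
def build_dictionary(values, limit=30):
    """Index nominal values in order of first occurrence, capped at `limit` entries.

    Returns (index_of, counts, encoded): the value-to-index dictionary, the
    occurrence count per index position, and the index-encoded records.
    """
    other, missing = limit, limit + 1      # penultimate and last index position
    index_of = {}
    counts = [0] * (limit + 2)
    encoded = []
    for v in values:
        if v is None:
            idx = missing
        elif v in index_of:
            idx = index_of[v]
        elif len(index_of) < limit:
            idx = index_of[v] = len(index_of)   # first occurrence gets the next index
        else:
            idx = other                          # dictionary is full
        counts[idx] += 1
        encoded.append(idx)
    return index_of, counts, encoded
```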

One or more computer program objects of the on-site client computer may perform the following step:

- sorting the dictionary by descending frequency of occurrence of the entries.

One or more computer program objects of the on-site client computer may perform a compression that is adapted to the actual value frequencies of each nominal feature and uses 4 bits or one byte of memory per nominal feature.

One or more computer program objects of the on-site client computer may perform the following steps:

- recording the 14 or 253 most frequent values as separate values;
- assigning the index numbers 0 to 13 or 0 to 253 to these values in the dictionary, preferably according to their frequency;
- assigning the index number 14 or 254 to all other values in the dictionary, including the previous index 'other value';
- assigning the index number 15 or 255 to the previous index 'value not present'; and
- storing each of the index numbers in 4 bits or 1 byte of memory, respectively.
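For the 4-bit variant, the final index assignment could be sketched as follows. The input is assumed to be a mapping from each dictionary value to its occurrence frequency; ties are broken alphabetically, which the text does not specify, and the function names are hypothetical.

```python
def final_nominal_codes(frequency, top=14):
    """Assign indices 0..top-1 to the most frequent values, most frequent first."""
    ranked = sorted(frequency, key=lambda v: (-frequency[v], str(v)))
    return {v: i for i, v in enumerate(ranked[:top])}


def encode_nominal(value, codes, top=14):
    """Return the final index: `top` for 'other value', `top + 1` for a missing value."""
    if value is None:
        return top + 1
    return codes.get(value, top)
```

With top = 14, every index lies between 0 and 15 and therefore fits into 4 bits.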

One or more computer program objects of the on-site client computer may carry out a preliminary compression in parallel on partitioned data using several processor cores, communicate global statistics or value frequencies between the individual threads, and carry out the final compression in parallel.
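How per-partition statistics might be merged into global statistics is sketched below with Python's standard thread pool. The choice of count, sum and sum of squares as the exchanged statistics is an assumption, as are the function names. In CPython, threads do not run such CPU-bound work on several cores at once, so the sketch illustrates the merge logic rather than the speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk):
    """Per-partition statistics: count, sum and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def global_mean_std(partitions, workers=4):
    """Merge per-partition statistics into a global mean and standard deviation."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_sums, partitions))
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)   # clamp rounding noise
    return mean, math.sqrt(variance)
```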

One or more computer program objects of the on-site client computer may anonymize the data in that the data compression is carried out on the on-site client computer, the compressed data with the interval indices for numerical data and the value indices for binary and nominal data are transmitted to the data analysis server (10), and the decompression information, namely means, standard deviations, distribution shape, minimum and maximum for numerical data and, for binary and nominal data, the value 'dictionaries' that allow the actual value to be inferred from the value index, is stored on the on-site client computer (12).

One or more computer program objects of the analysis server may create analyses, evaluations and SOM models on the basis of the interval and value indices of the compressed data and send these results back to the on-site client computer (12).

One or more computer program objects of the on-site client computer may link the results with the decompression information, whereby the results are available together with the original information.
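The round trip between client and server can be illustrated for a single nominal feature. In this sketch a simple frequency count stands in for the server-side SOM or data mining analysis, and all function names are hypothetical; the point is only that the server sees indices while the dictionary stays on the client.

```python
from collections import Counter


def client_compress(values):
    """Client side: replace values by indices; the dictionary stays on the client."""
    dictionary = sorted(set(values))
    index_of = {v: i for i, v in enumerate(dictionary)}
    return [index_of[v] for v in values], dictionary


def server_analyse(indices):
    """Server side: works on indices only (frequency count as a stand-in analysis)."""
    return Counter(indices).most_common()


def client_decode(result, dictionary):
    """Client side: link the anonymized result with the decompression information."""
    return [(dictionary[i], count) for i, count in result]
```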

A first computer program product may contain one or more computer program objects for carrying out one or more of the aforementioned method steps on an on-site client computer.

A second computer program product may contain one or more computer program objects for carrying out one or more of the aforementioned method steps on an analysis server.

Claims

Electronic data processing system for the analysis of data, comprising
- at least one analysis server (10) and
- at least one on-site client computer (12), wherein
- the analysis server (10) is set up and programmed to implement a self-adapting neuron network that is to be trained on a large database with a plurality of data records having many features, wherein
- the on-site client computer (12) is set up and programmed to subject data supplied to it to
- data preprocessing and/or
- data compression
before the data are sent from the on-site client computer (12) via an electronic network (14) to the analysis server (10), wherein
- a compression of the data that depends on the data type is carried out, which anonymizes the data in that the data are transformed into a confidential part and a non-confidential part, the non-confidential part of the data is transmitted to the analysis server (10) and the confidential part of the data is stored separately on the on-site client, and wherein
- the analysis server (10) is set up and programmed to train the self-adapting neuron network with the received preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network, and then to carry out an analysis in order to produce a self-adapting neuron network model or another data mining analysis result, and wherein
- the analysis server (10) is set up and programmed to send the self-adapting neuron network model or other data mining analysis result from the analysis server (10) to the on-site client computer (12), and
- the on-site client computer (12) is set up and programmed to decompress the data of the self-adapting neuron network model or other data mining analysis result in that, after the anonymized analysis result has arrived from the analysis server (10), this anonymized analysis result is merged with the confidential part of the data into a de-anonymized plain-text analysis result.

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to adapt the type of data compression to the type or structure of the data by transmitting interval indices for numerical data and value indices for binary and nominal data to the data analysis server, while the decompression information, namely means, standard deviations, distribution shape, minimum and maximum for numerical data and, for binary and nominal data, value 'dictionaries' that allow the actual value to be inferred from the value index, is stored on the on-site client computer.

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to subject data supplied to it in a volume of up to about 10 gigabytes to several terabytes to data preprocessing and data compression.

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to read the data supplied to it once during data preprocessing and to transform original features contained therein into purely numerical, normalized features.

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to compress the normalized numerical feature values of the data supplied to it during data compression such that between two bits and about 8 to 10 bits of storage space are required per feature value.

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to define a data compression method to be used depending on a compression error that is acceptable for the respective analysis task, wherein the compression method to be used and the compression rate to be achieved are defined depending on the different feature types (Boolean, numerical, nominal (textual)).

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to set the mean prediction error for numerical features to between about 0.01 and about 0.1.

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to discretize the normalized numerical feature values into a number of discrete intervals such that the mean discretization error (|value - discrete value|) is about 0.001.

An electronic data processing system according to claim 1, wherein the on-site client computer (12) is set up and programmed to map the discretized numerical value onto 2ⁿ - 2 interval indices plus the states 'value not present' and 'invalid value', and to store it in n bits of memory, where 1 < n ≤ 64.

An electronic data processing system according to claim 9, wherein the on-site client computer (12) is set up and programmed such that the interval division is defined in a value-distribution-dependent, non-equidistant manner, wherein preferably the widths of the discrete sub-intervals are made particularly small in regions of high value density.

An electronic data processing system according to claim 9 or 10, wherein the on-site client computer (12) is set up and programmed to define, for numerical data value distributions that follow the Gaussian or normal distribution with a mean (m) and a standard deviation (s),
• about 64 intervals of width s/64 in the ranges [m - s, m[ and [m, m + s[;
• about 32 intervals of width s/32 in the ranges [m - 2s, m - s[ and [m + s, m + 2s[;
• about 16 intervals of width s/16 in the ranges [m - 3s, m - 2s[ and [m + 2s, m + 3s[;
• about 8 intervals of width s/8 in the ranges [m - 4s, m - 3s[ and [m + 3s, m + 4s[;
• about 4 intervals of width s/4 in the ranges [m - 5s, m - 4s[ and [m + 4s, m + 5s[;
• about 2 intervals of width s/2 in the ranges [m - 6s, m - 5s[ and [m + 5s, m + 6s[; and
• about 1 interval of infinite width for ]-∞, m - 6s[ and [m + 6s, ∞[.

An electronic data processing system according to any one of claims 9 to 11, wherein the on-site client computer (12) is set up and programmed such that the discrete intervals are distributed symmetrically about the mean m as a function of the mean (m) and the standard deviation (s), wherein preferably - the interval [m - s/64, m[ has the interval position 127, and - the interval [m, m + s/64[ has the position 128, and wherein preferably - the numerical value assigned to each interval is the interval midpoint, and the interval positions 0 and 255 are reserved for invalid and missing values, respectively.

An electronic data processing system according to any one of claims 9 to 12, wherein the on-site client computer (12) is set up and programmed such that numerical features are first mapped onto normalized features with mean m = 0.5 and standard deviation s = 0.25.

An electronic data processing system according to any one of claims 9 to 13, wherein the on-site client computer (12) is set up and programmed to read, in order to determine the mean (m) and the standard deviation (s) of the feature to be discretized, a fraction of about 1 per mille to about 10% of the data records and to calculate from it, for each numerical feature in the data, a preliminary mean (m^(prel)) and a preliminary spread (s^(prel)).

An electronic data processing system according to any one of claims 9 to 14, wherein the on-site client computer (12) is set up and programmed to read all data records and to carry out, for all numerical features, a preliminary discretization based on the preliminary mean (m^(prel)) and the preliminary spread (s^(prel)).

An electronic data processing system according to any one of claims 9 to 15, wherein the on-site client computer (12) is set up and programmed to define 65532 equidistant intervals of width s^(prel)/256 centered about the preliminary mean (m^(prel)), as well as two open end intervals ]-∞, m^(prel) - 32766/256·s^(prel)[ and [m^(prel) + 32766/256·s^(prel), ∞[ and two interval indices representing 'value not present' and 'invalid numerical value', and to log, for all intervals, the frequencies with which a value falls into the respective interval.

An electronic data processing system according to claim 16, wherein the on-site client computer (12) is set up and programmed to analyze the logged frequencies of the interval occupancies and to derive from them the shape of the value distribution and the matching final discretization.

An electronic data processing system according to any one of claims 9 to 17, wherein the on-site client computer (12) is set up and programmed to form, in the case of an at least approximately uniform distribution, two open end intervals and 2ⁿ - 4 equidistant intervals between them, wherein one of the preliminary interval boundaries is set as the upper limit of the lower end interval such that about 1/2ⁿ of all valid values lie in the lower end interval, and the width of the 2ⁿ - 4 equidistant intervals is set to the smallest multiple of the preliminary interval width that lets the occupancy of the remaining upper end interval grow to no more than 1/2ⁿ of all valid values, where 1 < n ≤ 64.

An electronic data processing system according to any one of claims 9 to 17, wherein the on-site client computer (12) is set up and programmed to define, in the case of an at least approximately exponential distribution (density function d(x) = λ·e^(-λx)), two open end intervals and 2ⁿ - 4 intervals of decreasing width between them, such that one of the preliminary interval boundaries is set as the upper limit (g₁) of the lower end interval so that about 1/(2ⁿ - 2) of all valid values lie in the lower end interval, and the interval boundary g_end is determined above which about 1/(2ⁿ - 2) of all valid values lie, wherein λ is determined from g₁ and g_end as λ ≔ ln(2ⁿ - 2)/(g_end - g₁), and the desired width (b) of the first intermediate interval is determined as b ≔ ln((2ⁿ - 3)/(2ⁿ - 2))/λ, where 1 < n ≤ 64.

An electronic data processing system according to the preceding claim, wherein the on-site client computer (12) is set up and programmed to set, as the next interval boundary (g₂), an existing preliminary interval boundary such that the absolute value of the next interval boundary (g₂) minus the upper limit (g₁) of the lower end interval minus the desired width (b), i.e. |g₂ - g₁ - b|, becomes minimal, to multiply the desired width (b) of the first intermediate interval by the factor e^(λb), and to calculate the subsequent intervals correspondingly.

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer (12) is set up and programmed to set the preliminary interval boundary closest to the true mean (m) as the center (m), and to set the standard deviation (s) such that it comes as close as possible to the true standard deviation and s/64 is a multiple of the existing interval width, wherein the existing preliminary intervals are combined into larger new intervals that have a decreasing width distribution of s/64, s/32, s/16, ....

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer (12) is set up and programmed to store Boolean features as distinct feature values ('first valid value', 'second valid value', 'value not present', 'invalid value').

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer (12) is set up and programmed to split nominal features, for use in the SOM map analysis, into several Boolean or numerical 0/1 features.

An electronic data processing system according to the preceding claim, wherein the on-site client computer (12) is set up and programmed to store nominal features as the position of the feature value in the list of all valid values of this feature and to maintain two index values that represent 'no value present' and 'value does not occur in the list of valid values'.

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is configured and programmed to index up to about 100, preferably up to about 60, different nominal values individually or as groups of individual nominal values.

Electronic data processing system according to the preceding claim, wherein about 10 to about 20 or 30, preferably about 15, of the most frequent values are selected for evaluation as single values, and all other values are combined under an index 'others' into a single value group.

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is set up and programmed to carry out the following steps: - reading all the original data records into the memory of the on-premises client computer ( 12 ) and, for each nominal feature in the original data, storing each newly occurring value in a dictionary file in which each occurring value is assigned an index number and its occurrence frequency is recorded; - as soon as a user-defined limit, e.g. 30, 1000, or 65534, of different values has been entered in the dictionary, - ending the insertion of new values, and - entering all subsequently occurring values that are not in the dictionary under the penultimate index position, which denotes 'other value', while the last index position denotes 'no value present'; and - replacing the nominal values with their index numbers from the dictionary, stored as 8-bit or 16-bit integers.
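
The dictionary steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_dictionary` and `encode`, the use of `None` for a missing value, and the in-memory dicts standing in for the dictionary file are all assumptions. The layout places 'other value' at index `limit` and 'no value present' at index `limit + 1`, so a limit of 65534 fills a 16-bit index exactly.

```python
def build_dictionary(values, limit=30):
    """Assign index numbers in order of first occurrence, up to `limit` entries.

    Returns (index_of, counts, other_idx, missing_idx).
    Assumed layout: index `limit` means 'other value' (penultimate position),
    index `limit + 1` means 'no value present' (last position).
    """
    index_of = {}
    other_idx, missing_idx = limit, limit + 1
    counts = {other_idx: 0, missing_idx: 0}
    for v in values:
        if v is None:
            counts[missing_idx] += 1
        elif v in index_of:
            counts[index_of[v]] += 1
        elif len(index_of) < limit:
            index_of[v] = len(index_of)   # next free index number
            counts[index_of[v]] = 1
        else:
            counts[other_idx] += 1        # dictionary is closed: count as 'other'
    return index_of, counts, other_idx, missing_idx


def encode(values, index_of, other_idx, missing_idx):
    """Replace each nominal value by its index number."""
    return [missing_idx if v is None else index_of.get(v, other_idx)
            for v in values]
```

With `limit=2`, the input `["a", "b", "a", None, "c", "d", "a"]` is encoded as `[0, 1, 0, 3, 2, 2, 0]`: the values "c" and "d" arrive after the dictionary is full and share the 'other value' index.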

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to perform the following step: - Sorting the dictionary according to descending occurrence frequency of the entries.

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is configured and programmed to perform compression adapted to the actual value frequencies of each nominal feature using 4 bits or one byte of memory per nominal feature.

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to carry out the following steps: - acquiring the 14 or 253 most frequent values as separate values; - assigning the index numbers 0 to 13 or 0 to 253 to these values in the dictionary, preferably according to their frequency; - assigning the index number 14 or 254 to all other values in the dictionary, including the previous index 'other value'; - assigning the index number 15 or 255 to the previous index 'value not available'; and - storing each of the index numbers in 4 bits or 1 byte of memory, respectively.
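
A minimal sketch of the 4-bit variant of these steps, assuming the value frequencies are already available as a mapping from value to count. The names `recode_to_4_bits` and `pack_nibbles`, the tie-breaking rule for equal frequencies, and the nibble order within a byte are illustrative assumptions; the one-byte variant is not shown.

```python
OTHER, MISSING = 14, 15   # index numbers named in the claim for the 4-bit case


def recode_to_4_bits(counts_by_value):
    """Give the 14 most frequent values the indices 0..13, ordered by frequency.

    Ties are broken by the string form of the value (an assumption).
    Returns (code_of, encode).
    """
    ranked = sorted(counts_by_value,
                    key=lambda v: (-counts_by_value[v], str(v)))
    code_of = {v: i for i, v in enumerate(ranked[:14])}

    def encode(value):
        if value is None:
            return MISSING
        return code_of.get(value, OTHER)

    return code_of, encode


def pack_nibbles(codes):
    """Store two 4-bit codes per byte, high nibble first.

    An odd-length input is padded with 0, so the record count has to be
    kept separately to ignore the padding on decoding.
    """
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))
```

Any value outside the 14 most frequent ones, including values never seen while the frequencies were collected, falls into the shared index 14.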

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to carry out a preliminary compression with multiple computer cores on partitioned data in parallel, to communicate global statistics or value frequencies between individual threads and to execute the final compression in parallel.
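
The three phases named in this claim (preliminary pass per partition, exchange of global value frequencies, final compression in parallel) might look as follows. This is only a sketch under stated assumptions: the function names, the `keep` parameter and the use of `ThreadPoolExecutor` are illustrative, and in CPython a process pool would be needed for real multi-core speed-up because threads share the interpreter lock.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(partition):
    """Preliminary pass: local value frequencies of one partition."""
    return Counter(partition)


def global_frequencies(partitions, workers=4):
    """Merge the per-partition counts into global value frequencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(count_partition, partitions))
    total = Counter()
    for counts in partial:
        total.update(counts)
    return total


def compress_in_parallel(partitions, keep=14, workers=4):
    """Final pass: every partition is encoded with the same global dictionary."""
    freq = global_frequencies(partitions, workers)
    ranked = sorted(freq, key=lambda v: (-freq[v], str(v)))
    code_of = {v: i for i, v in enumerate(ranked[:keep])}
    other = keep   # shared index for all remaining values (assumed layout)

    def encode(partition):
        return [code_of.get(v, other) for v in partition]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return code_of, list(pool.map(encode, partitions))
```

The point of the middle step is that the ranking is computed on the merged counts, so all partitions agree on which index number stands for which value.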

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is set up and programmed to perform the anonymization of the data by performing the data compression on the on-premises client computer, wherein the compressed data, comprising the numeric interval indexes and the binary and nominal value indexes, are transmitted to the data analysis server, while the decompression information, namely for numeric data the mean values, standard deviations, distribution form, minimum and maximum, and for binary and nominal data the value "dictionaries" that allow the actual value to be inferred from the value index, is stored on the on-premises client computer, and the analysis server creates analyses, evaluations and SOM models based on the interval and value indices of the compressed data and sends these results back to the on-premises client computer, which is set up and programmed to link the results with the decompression information, thereby providing the results with the original information.

Method for analyzing data in an electronic data processing system comprising at least one analysis server ( 10 ) and at least one on-premises client computer ( 12 ), wherein the analysis server ( 10 ) has one or more computer program objects for implementing a self-adapting neuron network to be trained on a database having a plurality of records with a plurality of features, the on-premises client computer ( 12 ) has one or more computer program objects that subject data supplied to it to data preprocessing and / or data compression before the data are sent from the on-premises client computer ( 12 ) to the analysis server ( 10 ), wherein a data-type-dependent compression of the data is performed, which anonymizes the data by transforming the data into a confidential and a non-confidential part, the non-confidential part of the data being transmitted to the analysis server ( 10 ) and the confidential part of the data being stored separately on the on-premises client, and one or more computer program objects of the analysis server ( 10 ) train the self-adapting neuron network with the received preprocessed / compressed data by repeatedly presenting the data to the self-adapting neuron network, and one or more computer program objects of the analysis server ( 10 ) then perform an analysis to create a self-adapting neuron network model or other data mining analysis result, and one or more computer program objects of the analysis server ( 10 ) send the self-adapting neuron network model or other data mining analysis result from the analysis server ( 10 ) to the on-premises client computer ( 12 ), and one or more computer program objects of the on-premises client computer ( 12 ) subject the data of the self-adapting neuron network model or other data mining analysis result to decompression, wherein, after the anonymized analysis result has arrived from the analysis server ( 10 ), this anonymized analysis result is merged with the confidential part of the data into a de-anonymized plain-text analysis result.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) adapt the type of data compression to the type or structure of the data by transmitting interval indices for numerical data and value indices for binary and nominal data to the data analysis server, while the decompression information, namely for numerical data the mean values, standard deviations, distribution form, minimum and maximum, and for binary and nominal data the value "dictionaries" that allow the actual value to be inferred from the value index, is stored on the on-premises client computer.

Method according to Claim 33, in which one or more computer program objects of the analysis server ( 10 ) train the self-adapting neuron network with the received anonymized data as many times as necessary until a fully converged network state is established that adequately represents the data.

Method according to Claim 33, in which one or more computer program objects of the analysis server ( 10 ) present the data to the self-adapting neuron network about 100 to about 200 times.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) subject the data supplied to it, in an amount of about 10 gigabytes up to several terabytes, to data preprocessing and data compression.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) read the data supplied to it once during data preprocessing and transform the original features contained therein into purely numerical normalized features.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) compress the normalized numerical feature values of the data supplied to it during data compression, so that between two bits and about 8 to 10 bits of storage space are required per feature value.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) determine a data compression method to be used in dependence on a compression error acceptable for the respective analysis task, the compression method to be used and the compression rate to be achieved being determined depending on the different feature types (Boolean, numeric, nominal (textual)).

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) set the mean prediction error for numerical features between about 0.01 and about 0.1.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) perform a discretization of the normalized numerical feature values into a number of discrete intervals such that a mean discretization error (| value - discrete value |) is about 0.001.

The method of claim 42, wherein one or more computer program objects of the on-premises client computer ( 12 ) map the discretized numerical value to 2ⁿ - 2 interval indices as well as 'value not present' and 'invalid value', and store it in n bits of memory, where 1 < n ≤ 64.

A method according to claim 42 or 43, wherein one or more computer program objects of the on-premises client computer ( 12 ) set the interval division non-equidistantly, depending on the value distribution, wherein preferably the interval widths of the discrete subintervals are made particularly small in areas of high value density.

A method according to claim 42, 43 or 44, wherein one or more computer program objects of the on-premises client computer ( 12 ) establish, for numerical data whose value distribution follows the Gaussian or normal distribution with a mean value (m) and a standard deviation (s):
• about 64 intervals of width s / 64 in each of the ranges [m - s, m[ and [m, m + s[;
• about 32 intervals of width s / 32 in each of the ranges [m - 2s, m - s[ and [m + s, m + 2s[;
• about 16 intervals of width s / 16 in each of the ranges [m - 3s, m - 2s[ and [m + 2s, m + 3s[;
• about 8 intervals of width s / 8 in each of the ranges [m - 4s, m - 3s[ and [m + 3s, m + 4s[;
• about 4 intervals of width s / 4 in each of the ranges [m - 5s, m - 4s[ and [m + 4s, m + 5s[;
• about 2 intervals of width s / 2 in each of the ranges [m - 6s, m - 5s[ and [m + 5s, m + 6s[; and
• about 1 interval of infinite width for each of ]-∞, m - 6s[ and [m + 6s, ∞[.

Method according to one of Claims 40 to 45, in which one or more computer program objects of the on-site client computer ( 12 ) distribute the discrete intervals as a function of mean (m) and standard deviation (s) symmetrically about the mean m, where preferably - the interval [m - s / 64, m[ has the interval position 127, and - the interval [m, m + s / 64[ has the interval position 128, and where preferably - the numerical value assigned to each interval is the interval midpoint, and the interval positions 0 and 255 are reserved for invalid and missing values, respectively.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) first map numerical features to normalized features with mean m = 0.5 and a standard deviation of s = 0.25.
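
One way to obtain normalized features with mean 0.5 and standard deviation 0.25 is a linear rescaling. This is a sketch assuming that the mean and standard deviation of the original feature are already known; the function names are illustrative.

```python
def normalize(x, mean, std):
    """Linear map that sends `mean` to 0.5 and one standard deviation to 0.25."""
    return 0.5 + 0.25 * (x - mean) / std


def denormalize(z, mean, std):
    """Inverse map, usable on the client when results are decompressed."""
    return mean + (z - 0.5) * std / 0.25
```

Under this map, all values within two standard deviations of the mean fall into the range 0 to 1.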

Method according to one of Claims 40 to 47, in which one or more computer program objects of the on-site client computer ( 12 ), in order to determine the mean value (m) and the standard deviation (s) of the feature to be discretized, read a fraction of about 1 per thousand to about 10% of the records and calculate from it, for each numerical feature in the data, a provisional mean (m^(vorl)) and a provisional spread (s^(vorl)).
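
A minimal sketch of estimating the provisional mean and spread from a fraction of the records. The claim does not say how the fraction is selected; drawing a random sample, the fixed seed, and the name `provisional_stats` are assumptions made for this illustration.

```python
import random
import statistics


def provisional_stats(values, fraction=0.01, seed=0):
    """Estimate mean and spread of one feature from a fraction of its records.

    Missing entries (None) are skipped; at least two values are sampled,
    so the feature needs at least two valid values.
    """
    rng = random.Random(seed)
    valid = [v for v in values if v is not None]
    k = max(2, int(len(valid) * fraction))
    sample = rng.sample(valid, min(k, len(valid)))
    return statistics.fmean(sample), statistics.stdev(sample)
```

The estimates only have to be good enough to position the provisional intervals; the final discretization is derived later from the interval frequencies.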

Method according to one of Claims 40 to 48, in which one or more computer program objects of the on-site client computer ( 12 ) read all data records and carry out a provisional discretization for all numerical features based on the provisional mean (m^(vorl)) and the provisional spread (s^(vorl)).

Method according to one of Claims 40 to 49, in which one or more computer program objects of the on-site client computer ( 12 ) form 65532 equidistant intervals of width s^(vorl) / 256 centered around the provisional mean (m^(vorl)), two open end intervals ]-∞, m^(vorl) - 32766/256 · s^(vorl)[ and [m^(vorl) + 32766/256 · s^(vorl), ∞[, and two interval indices that denote 'value not present' and 'invalid numeric value', and record for all intervals the frequencies with which a value falls within the respective interval.
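
The provisional scheme uses 65532 + 2 + 2 = 65536 indices, which is exactly a 16-bit range. The sketch below assumes one possible index assignment (0 for the lower end interval, 1 to 65532 for the equidistant intervals, 65533 for the upper end interval, 65534 and 65535 for the two special cases); the claim does not fix this order, and the function names are illustrative.

```python
import math
from collections import Counter

HALF = 32766                         # equidistant intervals on each side of the mean
NOT_PRESENT, INVALID = 65534, 65535  # assumed positions of the two special indices


def provisional_index(x, m, s):
    """16-bit provisional interval index for a value x, given provisional m and s."""
    if x is None:
        return NOT_PRESENT
    if not isinstance(x, (int, float)) or math.isnan(x):
        return INVALID
    d = (x - m) / (s / 256)          # distance from the mean in interval widths
    if d < -HALF:
        return 0                     # lower open end interval
    if d >= HALF:
        return 2 * HALF + 1          # upper open end interval (65533)
    return math.floor(d) + HALF + 1  # equidistant intervals 1..65532


def provisional_histogram(values, m, s):
    """Frequencies with which values fall into each provisional interval."""
    return Counter(provisional_index(x, m, s) for x in values)
```

With m = 0 and s = 256 every equidistant interval has width 1, so the value 0.0 receives index 32767 and -0.5 receives index 32766.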

The method of claim 50, wherein one or more computer program objects of the on-premises client computer ( 12 ) analyze the frequencies of the interval occupancies and derive from them the form of the value distribution and the corresponding final discretization.

Method according to one of Claims 40 to 51, in which one or more computer program objects of the on-site client computer ( 12 ), for an at least approximately uniform distribution, form two open end intervals and between them 2ⁿ - 4 equidistant intervals, wherein the upper limit of the lower end interval is set to one of the provisional interval limits such that the lower end interval contains in total about 1/2ⁿ of all valid values, and the width of the 2ⁿ - 4 equidistant intervals is set to the smallest multiple of the provisional interval width for which the occupancy of the remaining upper end interval amounts to no more than 1/2ⁿ of all valid values, where 1 < n ≤ 64.
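
The selection rule for an approximately uniform distribution can be sketched on a list of provisional interval frequencies. This is an illustration under assumptions: `hist` holds the counts of valid values in contiguous, equally wide provisional intervals, "about 1/2ⁿ" is read as "cumulative count closest to 1/2ⁿ of the total", and the name `uniform_layout` is hypothetical. The sketch requires n ≥ 3 so that at least one equidistant interval exists.

```python
from itertools import accumulate


def uniform_layout(hist, n_bits):
    """Pick the final interval layout from provisional interval counts.

    Returns (first, k): the lower end interval covers the provisional
    intervals [0, first), followed by 2**n_bits - 4 intervals that are each
    k provisional intervals wide; everything beyond is the upper end interval.
    """
    if n_bits < 3:
        raise ValueError("this sketch needs at least one equidistant interval")
    total = sum(hist)
    target = total / 2 ** n_bits
    inner = 2 ** n_bits - 4
    cum = [0] + list(accumulate(hist))   # cum[i] = values below provisional limit i
    # Lower end: provisional limit whose cumulative count is closest to the target.
    first = min(range(len(cum)), key=lambda i: abs(cum[i] - target))
    # Width: smallest multiple that leaves at most `target` values in the upper end.
    k = 1
    while total - cum[min(first + inner * k, len(hist))] > target:
        k += 1
    return first, k
```

For 100 provisional intervals holding 10 values each and n = 3, the target occupancy is 125 values; the sketch returns the lower limit after 12 provisional intervals and a width of 19 provisional intervals, leaving 120 values in the upper end interval.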

Method according to one of claims 40 to 52, in which one or more computer program objects of the on-site client computer ( 12 ), for an at least approximately exponential distribution (density function d(x) = λ · e^(-λx)), define two open end intervals and between them 2ⁿ - 4 intervals of decreasing width, such that the upper limit (g₁) of the lower end interval is one of the provisional interval limits chosen such that the lower end interval contains in total about 1/(2ⁿ - 2) of all valid values, and the interval limit g_end is determined above which lie in total about 1/(2ⁿ - 2) of all valid values, where λ is determined from g₁ and g_end as λ ≔ ln(2ⁿ - 2) / (g_end - g₁), and the desired width (b) of the first intermediate interval is determined as b ≔ ln((2ⁿ - 3) / (2ⁿ - 2)) / λ, where 1 < n ≤ 64.

Method according to the preceding claim, in which one or more computer program objects of the on-site client computer ( 12 ) define as the next interval limit (g₂) an existing provisional interval limit such that the magnitude of the difference between the next interval limit (g₂) and the upper limit (g₁) of the lower end interval, less the desired width (b), i.e. |g₂ - g₁ - b|, is minimized, the desired width (b) of the first intermediate interval is multiplied by the factor e^(λb), and the subsequent intervals are calculated accordingly.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) set the provisional interval limit closest to the true mean as the center (m), and set the standard deviation (s) as close as possible to the true standard deviation such that s / 64 is a multiple of the existing interval width, the existing provisional intervals being combined into larger new intervals having a decreasing width distribution of s / 64, s / 32, s / 16, ....

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) store Boolean characteristics as different characteristic values ('first valid value', 'second valid value', 'value not available', 'invalid value').

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) split nominal features, for use in the SOM map analysis, into several Boolean or numeric 0/1 features.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) store nominal features as the position of the feature in the list of all valid values of that feature and keep two index values representing 'no value present' and 'value does not appear in the list of valid values'.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) index up to about 100, preferably up to about 60, different nominal values individually or as groups of individual nominal values.

A method according to the preceding claim, wherein about 10 to about 20 or 30, preferably about 15, of the most frequent values are selected for evaluation as single values, and all other values are combined under an index 'others' into a single value group.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) perform the following steps: - reading all original data records into the memory of the on-premises client computer ( 12 ) and, for each nominal feature in the original data, storing each newly occurring value in a dictionary file in which each occurring value is assigned an index number and its occurrence frequency is recorded; - as soon as a user-defined limit, e.g. 30, 1000, or 65534, of different values has been entered in the dictionary, - ending the insertion of new values, and - entering all subsequently occurring values that are not in the dictionary under the penultimate index position, which denotes 'other value', while the last index position denotes 'no value present'; and - replacing the nominal values with their index numbers from the dictionary, stored as 8-bit or 16-bit integers.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) perform the following step: - sorting the dictionary according to descending occurrence frequency of the entries.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) make a compression adapted to the actual value frequencies of each nominal feature using 4 bits or one byte of memory per nominal feature.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) carry out the following steps: - acquiring the 14 or 253 most frequent values as separate values; - assigning the index numbers 0 to 13 or 0 to 253 to these values in the dictionary, preferably according to their frequency; - assigning the index number 14 or 254 to all other values in the dictionary, including the previous index 'other value'; - assigning the index number 15 or 255 to the previous index 'value not available'; and - storing each of the index numbers in 4 bits or 1 byte of memory, respectively.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) perform a preliminary compression with multiple computer cores in parallel on partitioned data, communicate global statistics between individual threads, and perform the final compression in parallel.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) perform the anonymization of the data by performing the data compression on the on-premises client computer, transmitting the compressed data with the numeric interval indexes and the binary and nominal value indexes to the data analysis server ( 10 ), and storing on the on-site client computer ( 12 ) the decompression information, namely for numeric data the mean values, standard deviations, distribution form, minimum and maximum, and for binary and nominal data the value "dictionaries" that allow the actual value to be inferred from the value index.

Method according to one of the preceding method claims, wherein one or more computer program objects of the analysis server ( 10 ) create analyses, evaluations and SOM models based on the interval and value indices of the compressed data and send these results back to the on-premises client computer ( 12 ).

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) link the results to the decompression information, providing the results with the original information.
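
The division of labour described in the last three claims can be illustrated with a toy round trip for one nominal feature. This is a sketch, not the patented system: a simple frequency count stands in for the server-side SOM analysis, and the function names and the sorted dictionary order are assumptions.

```python
from collections import Counter


def client_compress(values):
    """Client side: build the dictionary and replace values by value indices.

    The dictionary is the confidential decompression information and stays
    on the client; only the indices are sent to the server.
    """
    dictionary = sorted(set(values))
    index_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index_of[v] for v in values]


def server_analyze(indices):
    """Server side: stand-in analysis that sees value indices only."""
    return Counter(indices).most_common()


def client_relink(result, dictionary):
    """Client side: attach the original values to the anonymized result."""
    return [(dictionary[i], n) for i, n in result]
```

For the input `["flu", "cold", "flu", "flu", "cold", "asthma"]` the server receives only `[2, 1, 2, 2, 1, 0]` and returns `[(2, 3), (1, 2), (0, 1)]`; the plain-text labels reappear only after `client_relink` is applied on the client.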

Computer program product containing one or more computer program objects for executing one or more of the aforementioned method steps on an on-site client computer ( 12 ).

Computer program product containing one or more computer program objects for executing one or more of the aforementioned method steps on an analysis server ( 10 ).