DE102018129366A1

DE102018129366A1 - System for processing and storing data requiring archiving

Info

Publication number: DE102018129366A1
Application number: DE102018129366.6A
Authority: DE
Inventors: Falk Borgmann; Michael Brünker
Original assignee: Deepshore GmbH
Current assignee: Deepshore GmbH
Priority date: 2018-11-21
Filing date: 2018-11-21
Publication date: 2020-05-28

Abstract

The present invention relates generally to the field of electronic document management and electronic archiving, and in particular to a computer-implemented method for the processing and audit-proof storage of data requiring archiving, in particular in the context of ECM and EIM applications.

Description

A document management system (DMS) is generally a system for the electronic management of documents. According to the definition of the "Association for Information and Image Management", a document management system is a computer system or software for storing, managing and tracking electronic documents. In the classic sense, these electronic documents originate from paper-based documents that were digitized with a scanner. Today, document management systems also manage a large number of other electronically created documents. A DMS can thus help companies, for example, to manage documents efficiently by making them faster to find.

Enterprise content management (ECM) systems go one step further. Although they perform tasks similar to those of a DMS, they are not limited to managing electronic documents. The aim of an ECM system is the storage, backup, provision and distribution of information, as well as the general consolidation of unstructured and structured information in one central location. For simplicity, ECM systems can be regarded as platforms superordinate to a DMS; a DMS thus forms a distinct, subordinate part of an ECM system.

In the meantime, however, a further form of ECM has emerged: enterprise information management (EIM). DMS and ECM form the basis on which enterprise information management rests. The difference from an ECM system is that an EIM system additionally supports collaboration and the general improvement of process design within the company. In this way, even complex workflows and work processes can be optimized with the help of an EIM system.

Although the terms differ in their definitions, they are mostly used synonymously in common usage and also in the context of the present application.

For many companies, enterprise content management and enterprise information management are among the most important business processes. This entails the requirement to capture and retain almost all data on the company's processes in an audit-proof manner and to use it in compliance with legal regulations, in particular deletion periods.
Today's ECM/EIM systems (which include an archiving component) all use a similar data model.
Metadata for an archive object (document) is managed in a relational database, while the archive object itself is stored on a CAS (content-addressed storage) or WORM (write once, read many) medium. Metadata can be technical in nature, such as the storage date or the deletion period, or business-related, such as the header data of an invoice. The archive software (or ECM/EIM software) itself controls the management of the meta-information within the system. It is important to note that, because of the relational database software, the search logic within ECM systems is always limited to "header data" (e.g. the date and number of an invoice), since high data volumes quickly push relational technology to its technical and economic performance limits. Content-level data analyses such as those in a DWH (data warehouse) are barely feasible or not feasible at all with these systems, or they rely on technologies and mechanisms that do not guarantee 100% reliable hits. To secure transactions within these systems, reliance is placed on the ACID properties (atomicity, consistency, isolation and durability) of the database, which, in combination with the archive storage, make filing secure and traceable and thus turn the system into an electronic archive.

To enable content-level data analyses despite the described limitations of conventional relational systems, special DWH or analysis systems exist today, e.g. SAP S/4HANA or the Teradata Analytics Platform, as well as many others that allow domain-specific data to be evaluated. The use of these systems, however, always involves a trade-off between cost, speed and data quality. As a rule, expensive in-memory approaches (S/4HANA) are used only for a small part of the total data and for a very limited time window. Historical data is often aggregated or offloaded for cost and performance reasons, which reduces its precision over time, since information is lost through compression. Common NoSQL stores, on the other hand, provide no technical ACID guarantee for the processing of individual transactions in high-volume processing. For analytical use cases it is also common to transport the data in a batch-oriented "best effort" mode, which affects both the timeliness of processing and the quality of the data itself.

The object of the present invention is therefore to provide a method for processing and storing data requiring archiving that avoids the disadvantages of the prior art.
In particular, there is a need for an ECM/EIM solution that combines three important aspects, namely

I. Audit-proof WORM storage of produced data in order to meet existing legal requirements for long-term archiving and to enable evaluation of tax-relevant data.
II. Making the data stored in this way usable for analytical use cases with cost-optimized NoSQL and MapReduce technologies, and in addition linking it with existing data sources and third-party technologies to achieve higher analytical quality and thus competitive advantages.
III. Both of the aforementioned requirements are to be merged and operated on a horizontally scalable and cost-effective IT infrastructure. This dispenses with elaborate batch processing and with redundant data feeds for an archive infrastructure and a DWH/analytics infrastructure (conceptual and technical combination of an archive and a DWH).

The object of the invention is achieved by providing a system for processing and storing data requiring archiving that comprises the services

a) Access Service,
b) Analytics Service,
c) Indexing Service,
d) Data Verification Service and
e) Storage Service,

wherein the services run in a distributed infrastructure and

- the Access Service is implemented by a web service,
- the Analytics Service by a cluster computing service,
- the Indexing Service by a NoSQL store,
- the Data Verification Service by a blockchain and
- the Storage Service by a distributed file system.

The cluster computing service is preferably implemented using the "MapReduce" programming model.
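For readers unfamiliar with the programming model, the following is a minimal single-process sketch of the map, shuffle and reduce phases. It is not taken from the application: the invoice records, their field names and the helper functions are assumptions chosen purely for illustration, and a real cluster computing service would distribute these phases across nodes.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical archived invoice records; the field names are illustrative only.
invoices = [
    {"customer": "A", "amount": 100},
    {"customer": "B", "amount": 40},
    {"customer": "A", "amount": 25},
]


def map_phase(record):
    # Map: emit one (key, value) pair per input record.
    return (record["customer"], record["amount"])


def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    # Reduce: fold the values of one key into a single result.
    return reduce(lambda a, b: a + b, values)


grouped = shuffle(map(map_phase, invoices))
totals = {key: reduce_phase(values) for key, values in grouped.items()}
```

Here `totals` holds the summed invoice amount per customer.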

The system according to the invention is generally suitable for archiving data in accordance with the legal requirements of the Federal Republic of Germany (GoBD). Beyond storage for the purpose of preserving evidential value, it also makes it possible to generate content-level analyses and evaluations on the data of an archive. This applies in particular to use in brick-and-mortar retail as well as in mail-order and online retail.
Advantageously, and in contrast to the technical implementations of today's archive and DWH systems, which exist side by side in silos, the system according to the invention does not consist of several coexisting data persistence layers but is a single logical entity within a company. Technically, the system is able to be used as a shared infrastructure by and for several companies at the same time. In doing so, it goes beyond the usual approach of operating technical instances within a single cloud infrastructure. Rather, the solution is technically designed so that it can run in different cloud infrastructures simultaneously (multi-cloud aware) but, if required, can also be operated locally (on premises) in the company's own data center. This new property removes the dependence on particular infrastructure suppliers. Unlike the offerings of all providers known today, the system according to the invention is not tied to dedicated hardware. In the context of the present invention, this concept is therefore referred to as an "Infrastructure-Agnostic-Service" (IAS for short).

The system according to the invention splits the data store provided in classic document management systems (usually WORM or CAS systems) into two services (verification and storage) and can optionally work with NoSQL databases, which implies dispensing with ACID transactions. This, however, makes it necessary to process transactions according to a strictly monotonic concept. Moreover, these rules must be applicable across both services. The system according to the invention provides, for the first time, a satisfactory answer to this challenge.

To achieve a strictly monotonic sequence, the services of the system according to the invention use the model of the "transaction bracket", which places the various technical services in a business process context. A transaction bracket is thus the business-level safeguard that makes it possible, within the distributed systems, to monitor an object (e.g. an invoice) throughout its entire life cycle in such a way that manipulation is ruled out. This novel concept of the transaction bracket can be divided into four parts (sub-services):

- Write
- Read Repair
- Housekeeping Repair
- Audit Trail

When writing (Write), the system according to the invention accepts an object via the above-mentioned Access Service. For this purpose, the following business steps are encapsulated on document receipt (archiving):

1. Accept a document (e.g. an invoice) and compute at least one SHA-256 hash of the content as well as a unique GUID (Globally Unique Identifier). Depending on the security level, a second, mathematically independent hash (not SHA-256) can also be used.
2. Write the GUID, the processing date and the hash to a blockchain, together with a "type", where a type is an event within the system, in this case "AddFile". Types are defined here as business events that the system documents. Such an event can be a storage operation, a deletion, a repair or a redistribution of raw data.

Example data model (insert):

 {
  "apiVersion": "v04",
  "type": "AddFile",
  "file": {
   "id": "<uuid>",
   "name": "<filename without path>",
   "sha256": "<sha256>",
   "size": "<bytes>"
  }
 }

3. (Simultaneously with step 2) Place the raw data in the connected file system and then verify the hash from the file system against the hash from the blockchain (after the block containing the new transaction has been verified by the cluster).
4. Return OK or not OK to the client.
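The four write steps can be illustrated with a short sketch. It is not the implementation from the application: the blockchain is replaced by a plain append-only list, the distributed file system by three in-memory dictionaries, and the names `ledger`, `replicas`, `write_document` and the `date` field are assumptions made for this example.

```python
import hashlib
import uuid
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins: an append-only event list instead of a
# blockchain, and three dictionaries instead of a distributed file system.
ledger = []
replicas = [{}, {}, {}]


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_document(name: str, content: bytes) -> bool:
    """Run write steps 1 to 4; return True for OK, False for not OK."""
    # Step 1: compute the SHA-256 hash of the content and assign a GUID.
    guid = str(uuid.uuid4())
    digest = sha256_hex(content)

    # Step 2: record an AddFile event. The "date" field name is an assumption;
    # the text only states that the processing date is written.
    ledger.append({
        "apiVersion": "v04",
        "type": "AddFile",
        "date": datetime.now(timezone.utc).isoformat(),
        "file": {"id": guid, "name": name, "sha256": digest, "size": len(content)},
    })

    # Step 3: store the raw data on every replica, then verify each stored
    # copy against the hash recorded in the ledger.
    for store in replicas:
        store[guid] = content
    recorded = ledger[-1]["file"]["sha256"]
    verified = all(sha256_hex(store[guid]) == recorded for store in replicas)

    # Step 4: report the result to the client.
    return verified
```

A call such as `write_document("invoice.pdf", b"...")` appends one event to `ledger` and leaves one copy of the content on each replica. The sketch runs steps 2 and 3 sequentially and omits the cluster's block verification, which the text requires before the hashes are compared.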

A "blockchain" is understood to mean an ordered data structure comprising a plurality of data blocks chained to one another. In particular, a blockchain is understood to mean a database whose integrity, i.e. protection against subsequent manipulation, is secured by storing a check value, such as a hash value, of the preceding data record in the respective subsequent data record. The check value is assigned to the content of the preceding data record and uniquely characterizes it. If the content of the preceding data record is changed, it no longer matches the check value, which makes the change apparent. In known blockchain structures, for example, each block of the blockchain is uniquely identified by a hash value and references a predecessor block in the blockchain, whose hash value it contains. For examples of a blockchain, see https://en.wikipedia.org/wiki/Block_chain_(database) and "Mastering Bitcoin", Chapter 7, The Blockchain, page 161 ff. The concept of blockchains was described, for example, in 2008 in a white paper published under the pseudonym Satoshi Nakamoto in the context of the cryptocurrency Bitcoin ("Bitcoin: A Peer-to-Peer Electronic Cash System", https://bitcoin.org/bitcoin.pdf). In this exemplary embodiment, each block of the blockchain contains in its header the hash of the entire previous block header. The order of the blocks is thus unambiguously defined and a chain structure is created. Chaining the individual blocks together in this way ensures that previous blocks cannot be modified afterwards without also modifying all subsequent blocks.
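The chaining principle described here can be made concrete with a small sketch. It is a simplification under stated assumptions: each block stores the hash of the entire preceding block rather than only of its header, there is no consensus or cluster verification, and the names `block_hash`, `append_block` and `chain_is_valid` are hypothetical.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    # Hash a canonical JSON encoding of the block (an illustrative choice).
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: list, payload: dict) -> None:
    # The first block has no predecessor and carries an all-zero hash.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "payload": payload})


def chain_is_valid(chain: list) -> bool:
    # Every block must carry the hash of its predecessor.
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for event_type in ("AddFile", "RepairFile", "AddFile"):
    append_block(chain, {"type": event_type})
```

Changing the payload of an earlier block afterwards changes its hash, so the `prev_hash` stored in the following block no longer matches and `chain_is_valid` returns `False` unless every later block is rewritten as well.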

The encapsulation procedure described above for document receipt ensures that a stored data record has been filed in the distributed system unchanged and traceably, and that at the time of filing there are at least n copies in the cluster (n being a natural number > 1). The number n must be at least 3, which is related to the security concept (Separation of Power) that will be explained later. Owing to the immutable nature of the blockchain, the raw data record in the file system can now no longer be changed (manipulated) without this being noticed.
The second component of the transaction bracket is Read Repair, i.e. the retrieval of a document, which in the system according to the invention simultaneously constitutes a data validation and possibly a repair. Normally, the sequence is as follows:

1. A client requests a document by means of a GUID.
2. The Access Service looks up the hash matching the GUID in the blockchain and compares it with the raw data in the file system.
3. If the hash from the blockchain equals the hash calculated from the file store, the document is delivered. If there is no correct copy of the raw data record in the storage, the client is informed accordingly.

If the Access Service finds a data record whose hash does not match the hash value in the blockchain, the system itself replaces the corrupt or even missing data record with a correct version. In the context of the present invention, this mechanism is referred to as Read Repair. To meet the legal requirement of traceability, the system writes a repair event to the blockchain as a separate event. This creates, for each archive object, an audit trail within the blockchain that cannot be altered. In this way, an auditor can trace when and which document may have been defective in the file store and whether it was repaired. The audit trail is an important part of the solution for achieving full compliance in terms of traceability.

Example data model (repair):

 {
  "apiVersion": "v1",
  "type": "RepairFile",
  "file": {
   "id": "<uuid>",
   "name": "<filename without path>",
   "sha256": "<sha256>",
   "size": "<bytes>"
  },
  "repairInfo": {
   "okay": "<storageName where file was healthy>",
   "notOkay": "<storageName where file was broken>"
  }
 }

The two mechanisms described are recognized by the system as related events for a specific object and can, if desired, be displayed accordingly via the Access Service.
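The Read Repair sequence and the resulting repair event can be sketched as follows. This is a hedged illustration rather than the described system: the blockchain is modelled as a list of event dictionaries, the storages as named in-memory dictionaries, and `read_with_repair` is a hypothetical function name.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_with_repair(guid, ledger, replicas):
    """Return the document content, repairing bad copies on the way.

    ledger:   list of event dicts (stand-in for the blockchain)
    replicas: dict mapping a storage name to a dict {guid: bytes}
    Returns None if the GUID is unknown or no healthy copy exists.
    """
    # Look up the hash that was recorded when the document was written.
    add_event = next(
        (e for e in ledger
         if e["type"] == "AddFile" and e["file"]["id"] == guid),
        None,
    )
    if add_event is None:
        return None
    expected = add_event["file"]["sha256"]

    # Split the storages into those with a correct copy and all others.
    healthy = [name for name, store in replicas.items()
               if guid in store and sha256_hex(store[guid]) == expected]
    broken = [name for name in replicas if name not in healthy]
    if not healthy:
        return None  # the client is told that no correct copy exists

    content = replicas[healthy[0]][guid]
    for name in broken:
        # Replace the corrupt or missing copy and document the repair.
        replicas[name][guid] = content
        ledger.append({
            "apiVersion": "v1",
            "type": "RepairFile",
            "file": add_event["file"],
            "repairInfo": {"okay": healthy[0], "notOkay": name},
        })
    return content
```

Each repaired storage yields its own `RepairFile` event, so the events for one GUID together form the audit trail for that object.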

As the fourth component of the transaction bracket, the system according to the invention uses Housekeeping Repair. The mechanism is the same as for Read Repair, but it runs on a schedule and permanently as a background task. Put simply, it is a job that continuously compares the hash values in the blockchain with the raw data in the file store to ensure that no data is lost or corrupted unnoticed. If defective data is discovered, the system repairs itself, exactly as in the Read Repair described above, including a corresponding audit trail per document within the blockchain.
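One pass of such a background job might look like the sketch below. It is only an assumption-laden illustration: `housekeeping_sweep` is a hypothetical name, the blockchain is a list of event dictionaries, the storages are in-memory dictionaries, and the scheduling itself (for example a timer or an external job scheduler) is left out.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def housekeeping_sweep(ledger, replicas):
    """Check every archived object once; return the number of repairs made."""
    repairs = 0
    # Snapshot the AddFile events so that RepairFile events appended during
    # the sweep do not affect the iteration.
    add_events = [e for e in ledger if e["type"] == "AddFile"]
    for event in add_events:
        guid = event["file"]["id"]
        expected = event["file"]["sha256"]

        # Find one storage that still holds a correct copy.
        good = next(
            (name for name, store in replicas.items()
             if guid in store and sha256_hex(store[guid]) == expected),
            None,
        )
        if good is None:
            continue  # no healthy copy left to repair from

        for name, store in replicas.items():
            if guid not in store or sha256_hex(store[guid]) != expected:
                # Overwrite the bad copy and record the repair event.
                store[guid] = replicas[good][guid]
                ledger.append({
                    "apiVersion": "v1",
                    "type": "RepairFile",
                    "file": event["file"],
                    "repairInfo": {"okay": good, "notOkay": name},
                })
                repairs += 1
    return repairs
```

Running the sweep a second time over an already repaired store finds nothing to fix and returns 0.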

Beyond documenting repairs, the audit trail also records other relevant events. For example, the complete deletion of raw data after the legal retention period has expired is likewise anchored as a separate event within the blockchain (so-called logical deletion). In this way, the system according to the invention holds anonymized, GDPR-compliant audit data in an unchangeable and tamper-proof store, the blockchain. Until now, only systems with attached CAS/WORM storage systems could guarantee this in this form.

The concept of the transaction bracket thus makes it possible in principle to manage data in a compliant manner in a distributed system without a database. This is achieved by orchestrating a redundant file store (Storage Service) and a blockchain (Data Verification Service). However, the overall system is also meant to support data analysis, i.e. the classic DWH (data warehouse) functions, which makes it necessary to provide this capability for existing raw data in a database.

The concept of the transaction bracket serves to establish a state that, from a compliance perspective, is comparable to the use of WORM storage. Beyond the purely technical filing of documents, however, audit-proof data storage must also meet certain special legal requirements concerning the physical or geographical storage of archive objects. For example, under the current legal situation a German company that operates its own sites in China may not move data relevant to the Chinese tax authorities out of the country. All raw data must therefore remain within Chinese territory, yet at the same time be taken into account in Germany as part of the annual financial statements and be managed and archived "properly" under German law.

To resolve this dilemma, the present invention provides the "Floating Data Domain model" as part of the system according to the invention. This model translates the legal requirements for data storage into technical operation: it allows the storage of raw data to be controlled in a targeted manner on the basis of defined rules. For example, it can be ensured that a certain number of different storage classes (e.g. twice cloud plus once on premises [local]) is used automatically for the corresponding raw data set while observing defined geolocations (e.g. "may only be operated on the European continent").
A storage class is individually configurable and can be used any number of times in different use cases.
The service of the system according to the invention builds on the geolocation concepts of cloud providers such as Amazon, in which the physical location of instances is known or can be determined. If a company uses, for example, the AWS cloud (in Germany and the USA), the AZURE cloud (in China) and its own data center (in Germany), the service is initially told which system components of the cluster are available in which geolocation.
In a second step, storage classes are defined for each use case.

Example:

Storage class Default - no restriction on data storage
Storage class 1 - data may only be stored in China
Storage class 2 - data may only be stored on the European mainland
Storage class n - ...

When an application stores data in the system according to the invention, the metadata of the archive objects states which storage class applies. If no storage class is named explicitly, the system always selects the class "Default", for which the distribution is not subject to any particular restriction. If, for example, an accounting system in China delivers data with the note "Storage class 1" in the metadata of the object to be archived, the system according to the invention stores this data exclusively in the cluster instances provided by the AZURE cloud.
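How a storage class might be resolved to a set of permitted instances can be sketched as follows. Everything here is an illustrative assumption rather than the claimed implementation: the `Instance` type, the region codes, the metadata key `StorageClass` and the rule table `STORAGE_CLASSES` are hypothetical, and the European mainland is abbreviated to the single region that occurs in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str      # hypothetical instance identifier
    provider: str  # e.g. "AWS", "AZURE", "on-premises"
    region: str    # geolocation announced to the service (assumed country code)


# Hypothetical rule table: storage class -> allowed regions (None = unrestricted).
STORAGE_CLASSES = {
    "Default": None,
    "1": {"CN"},  # data may only be stored in China
    "2": {"DE"},  # European mainland, reduced to the one region used here
}


def eligible_instances(metadata: dict, instances: list) -> list:
    """Return the instances that may hold an object with the given metadata."""
    # A missing storage class falls back to "Default", as described in the text.
    storage_class = metadata.get("StorageClass", "Default")
    allowed = STORAGE_CLASSES[storage_class]
    if allowed is None:
        return list(instances)
    return [i for i in instances if i.region in allowed]
```

With a cluster made up of AWS instances in Germany and the USA, an AZURE instance in China and an on-premises instance in Germany, an object tagged with storage class 1 would resolve to the AZURE instance only.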

Another important task of the Floating Data Domain model is to ensure, during so-called storage provisioning (the introduction of new cluster instances within a data center), that data is redistributed correctly in accordance with the storage class rules. If, for example, a new storage instance is activated in a European data center, the cluster must not automatically distribute Chinese data marked with "Storage class 1" to this instance.
If this example is extended to a scenario in which an existing instance is to be replaced by a new one, the system automatically handles the "migration" of the data from the "old" instance to the "new" one. This permanent data "flow" (hence: floating data) relies on the standard replication mechanisms of the distributed software components in combination with the transaction bracket introduced above, and ensures that no data is lost during the transfer. Laborious archive migrations thus become a thing of the past, because proof of unchanged completeness is effectively a standard component of the system according to the invention, which has not previously existed in the form described here.
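The verified hand-over from an "old" to a "new" instance can be illustrated with a short sketch. It covers only the completeness check, not the storage class rules, and it assumes that dictionaries stand in for the two instances and for the hash values held in the blockchain; the function name `migrate_instance` is hypothetical.

```python
import hashlib


def migrate_instance(old: dict, new: dict, ledger: dict) -> list:
    """Copy every object from `old` to `new` and verify it against the ledger.

    old, new: file id -> bytes (assumed stand-ins for two storage instances)
    ledger:   file id -> expected SHA-256 hex digest (assumed blockchain view)
    Returns the ids whose copy could not be verified; `old` should only be
    retired when this list is empty.
    """
    failed = []
    for file_id, data in old.items():
        new[file_id] = data
        # Verify the copy on the target, not the source, against the ledger.
        if hashlib.sha256(new[file_id]).hexdigest() != ledger.get(file_id):
            failed.append(file_id)
    return failed
```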

The Floating Data Domain model thus serves as a data management service within the system according to the invention that organizes the distribution of the raw data. By means of storage classes, geo zones of different cloud providers can be used together. Together with the transaction bracket, it can further be ensured that no data is changed or lost during replication.

Another important aspect of the system according to the invention is the chosen IT infrastructure and its organization. To make the concept fully usable in a compliance environment, four organizational challenges in particular must be considered in relation to one another:

1. Determining the number of blockchain instances
2. Determining the number of file store instances
3. Determining the number of disjoint data centers
4. Determining the number of administrators and the distribution of rights for system configuration

As explained earlier, the "Infrastructure-Agnostic Service concept" forms a key pillar of the system according to the invention.
By using Docker containers in a Kubernetes environment, the required software components can, for example, be operated at any location in the world and on different cloud services.
Docker is a well-known open-source software for isolating applications by means of container virtualization.
Kubernetes is a likewise well-known open-source system for automating the deployment, scaling and management of containerized applications; it was originally designed by Google and donated to the Cloud Native Computing Foundation. It aims to provide a platform for the automated deployment, scaling and maintenance of application containers on distributed hosts. Kubernetes supports a range of container tools, including Docker.

This distribution of instances not only brings advantages for data storage and the inherent scaling options, it is also of fundamental importance for data security. During implementation, care must be taken that there are at least 3 different administrative accesses, each to a subset of the cluster instances. These accesses must under no circumstances have simultaneous access to identical infrastructure components. If a company uses, for example, computing capacity from the AWS cloud and the AZURE cloud and also operates its own data center, the administrators of each cloud or data center may administer only their own assigned area.

Example:

Admin 1 → AWS cloud
Admin 2 → AZURE cloud
Admin 3 → On premises

Note that several administrators could nevertheless work simultaneously within one cloud infrastructure, as long as their access to individual instances does not overlap. This would be conceivable, for example, when different geo zones are administered.
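The non-overlap rule lends itself to a simple automated check. The following is a sketch under the assumption that administrative access is available as a mapping from administrator to the set of instances they may administer; the function name `overlapping_access` and the instance names in the usage note are hypothetical.

```python
from itertools import combinations


def overlapping_access(assignments: dict) -> list:
    """Return (admin, admin, shared instances) for every pair that overlaps.

    assignments: administrator name -> set of instance names they may administer
    An empty result means the separation rule holds.
    """
    conflicts = []
    for (a, inst_a), (b, inst_b) in combinations(assignments.items(), 2):
        shared = inst_a & inst_b
        if shared:
            conflicts.append((a, b, sorted(shared)))
    return conflicts
```

Two administrators working in the same cloud pass the check as long as their instance sets are disjoint, for example one managing `azure-1` and the other `azure-2`.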

A sensible number of instances of the Verification Service (blockchain) is based on the theoretical foundations of the Byzantine fault model. According to the rules of a Byzantine fault tolerance model, a distributed system cannot be compromised as long as more than two thirds of the nodes participating in a cluster work correctly. Conversely, fewer than one third of the instances involved may be hostile. In line with this model, at least 4, 7 or 10 instances of the blockchain (or more, following the mathematical model) should therefore be operated. In the context of the present invention, the configuration is to be chosen such that at least 50% of the participating mining instances can compute and validate a new block. Departing from pure theory, the model used according to the invention combines the mathematically optimal number of instances "Z" according to the Byzantine model with the organizational findings on the administrative separation of duties. The optimal number of instances is therefore Z+1 if Z is an even number, or exactly Z if Z is an odd number.
Taking Z=4 as an example, one arrives at 5 instances in the cluster (Z+1 → 4+1=5). Combined with the 50% required to mine a block successfully, this yields the number 3, since 50% of 5 is 2.5 and only whole instances count. This is exactly the minimum number of instances that must be intact and able to "see" one another in order to process new data successfully. This combination of a specific number of instances and the configuration of the mining quorum in the cluster is important for avoiding a blockchain fork in case individual network segments become separated from one another in the event of a fault. A fork always entails a theoretical risk of data loss, which is of course unacceptable in an archive system.
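The arithmetic of this sizing rule can be written out in a few lines. The function names are hypothetical, and reading "at least 50%" as rounding up to whole instances is an assumption that matches the result of 3 given for 5 instances.

```python
import math


def bft_minimum(faulty: int) -> int:
    """Smallest cluster size tolerating `faulty` Byzantine nodes (3f + 1)."""
    return 3 * faulty + 1


def instance_count(z: int) -> int:
    """Apply the rule from the text: Z + 1 if Z is even, Z if Z is odd."""
    return z + 1 if z % 2 == 0 else z


def mining_quorum(instances: int) -> int:
    """Smallest whole number of instances that is at least 50% of the cluster."""
    return math.ceil(instances / 2)
```

For one, two and three tolerated faulty nodes, `bft_minimum` gives 4, 7 and 10. With Z=4, `instance_count` gives 5 and `mining_quorum(5)` gives 3. Because the instance count is always odd, the quorum is a strict majority, so two separated network segments cannot both reach it, which is what prevents the fork.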

In the preceding example with 5 blockchain instances and 3 administrators, the following system constellation would result:

Example:

Admin 1 → AWS → 1 blockchain instance
Admin 2 → AZURE → 2 blockchain instances
Admin 3 → On premises → 2 blockchain instances

In contrast to the technical verification of new blocks, a configuration change to the blockchain itself should only be possible with an administrator quorum x, where 1 > x > 0.5.
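One possible reading of this quorum is that x is the share of administrators who must approve a configuration change. The sketch below rests on that assumption; the function name `approvals_required` is hypothetical, and `Fraction` is used only to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def approvals_required(admins: int, x: Fraction) -> int:
    """Number of administrator approvals needed for a quorum x with 0.5 < x < 1."""
    if not Fraction(1, 2) < x < 1:
        raise ValueError("quorum x must satisfy 0.5 < x < 1")
    # Round up so that the approving share is never below x.
    return math.ceil(admins * x)
```

With 3 administrators and x = 2/3, two approvals are needed, so no single administrator can change the configuration alone.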

In the last step, an optimal number of file store instances is determined. In line with the data center entities, it is sufficient in the minimal case to keep the raw data at one instance per entity. In our example, 3 file stores would therefore be operated. The number of existing copies equally reflects the level of protection against data loss. It makes sense to always increase the number of copies by a factor of 2. In the preceding example this means 3, 6 or even 9 copies of the raw data, distributed evenly across the data center entities.

Example:

Admin 1 → AWS → 1 blockchain instance → 1 file store
Admin 2 → AZURE → 2 blockchain instances → 1 file store
Admin 3 → On premises → 2 blockchain instances → 1 file store

In the context of the present invention, the concept described is called "Separation of Power". It offers protection against several potential threats:

1.) The failure of a node or of an entire data center does not impair the overall system.
2.) A single administrator cannot manipulate or inadvertently impair the system.

To manipulate the blockchain (Verification Service) even in theory, an attacker would have to bring at least two administrators and/or the cluster instances they administer under their control. Compared with today's archive systems based on relational technologies, this approach poses a significantly higher hurdle against attacks.

Another important function of the system according to the invention is to make the full depth of the data stored in the manner described above usable for analytical use cases. Cost-optimized NoSQL and MapReduce technologies are used for this, without having to give up, in contrast to the systems of the prior art, the depth of information of the original documents in archive quality.

The system according to the invention is able to resolve this dilemma through a methodical combination of different services. It is thus possible for the first time to use the maximum data quality of an archive for analytical purposes as well, without having to rely on reduced and inaccurate data sets.
Through an extension of the transaction bracket model, the system can also parse the content of structured data when receiving it and write the results to a database (Indexing Service). This processing is not necessarily tied to establishing the archive status of the stored information; that is, an index layer can also be built at a later point in time, independently of the audit-proof storage of the raw data. What matters is only that the Indexing Service is based on the raw data in the archived state and is not built beforehand, since only then can it be ensured that the extracted information actually rests on an immutable data set.

To achieve this, there are two essential differences from conventional data lake or DWH applications. These paradigms are combined in a model referred to in the context of the present invention as the "CDE model". The CDE model ensures that the data of the Indexing Service corresponds exactly to the information content of the "archive". With the help of the Verification Service, it can also be ensured in a distributed DB system that data changes do not occur unnoticed, and manipulations could even be corrected promptly. The CDE model is explained below:

First, it is mandatory in the system according to the invention to hand data over to the Indexing Service in a "read after write" procedure. This ensures that every data record has really been written before the responsible process may assume that the write operation completed correctly and in full. The sequence can look as follows:
- 1. Parse the data
- 2. Build the indexing structure
- 3. Write the structure to the Indexing Service (database)
- 4. Read the indexing structure after the write operation has finished
- 5. Compare the result of the read operation with the supplied input
- 6. If equal → finish the transaction successfully; if not equal → repeat the transaction
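The six steps above can be sketched as a small routine. This is an illustration under assumptions, not the implementation of the Indexing Service: a dictionary stands in for the database, the input format `key=value;key=value` is invented for the example, and the retry limit `max_attempts` is an addition, since the text only says that the transaction is repeated.

```python
def write_with_read_back(store: dict, key: str, raw: str, max_attempts: int = 3) -> bool:
    """Write an index entry and confirm it by reading it back (steps 1 to 6)."""
    # Steps 1 and 2: parse the raw data and build the indexing structure.
    structure = dict(field.split("=", 1) for field in raw.split(";") if field)
    for _ in range(max_attempts):
        store[key] = dict(structure)  # step 3: write to the (assumed) database
        read_back = store.get(key)    # step 4: read after the write has finished
        if read_back == structure:    # step 5: compare with the supplied input
            return True               # step 6: equal, finish successfully
    return False                      # still unequal after all repeats
```

A call such as `write_with_read_back(index, "doc-1", "customer=4711;amount=19.99")` returns `True` once the value read back matches what was written.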

The "read after write" operation also establishes a logical link between the raw data in the Storage Service/Verification Service and the Indexing Service. For each archive record there is a primary key consisting of a combination of hash and GUID (per document), which is valid across all processes in the system according to the invention. The system is therefore able at any time to restore individual data records, or even the entire data set of the NoSQL store, from the raw data in the archive store. The transaction bracket procedure outlined above ensures that the data basis is always in a complete and unchangeable state.

Once the desired index data has been stored completely and correctly in the Indexing Service, the only remaining question is how deliberate or accidental manipulation of the database can be ruled out. The second building block of the CDE model was developed to achieve this.

The database is considered at a point in time t. By time t, n data records have already been written to the database. All n data records are now selected at time t and the result is converted into a hash value. This yields the hash value of all aggregated database entries in data slice 1 at time t. This hash value is stored in a block of the blockchain as a separate transaction and is thus protected by the transaction bracket.

In the next step the procedure is repeated, with the runtime environment using a random number generator to choose, from a fixed interval, exactly when the following point in time t+1 occurs. Thus t+1 cannot be predicted exactly by a human. In the next data slice 2, i.e. from t to t+1, the first new transaction n+1 and x further transactions were written that were not part of the first data slice 1 at time t. All new transaction data, i.e. from
$$[(n+1)]_{t+1} \quad \text{with } n, t \in \mathbb{N}$$
to
$$[(n+1)]_{t+1} + x_{t+1} \quad \text{with } n, t, x \in \mathbb{N},$$
are then selected and the result is converted into a further hash value, which is likewise stored in the blockchain. In addition to the hash values, the blockchain transaction also contains a special Type="DBHash" together with the exact time-slice information.
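The slicing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the invention: the record fields `id` and `written_at`, the canonical JSON serialization, the use of SHA-256 for the slice hash, the parameters `min_gap` and `max_gap`, and the plain Python list standing in for the blockchain are all assumptions made for the example.

```python
import hashlib
import json
import secrets


def slice_digest(records):
    """Return one SHA-256 digest over all records of a data slice.

    Records are sorted by id and serialized canonically so that the same
    set of records always yields the same digest.
    """
    h = hashlib.sha256()
    for rec in sorted(records, key=lambda r: r["id"]):
        h.update(json.dumps(rec, sort_keys=True, separators=(",", ":")).encode("utf-8"))
        h.update(b"\n")  # record separator
    return h.hexdigest()


def next_cut(t, min_gap, max_gap):
    """Draw the next cut time t+1 at random from [t + min_gap, t + max_gap]."""
    return t + min_gap + secrets.randbelow(max_gap - min_gap + 1)


def seal_slice(db, chain, slice_no, start, end):
    """Hash all records written in (start, end] and append a DBHash entry."""
    records = [r for r in db if start < r["written_at"] <= end]
    tx = {
        "type": "DBHash",        # transaction type named in the description
        "slice": slice_no,       # time-slice information (assumed layout)
        "from": start,
        "to": end,
        "hash": slice_digest(records),
    }
    chain.append(tx)
    return tx


# Usage: slice 1 covers everything up to t, slice 2 covers (t, t+1].
db = [
    {"id": 1, "written_at": 10, "fields": {"invoice": "A-1"}},
    {"id": 2, "written_at": 20, "fields": {"invoice": "A-2"}},
    {"id": 3, "written_at": 35, "fields": {"invoice": "A-3"}},
]
chain = []
t = 20
first = seal_slice(db, chain, 1, 0, t)
t_next = next_cut(t, min_gap=10, max_gap=30)  # somewhere in [30, 50]
second = seal_slice(db, chain, 2, t, t_next)
```

Drawing the cut time from the `secrets` module rather than from a seeded pseudo-random generator is one way to keep t+1 unpredictable; the description itself only requires a random generator and a fixed interval.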

This new hash value no. 2 and hash value no. 1 are stored in the blockchain (Verification Service) and now reflect, respectively, the exact state of the data at time t and of all data newly added after time t up to time t+1. From now on, the raw data from the database can be compared against the value in the blockchain at any time; this comparison runs time-controlled in the background, at random, for each data slice. If the comparison reveals a deviation, the database contents of the Indexing Service can be restored from the raw data of the Storage Service.
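The comparison and restoration step could look roughly like the sketch below. It is illustrative only: the transaction layout (`from`, `to`, `hash`), the record fields, the list of raw JSON documents standing in for the Storage Service, and the `build_index_record` callback are hypothetical names introduced for this example.

```python
import hashlib
import hmac
import json


def slice_digest(records):
    """Canonical SHA-256 digest over all records of one data slice."""
    h = hashlib.sha256()
    for rec in sorted(records, key=lambda r: r["id"]):
        h.update(json.dumps(rec, sort_keys=True, separators=(",", ":")).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def in_slice(record, tx):
    """True if the record was written inside the slice covered by tx."""
    return tx["from"] < record["written_at"] <= tx["to"]


def verify_slice(db, tx):
    """Recompute the slice digest from the database and compare it with the chain."""
    current = slice_digest([r for r in db if in_slice(r, tx)])
    return hmac.compare_digest(current, tx["hash"])


def repair_slice(db, raw_store, tx, build_index_record):
    """Rebuild the slice from raw data, then check it against the chain again."""
    kept = [r for r in db if not in_slice(r, tx)]
    rebuilt = [build_index_record(raw) for raw in raw_store]
    db[:] = kept + [r for r in rebuilt if in_slice(r, tx)]
    return verify_slice(db, tx)
```

A background job would call `verify_slice` for a randomly chosen slice on each timer tick and fall back to `repair_slice` only when the digests differ.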

The system according to the invention uses a distributed infrastructure, which can include local computers, local networks and one or more clouds.
Cloud computing is a service-delivery model that enables convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines and services) that can be rapidly provisioned and released with minimal management effort and minimal interaction with the service provider. The following service models are particularly suitable for cloud-based use of the system according to the invention:

Software as a Service (SaaS): The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications can be accessed from various client devices through a thin-client interface such as a web browser (e.g. web-based e-mail). With the exception of limited user-specific application configuration settings, the consumer neither manages nor controls the underlying cloud infrastructure, such as the network, servers, operating systems, storage or even individual application functions.
Platform as a Service (PaaS): The capability provided to the consumer is to deploy onto the cloud infrastructure applications created or acquired by the user that were built using programming languages and tools supported by the provider. The consumer neither manages nor controls the underlying infrastructure, such as networks, servers, operating systems or storage, but has control over the deployed applications and possibly over configurations of the application hosting environment.
Infrastructure as a Service (IaaS): The capability provided to the consumer is to provision processing, storage, networks and other fundamental computing resources, on which the consumer can deploy and run arbitrary software such as operating systems and applications. The consumer neither manages nor controls the underlying cloud infrastructure, but has control over systems and units (e.g. operating systems, storage, deployed applications, etc.) and possibly limited control over selected network components (e.g. host firewalls).

In addition to the system according to the invention,

• a method for processing and storing data requiring archiving, which comprises the services
a) Access Service,
b) Analytics Service,
c) Indexing Service,
d) Data Verification Service and
e) Storage Service,
wherein the services run in a distributed infrastructure and

- the Access Service is implemented by a web service,
- the Analytics Service by a cluster computing service,
- the Indexing Service by a NoSQL store,
- the Data Verification Service by a blockchain and
- the Storage Service by a file store,

and

• a computer program product that contains, on a non-volatile (non-transitory) computer-readable medium, computer-readable program instructions for executing a method according to the invention, which cause a computer or a group of computers, for example in a cloud, to carry out the steps of the method according to the invention,

are further subjects of the present invention.

The system according to the invention has at least one main memory, at least one processor system communicatively connected to the at least one main memory, at least one input device and one output device and, in particular, a file system distributed in a cloud architecture. Devices for data, voice, text and/or image input and output are known from the prior art and familiar to the person of ordinary skill in the art.

The computer program product according to the invention may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention. The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable CD-ROM, a DVD, a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g. light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to the respective data processing devices, or to an external computer or external storage device, via a network such as the Internet, a LAN, a WAN and/or a wireless network. The network may comprise copper transmission cables, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each data processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective device. Computer-readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, such as JavaScript, an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
However, the embodiments of the invention are not described with reference to any particular programming language. It is understood that a variety of programming languages can be used to implement the various embodiments of the invention described herein, and that references to a specific language merely represent exemplary embodiments of the invention.
Finally, it should be noted that the language used in the description was chosen primarily for readability and ease of understanding, and not to limit the subject matter of the invention. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative and not to limit the scope of the invention, which is set out in the following claims.
The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (e.g. through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as programmable logic circuitry, field-programmable gate arrays (FPGAs) or programmable logic arrays (PLAs) may execute the computer-readable program instructions by using state information of the computer-readable program instructions to personalize the electronic circuitry in order to perform aspects of the present invention. Embodiments of the present invention are described herein, among other things, by means of flowchart illustrations of methods, devices (systems) and computer program products. It will be understood that each flowchart, or part of a flowchart, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer or other programmable data processing device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing device, create means for implementing the functions/acts specified in the flowchart. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing device and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions that implement aspects of the functions or acts specified in the flowchart.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing device or other device to cause a series of operational steps to be performed on the computer, other programmable device or other device, such that the instructions that execute on the computer, other programmable device or other device implement the functions/acts specified in the flowchart. The flowcharts in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each part of the flowcharts may represent a module, segment or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative embodiments, the functions noted in the flowchart may occur in an order other than that shown in the figures. For example, two sequences shown in succession may in fact be executed substantially concurrently, or the sequences may sometimes be executed in the reverse order, depending on the functionality involved.
It should also be noted that each part of the flowchart illustrations, and combinations of parts of the flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or combinations thereof, or by combinations of special-purpose hardware and computer instructions.
The terminology used herein serves only to describe particular embodiments of the present invention and is not intended to limit the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will likewise be understood that the verbs "has" and/or "having", when used in this patent application, specify the presence of the stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, element components and/or groups thereof.
The corresponding structures, materials, acts and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material or act for performing the function in combination with other claimed elements as specifically claimed. The description of embodiments of the present invention has been presented for purposes of illustration and explanation and is not intended to be exhaustive or limited to the disclosure in the form described. Those skilled in the art will recognize that numerous modifications and variations are possible without departing from the scope of the invention. The embodiments of the present invention were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others skilled in the art to understand the invention and various embodiments thereof with various modifications as are suited to the particular use contemplated.

A further subject of the present invention is a computer data signal (data carrier signal) embodied in a carrier wave, the computer data signal including any embodiment of a computer program product or other data combinations described herein. The computer data signal is a product that is presented in a tangible medium and modulated or otherwise encoded in a carrier wave that is transmitted according to a suitable transmission method.

List of reference symbols

Fig. 1: Microservice architecture
Fig. 2: Insert flow
Fig. 3: Retrieval
Fig. 4: Client view of the audit trail (insert & repair)
Fig. 5: Database hash at time t
Fig. 6: Database hashes at time t+1

Claims

System for processing and storing data requiring archiving, comprising the services a) Access Service, b) Analytics Service, c) Indexing Service, d) Data Verification Service and e) Storage Service, wherein the services run in a distributed infrastructure and • the Access Service is implemented by a web service, • the Analytics Service by a cluster computing service, • the Indexing Service by a NoSQL store, • the Data Verification Service by a blockchain and • the Storage Service by a distributed file system.

System according to Claim 1, wherein - the cluster computing service is implemented using the "MapReduce" programming model and/or - the system can be implemented as an "infrastructure-agnostic service" independently of the hardware.

System according to Claim 1 or 2, wherein an object within the distributed infrastructure is monitored throughout its entire life cycle in such a way that manipulation is ruled out by using a transaction bracket that is divided into four parts, namely • Write, • Read Repair, • Housekeeping Repair and • Audit Trail, and in particular
- during the "Write", an object is accepted by the Access Service, which encapsulates the following technical steps when receiving the document: i) Receive a document and calculate the SHA256 hash of its content as well as a unique GUID. ii) Write the GUID, the processing date and the hash to a blockchain, including a "type". iii) Simultaneously with step ii): store the raw data in the connected file system and then verify the hash from the file system against the hash from the blockchain. iv) Return OK or not OK to the client; and/or
- the "Read Repair" proceeds as follows: i) A client requests a document via a GUID. ii) The Access Service looks up the matching hash for the GUID in the blockchain and compares it with the raw data of the file system. iii) If the hash from the blockchain equals the calculated hash from the file store, the document is delivered; otherwise the client receives corresponding information; and/or
- the "Housekeeping Repair" proceeds like the "Read Repair", except that it is time-controlled and runs continuously as a background task that constantly compares the hash values of the blockchain with the raw data of the file store to ensure that no data is lost or corrupted unnoticed; and/or
- in the "Audit Trail", repairs, the complete deletion of raw data after expiry of the statutory retention period, and all other relevant events are each anchored as a separate event within the blockchain.
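For illustration only, the "Write" and "Read Repair" steps of this claim might be sketched as follows. The function names, the dictionary standing in for the file system, the list standing in for the blockchain and the status strings are assumptions of this example; steps ii) and iii) are also executed one after the other here rather than simultaneously.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def write_document(content, chain, file_store, doc_type="Write"):
    """Steps i) to iv) of the "Write"; returns (guid, ok)."""
    guid = str(uuid.uuid4())                      # i) unique GUID
    digest = hashlib.sha256(content).hexdigest()  # i) SHA-256 of the content
    chain.append({                                # ii) anchor in the blockchain
        "guid": guid,
        "date": datetime.now(timezone.utc).isoformat(),
        "hash": digest,
        "type": doc_type,
    })
    file_store[guid] = content                    # iii) store the raw data ...
    stored_digest = hashlib.sha256(file_store[guid]).hexdigest()
    return guid, stored_digest == digest          # ... verify, iv) OK / not OK


def read_document(guid, chain, file_store):
    """Check of the "Read Repair"; returns (content, status)."""
    entry = next((e for e in chain if e["guid"] == guid), None)
    raw = file_store.get(guid)
    if entry is None or raw is None:
        return None, "not found"
    if hashlib.sha256(raw).hexdigest() != entry["hash"]:
        return None, "hash mismatch"              # client is informed instead
    return raw, "ok"
```

A "Housekeeping Repair" task could reuse `read_document` by iterating over all GUIDs in the chain on a timer.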

System according to one of the preceding claims, which includes a "floating data domain model", this model being able to translate the legal requirements for data storage into technical operation by allowing the storage of raw data to be controlled in a targeted manner on the basis of defined rules, wherein in particular • in a first step, geolocations are selected that satisfy the applicable legal requirements for data storage, • in a second step, storage classes are defined for each use case, and • in a third step, a certain number of different storage classes is automatically used for the corresponding raw data set, while the defined geolocations are maintained.
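The three steps of the "floating data domain model" can be read as a rule-based placement function. The following sketch is one possible interpretation under stated assumptions: the `StorageTarget` type, the `plan_placement` function, its parameters and the example target names are all hypothetical, and the claim does not define what a "storage class" is or how many classes are required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTarget:
    name: str           # hypothetical identifier of a storage backend
    geolocation: str    # e.g. a country code
    storage_class: str  # e.g. "object", "file", "tape" (assumed class names)


def plan_placement(targets, allowed_geolocations, allowed_classes, min_classes):
    """Choose targets in `min_classes` different storage classes,
    using only targets located in a permitted geolocation."""
    chosen = {}
    for t in targets:
        if t.geolocation not in allowed_geolocations:  # step 1: legal geolocations
            continue
        if t.storage_class not in allowed_classes:     # step 2: classes per use case
            continue
        chosen.setdefault(t.storage_class, t)          # step 3: one target per class
    if len(chosen) < min_classes:
        raise ValueError("not enough distinct storage classes in permitted geolocations")
    return [chosen[c] for c in sorted(chosen)][:min_classes]
```

A caller would pass the rule set for one use case, for example the permitted countries and a required number of distinct storage classes, and write the raw data set to every returned target.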

System according to one of the preceding claims, which has at least one working memory, at least one processor system communicatively connected to the at least one working memory, at least one input device and one output device, and in particular a file system distributed in a cloud architecture, and/or which
- controls the required software components by using Docker containers in a Kubernetes environment, independent of location and across different cloud services, and/or which
- passes data to the indexing service in a "read after write" procedure, to ensure that each data record has actually been written before the responsible process can assume that the write operation was carried out correctly and completely, the procedure running as follows:
(a) Parse the data.
(b) Build the indexing structure.
(c) Write the structure to the indexing service.
(d) Read the indexing structure back after the write operation has ended.
(e) Compare the result of the read operation with the supplied input.
(f) If they are the same, complete the transaction successfully; if not, repeat the transaction.
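Steps (a) to (f) of the "read after write" procedure can be sketched as follows. This is an illustration under assumptions, not the patented implementation: `InMemoryIndex` is a hypothetical stand-in for the NoSQL indexing service, the input is assumed to be a JSON string, and the retry limit `max_attempts` is an addition of this sketch, since the claim only says the transaction is repeated.

```python
import json


class InMemoryIndex:
    """Hypothetical stand-in for the NoSQL indexing service."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = json.loads(json.dumps(doc))  # store an independent copy

    def get(self, key):
        return self._docs.get(key)


def index_read_after_write(index, key, raw, max_attempts=3):
    """Write an index entry and confirm it by reading it back."""
    fields = json.loads(raw)                     # (a) parse the data
    structure = {"id": key, "fields": fields}    # (b) build the indexing structure
    for _ in range(max_attempts):
        index.put(key, structure)                # (c) write to the indexing service
        stored = index.get(key)                  # (d) read back after the write
        if stored == structure:                  # (e) compare with the input
            return True                          # (f) same: transaction succeeded
    return False                                 # (f) still different after retries
```

Returning `False` after a bounded number of attempts lets the caller decide how to escalate a persistent mismatch instead of looping indefinitely.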

Method for the processing and storage of data requiring archiving, which comprises the services a) Access Service, b) Analytics Service, c) Indexing Service, d) Data Verification Service and e) Storage Service, wherein the services run in a distributed infrastructure and • the Access Service is implemented by a web service, • the Analytics Service by a cluster computing service, • the Indexing Service by a NoSQL store, • the Data Verification Service by a blockchain and • the Storage Service by a distributed file system.

Computer program product comprising computer-readable program instructions on a non-volatile computer-readable medium for executing the method according to Claim 6, which cause a computer or a cluster of computers to execute the steps of the method according to Claim 6.

Computer program product according to Claim 7, comprising computer-readable program instructions on a non-volatile computer-readable medium for executing the method according to Claim 6 which, when run on one computer or a cluster of computers, provide a system according to one of Claims 1 to 5.

Computer program product according to Claim 7 or 8, wherein the computer-readable medium is a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable CD-ROM, a DVD, a memory stick, a mechanically coded unit or any suitable combination of the aforementioned elements.

A computer data signal incorporated into a carrier wave transmitted in accordance with a suitable transmission method, the computer data signal comprising any embodiment of the computer program product according to one of Claims 7 to 9.